FIGURE 4.1 Major classes of library-construction methods used in protein engineering. The black and gray lines represent polypeptide sequences of the starting clone or clones, and the bullets represent mutations. (A) Random mutagenesis. (B) Recombination. (C) Site-directed diversification. (D) Scanning mutagenesis.

Scanning mutagenesis (Figure 4.1D), formally a type of site-directed diversification, can be used either to generate improved protein variants or to collect sequence-function correlations that will inform the construction of more complex and diverse libraries. Finally, libraries in which diversity is derived from naturally diverse gene sets present in an organism, tissue, or a complex environmental sample make possible the identification of naturally occurring proteins with a desired phenotype.

Whereas the simplest examples of the major classes of library-construction methods are easy to distinguish, more sophisticated versions of these methods share underlying concepts and methodology with each other. Some of the most successful diversity-oriented applications of protein engineering have exploited several different types of libraries. All of these library-construction methods will be discussed in detail below.

Random Mutagenesis

In this family of methods, the gene that encodes a starting protein is modified by introducing relatively random mutations—substitutions, deletions, and insertions— at random positions in the gene. This method mimics the introduction of such errors over the millions of years of natural evolution, but vastly increases the rate of mutagenesis by artificially increasing the error rate of DNA replication.

The earliest random-mutagenesis methods obtained the increase in error rate by exposing dividing cells to mutagenic conditions such as UV light, x-ray radiation, or chemical mutagens (Doudney and Haas 1959; Ong and De Serres 1972; Myers et al. 1985), or by propagating the gene of interest in mutator cell strains deficient in DNA repair (Cox 1976; Low et al. 1996; Nguyen and Daugherty 2003). In the last 20 years, these methods have been surpassed by polymerase chain reaction (PCR) performed under conditions of reduced replication fidelity (Leung et al. 1989; Cadwell and Joyce 1994), which has the advantages of speed, technical simplicity, and specificity to the gene of interest.

Any gene that can be amplified by standard PCR can also be randomized using error-prone PCR by changing buffer composition, manipulating ratios of free nucleotides, adding unnatural nucleotide analogs, or using polymerase mutants with a high propensity for incorporating errors (Cadwell and Joyce 1994; Zaccolo et al. 1996; Cirino et al. 2003). The mutagenesis rate can be manipulated by fine-tuning PCR conditions and the number of mutagenic PCR cycles, and can reach the rate of one mutation per five base pairs (Zaccolo and Gherardi 1999). Preassembled error-prone PCR kits that reliably mutagenize DNA at a set rate are available commercially from Clontech and Stratagene.
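The effect of a tunable per-base error rate can be illustrated with a toy simulation (a sketch only; the function name and the uniform-substitution model are assumptions, and real error-prone PCR is biased toward particular transitions and also produces occasional insertions and deletions):

```python
import random

def error_prone_copy(seq, error_rate, rng=random):
    """Copy a DNA sequence, substituting each base with probability error_rate.

    Toy model: substitutions only, chosen uniformly among the three
    non-wild-type bases; indels and transition/transversion bias are ignored.
    """
    bases = "ACGT"
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)
gene = "ATGGCTAGCAAAGGAGAAGAA" * 10           # toy 210-bp gene
mutant = error_prone_copy(gene, error_rate=0.2, rng=rng)
n_mutations = sum(a != b for a, b in zip(gene, mutant))
# At a 1-in-5 per-base error rate, roughly 40 substitutions are expected here.
```

Lowering `error_rate` toward realistic values (a few mutations per kilobase) corresponds to tuning buffer composition and cycle number in the laboratory.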

Since error-prone PCR introduces mutations throughout the DNA sequence being amplified with little or no positional bias, it is particularly well suited to applications where no information is available to direct diversification to particular "hot spots" in the sequence, positions where mutations would be the most likely to affect the phenotype of interest. Error-prone PCR tuned for a low rate of mutagenesis is commonly used to discover such hot spots (Takase et al. 2003; Hamamatsu et al. 2006). It is important to keep in mind that, while error-prone PCR is an excellent way to scan the length of the gene for promising positions to mutagenize, the mutations that it introduces are biased, especially at the protein level. This is because many amino-acid mutations require two or even three nucleotide mutations per codon, whereas error-prone PCR is most likely to introduce one nucleotide mutation per codon. Depending on the wild-type codon, only between three and seven amino-acid substitutions per amino acid are typically achieved by error-prone PCR (Wong et al. 2006), and the other twelve to sixteen substitutions are not sampled at all. Similarly, beneficial double or multiple mutations are sampled very rarely. In theory, the rate of occurrence of beneficial multiple mutations could be increased by increasing the overall mutagenesis rate, but in practice that is risky. Given that a random mutation is much more likely to be detrimental than beneficial, increasing the mutagenesis rate increases the risk of any beneficial multiple mutations being obscured by deleterious mutations occurring in the same clone. As a consequence, high mutagenesis rates are best combined with a recombination approach, described in the next section, that redistributes mutations in a wild-type background.
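The codon-level bias described above is easy to see by enumeration. The following sketch (helper names are assumptions; the genetic code is the standard one) lists the amino acids reachable from a given codon by exactly one nucleotide substitution:

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (TCAG order).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # TTT..TGG
               "LLLLPPPPHHQQRRRR"   # CTT..CGG
               "IIIMTTTTNNKKSSRR"   # ATT..AGG
               "VVVVAAAADDEEGGGG")  # GTT..GGG
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))

def single_step_substitutions(codon):
    """Amino acids reachable from `codon` by exactly one nucleotide change,
    excluding stop codons and the wild-type residue itself."""
    wt = CODON_TABLE[codon]
    reachable = set()
    for i in range(3):
        for b in BASES:
            if b != codon[i]:
                aa = CODON_TABLE[codon[:i] + b + codon[i + 1:]]
                if aa not in ("*", wt):
                    reachable.add(aa)
    return reachable

# From the single tryptophan codon TGG, one substitution reaches only five
# of the nineteen possible amino-acid replacements:
print(sorted(single_step_substitutions("TGG")))   # ['C', 'G', 'L', 'R', 'S']
```

Running the same function over all sense codons reproduces the three-to-seven range quoted above: the remaining replacements require two or three simultaneous nucleotide changes in the same codon, which error-prone PCR samples very rarely.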

Recombination

This family of methods mimics a second mechanism of natural evolution: the exchange of pieces between related genes using homologous recombination. In recombination-based library construction, two or more related starting genes are recombined, resulting in a library of variant genes with new combinations of sequences that were present in the starting gene. The starting genes can be either naturally occurring, closely related members of the same gene family or a mixture of naturally occurring genes and mutants generated in vitro, often by error-prone PCR.

The first in vitro recombination method described, DNA shuffling (Stemmer 1994a; Stemmer 1994b), is simple, powerful, and still widely used. Here the starting genes are randomly fragmented using DNase I, and the fragments are annealed and reassembled using PCR. Highly homologous but nonidentical fragments from different parental genes cross-anneal and are extended into longer, chimeric fragments, eventually leading to the construction of full-length novel genes that contain sequences from two or more different sources.
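The net effect of fragmentation and cross-annealing can be caricatured in a few lines (a deliberately loose sketch: the function name is an assumption, and the model presumes equal-length, pre-aligned parents, abstracting away the DNase I fragmentation and PCR reassembly themselves):

```python
import random

def shuffle_chimera(parents, n_crossovers, rng=random):
    """Toy model of DNA shuffling: fragment cross-annealing is approximated
    as switching templates at a few random homologous positions."""
    length = len(parents[0])
    points = sorted(rng.sample(range(1, length), n_crossovers))
    segments, start = [], 0
    for end in points + [length]:
        segments.append(rng.choice(parents)[start:end])
        start = end
    return "".join(segments)

rng = random.Random(1)
parent_a, parent_b = "A" * 60, "B" * 60   # stand-ins for two homologous genes
chimera = shuffle_chimera([parent_a, parent_b], n_crossovers=3, rng=rng)
```

In real shuffling the crossover positions are not chosen explicitly; they emerge from fragment size and local homology, which is why smaller fragments yield more crossovers.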

When compared to library-generation methods that involve random mutagenesis (described in the preceding section) or saturation mutagenesis of defined parts of the starting gene (described in the next section), standard DNA shuffling from related, naturally occurring genes (Crameri et al. 1998) is a relatively conservative method, as it combines pieces of related functional proteins, generating new sequences that have a relatively high probability of being compatible with the desired protein structure and function. As with any method that relies on PCR-like DNA replication, DNA shuffling does incorporate a low level of random mutagenesis due to imperfect fidelity of even high-fidelity PCR. In addition, error-prone PCR can be employed during fragment assembly to add random point mutagenesis to recombination (Stemmer 1994a). Finally, DNA shuffling and other recombination-based methods are an excellent way of counteracting one of the weaknesses of randomization by error-prone PCR: A library with a high density of random mutations introduced by error-prone PCR can be back-crossed by shuffling with excess of the original wildtype gene, thus separating beneficial mutations from deleterious and neutral ones (Stemmer 1994a).

Numerous alternative DNA-recombination methods have been reported:

In the staggered extension process (StEP) (Zhao et al. 1998), recombination of genetic information between several starting genes occurs when extension of the growing DNA strand from the first template is interrupted before a full-length gene is copied. The mixture of DNA template and product is denatured, re-annealed, and re-extended, allowing the growing strand to anneal to a different, homologous template, thus combining sequences from two or more templates. Extension time in each cycle, rather than the size of parental fragments in DNA shuffling, controls the frequency of crossover events. Another related method, random chimeragenesis on transient templates (RACHITT) (Coco et al. 2001), uses a full-length, uracil-containing template to assemble complementary, short, single-stranded fragments copied from other templates; the uracil-containing template is eventually degraded. RACHITT has the advantage of allowing a high frequency of recombination between genes with low homology.
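The relationship between extension time and crossover frequency in StEP can be sketched as follows (names and the fixed bases-per-cycle abstraction are assumptions; in the real method extension length is set by annealing and extension kinetics, not a fixed base count):

```python
import random

def step_recombine(templates, bases_per_cycle, rng=random):
    """Toy StEP: each cycle the growing strand re-anneals to a randomly
    chosen homologous template and is extended by a short, fixed stretch."""
    length = len(templates[0])
    strand = ""
    while len(strand) < length:
        template = rng.choice(templates)
        strand += template[len(strand):len(strand) + bases_per_cycle]
    return strand

rng = random.Random(2)
parents = ["A" * 60, "B" * 60]
# Shorter extension per cycle -> more template switches -> more crossovers.
fine = step_recombine(parents, bases_per_cycle=5, rng=rng)
coarse = step_recombine(parents, bases_per_cycle=30, rng=rng)
```

With 30 bases per cycle the 60-base product is built in two extensions and can contain at most one crossover, whereas 5 bases per cycle allows up to eleven.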

In contrast to the previously mentioned methods, incremental truncation for the creation of hybrid enzymes (ITCHY) (Lutz et al. 2001a; Ostermeier and Lutz 2003) achieves recombination without relying on hybridization of homologous DNA, and thus can recombine genes with little or no sequence homology. In this method, fragments of two or more gene templates are generated by digesting the templates from the 5' or 3' end, and then ligating the resulting fragments. Whereas this method allows recombination of any genes, it does not guide the recombination between analogous parts of two genes and is thus less likely to generate functional proteins than are homology-guided recombination methods. Elements of ITCHY and DNA shuffling are combined in a method named SCRATCHY (Lutz et al. 2001b).

Whereas DNA shuffling, StEP, RACHITT, and ITCHY generate recombination sites at random homologous sites, several recombination methods have been developed that use synthetic oligonucleotides to guide recombination to a specific site or sites in the starting genes. Degenerate oligonucleotide gene shuffling (DOGS) (Gibbs et al. 2001) achieves this aim by using amplification primers that contain two regions that recognize two different templates. Direct amplification of one template in the first round of PCR results in a fragment that anneals to a fragment amplified from the second template in the second round of PCR. The ultimate control of crossover frequency and location is achieved when a set of synthetic oligos itself encodes amino-acid residues from different parental genes (Ness et al. 2002; Zha et al. 2003). In such a case, no physical template is required, and the library is generated synthetically using a design based on a number of parental sequences. In addition to allowing complete control of recombination events, the use of synthetic oligonucleotides allows the introduction of any additional desired point mutations or randomized regions, thus including elements of site-directed diversification (described in the next section). Alternatively, in a method named sequence-independent site-directed chimeragenesis (SISDC) (Hiraga and Arnold 2003), synthetic oligonucleotides can be used to introduce into the starting genes tags containing restriction endonuclease sites, which then direct the fragmentation of the starting genes and their reassembly by ligation.

Site-Directed Diversification

In this collection of methods, diversification is directed to a specific position or set of positions, and the remaining protein sequence is fixed as wild-type. In its classical form, known as site-directed randomization, an oligonucleotide that spans the codon or codons of interest is synthesized in vitro, and each wild-type codon of interest is replaced by a mixture of codons (Georgescu et al. 2003; Steffens and Williams 2007). If only a few clustered codons are being diversified, the oligonucleotide or oligonucleotides containing the desired mutations can be incorporated into the starting gene by site-directed mutagenesis, such as the PCR-based method of the Stratagene QuikChange kit (Miyazaki 2003; Zheng et al. 2004). If multiple codons are being diversified, the desired diversified gene product can be assembled from a mixture of constant and diversified oligonucleotides using PCR-based gene assembly (Ho et al. 1989; Stemmer et al. 1995; Bessette et al. 2003), ligation (Hughes et al. 2003), or a combination of the two methods.

The most widely used method of obtaining a site-directed mixture of codons is to synthesize a set of oligonucleotides where the wild-type bases in each codon of interest are replaced by mixtures of nucleotides described as NNN, NNS, or NNK, where N represents an equimolar mixture of A, G, C, and T; S represents an equimolar mixture of G and C; and K represents an equimolar mixture of G and T. Each of these three codon patterns encodes a mixture of all 20 naturally occurring amino-acid residues as well as translational stop codons, but the NNS and NNK patterns have the advantage of encoding stop codons at a frequency of 1/32 rather than 3/64 for NNN, thus reducing the proportion of truncated proteins encoded by the library.
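These codon counts and stop-codon frequencies can be verified by direct enumeration (a sketch; the helper names are assumptions, the IUPAC degenerate-base definitions and genetic code are standard):

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))
IUPAC = {"N": "ACGT", "S": "CG", "K": "GT"}

def codon_mixture(pattern):
    """All concrete codons encoded by a degenerate pattern such as 'NNK'."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in pattern))]

for pattern in ("NNN", "NNS", "NNK"):
    codons = codon_mixture(pattern)
    stops = sum(CODON_TABLE[c] == "*" for c in codons)
    n_aas = len({CODON_TABLE[c] for c in codons} - {"*"})
    print(pattern, len(codons), stops, n_aas)
# NNN: 64 codons, 3 stops; NNS and NNK: 32 codons, 1 stop; all encode 20 amino acids
```

Of the three stop codons (TAA, TAG, TGA), only TAG ends in S (C/G) and only TAG ends in K (G/T), which is where the 1/32 figure comes from.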

Due to the redundancy and coding imbalance of the genetic code, site-saturation libraries using NNN, NNS, or NNK encode amino acids with different probabilities (Figure 4.2), but codon bias of NNS- or NNK-encoded site-directed diversity is much lower than that of error-prone PCR.

The major limitation of NNS- and NNK-encoded diversification is their demand on physical library size. Diversification of n positions can generate 32^n possible combinations of codons, and the exponential increase of possible sequences with the number of positions diversified means that the number of possible unique codon combinations defined by a particular design quickly overwhelms the capacity of physical library construction methods or the screening and selection methods to test them (Figure 4.3). An additional pressure on physical library size comes from statistical considerations in sampling: A library must be oversampled threefold to ensure a 95% probability that any one of its unique clones will be included at least once, and oversampled 10- to 25-fold to ensure a greater than 99% probability of capturing the entire library (Bosley and Ostermeier 2005). As a consequence, only libraries with relatively few NNS- or NNK-diversified positions can be sampled thoroughly. As
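The sampling arithmetic behind the threefold-oversampling rule of thumb can be checked directly (a sketch; the function name is an assumption, and sampling is modeled as uniform draws with replacement from an equally represented library):

```python
# For a library of n_unique equally represented clones sampled with
# replacement, the chance that one particular clone is seen at least once
# after k = oversampling * n_unique draws is 1 - (1 - 1/n_unique)**k,
# which for threefold oversampling approaches 1 - e**-3, about 0.95.
def p_clone_sampled(n_unique, oversampling):
    k = n_unique * oversampling
    return 1.0 - (1.0 - 1.0 / n_unique) ** k

n = 32 ** 4     # four NNK-randomized codons -> 32^4, about 1.05 million designs
print(round(p_clone_sampled(n, 3), 3))   # 0.95
```

Real libraries are worse than this idealized picture, since clone representation is never perfectly uniform, which is one motivation for the 10- to 25-fold figures quoted above.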

