Biotechnol Prog. 2008 Jan-Feb;24(1):62-73.
doi: 10.1021/bp070134h. Epub 2007 Nov 17.

Computationally mapping sequence space to understand evolutionary protein engineering

Kathryn A Armstrong et al. Biotechnol Prog. 2008 Jan-Feb.

Abstract

Evolutionary protein engineering has been dramatically successful, producing a wide variety of new proteins with altered stability, binding affinity, and enzymatic activity. However, the success of such procedures is often unreliable, and the impact of the choice of protein, engineering goal, and evolutionary procedure is not well understood. We have created a framework for understanding aspects of the protein engineering process by computationally mapping regions of feasible sequence space for three small proteins using structure-based design protocols. We then tested the ability of different evolutionary search strategies to explore these sequence spaces. The results point to a non-intuitive relationship between the error-prone PCR mutation rate and the number of rounds of replication. The evolutionary relationships among feasible sequences reveal hub-like sequences that serve as particularly fruitful starting sequences for evolutionary search. Moreover, genetic recombination procedures were examined, and tradeoffs relating sequence diversity and search efficiency were identified. This framework allows us to consider the impact of protein structure on the allowed sequence space and therefore on the challenges that each protein presents to error-prone PCR and genetic recombination procedures.

Figures

Figure 1
(A) Crystal structures of the bovine pancreatic trypsin inhibitor (BPTI), pin1 WW domain (WW), and an immunoglobulin G binding domain from streptococcal protein G (B-domain) (PDB codes 1BPI, 1F8A, and 1IGD at 1.09, 1.84, and 1.10 Å resolution, respectively (–48)). Core residues allowed to mutate are colored blue. (B) Sequence-space graphs for BPTI, WW, and B-domain with the five largest components colored in red, orange, yellow, green, and blue, in decreasing size order. The remaining vertices are colored purple. Component sizes are given in Table 2.
Figure 2
Degree distributions for the sequence space graphs shown in Figure 1B. The degree of a node is the number of edges touching that node in its sequence-space graph.
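The graph quantities in Figures 1B and 2 can be illustrated with a minimal sketch. It assumes that two sequences share an edge when they differ at exactly one amino-acid position; the captions do not state the edge rule, so this is an assumption, and the toy sequences and function names below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(seqs):
    """Adjacency sets; ASSUMED edge rule: exactly one differing position."""
    adj = {s: set() for s in seqs}
    for a, b in combinations(seqs, 2):
        if hamming(a, b) == 1:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


def components(adj):
    """Connected components, largest first (iterative depth-first search)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


# Hypothetical three-residue "core" sequences, for illustration only.
toy = ["AVL", "AVI", "AII", "LVL", "GGG"]
graph = build_graph(toy)
dist = degree_distribution(graph)
comps = components(graph)
```

The all-pairs comparison is quadratic in the number of sequences, which is fine for a toy set but would need a neighbor-enumeration strategy for large sequence spaces.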
Figure 3
Histograms of the number of amino-acid mutations per sequence, after 9, 12, 15, or 18 rounds of error-prone PCR at five different DNA mutation probabilities. The mean and standard error bars are shown for simulations on 10,000 starting sequences with 17 amino-acid residues.
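As a rough illustration of how a per-base DNA mutation probability translates into amino-acid mutations over several rounds, the sketch below follows a single lineage through repeated rounds of mutation. It is a simplification under stated assumptions, not the authors' simulation: substitutions are uniform over the three alternative bases, there are no insertions or deletions, no population doubling is modeled, and a change to a stop codon counts as an amino-acid change. All function names are hypothetical.

```python
import random

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def mutate_round(dna, p, rng):
    """One round: each base is substituted independently with probability p."""
    out = []
    for base in dna:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def error_prone_pcr(dna, rounds, p, rng):
    """Follow a single lineage through the given number of rounds."""
    for _ in range(rounds):
        dna = mutate_round(dna, p, rng)
    return dna


def aa_mutations(parent_dna, child_dna):
    """Count positions where the translated sequences differ."""
    parent, child = translate(parent_dna), translate(child_dna)
    return sum(a != b for a, b in zip(parent, child))


rng = random.Random(0)
start = "ATG" * 17  # hypothetical 17-residue starting gene
counts = [aa_mutations(start, error_prone_pcr(start, 15, 0.01, rng))
          for _ in range(1000)]
mean_mutations = sum(counts) / len(counts)
```

Repeating the last three lines over a grid of round counts and mutation probabilities would give histograms of the kind the caption describes, subject to the simplifications above.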
Figure 4
Variation of the number of unique sequences identified by search as a function of the mutation rate and the number of rounds of error-prone PCR. All simulations started with sequences computed to be foldable for the WW-domain. Error-prone PCR simulations were performed from random single starting sequences computed to be foldable, at 5 different mutation rates and for 9, 12, 15, and 18 rounds of error-prone PCR. All results are mean per trial over 10,000 trials. (A) The number of unique protein sequences generated. (B) The fraction of unique protein sequences within the total number of sequences generated. (Peaks still exist in these curves at a mutation rate of 0.01 but are not visible due to the log scale.) (C) The number of unique protein sequences seen after a screening limit of 10^5 sequences is imposed.
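The three quantities in panels A to C can be computed from a simulated library as sketched below. The caption does not say how the screening limit is applied; this sketch assumes it means that at most a fixed number of sequences are drawn uniformly at random from the library and examined. The function name and toy library are hypothetical.

```python
import random


def library_stats(library, screen_limit, rng):
    """Return (unique count, unique fraction, unique count after screening).

    ASSUMPTION: screening draws at most `screen_limit` sequences uniformly
    at random, without replacement, from the full library.
    """
    n_unique = len(set(library))
    fraction_unique = n_unique / len(library)
    if len(library) <= screen_limit:
        screened = library
    else:
        screened = rng.sample(library, screen_limit)
    return n_unique, fraction_unique, len(set(screened))


rng = random.Random(0)
toy_library = ["AA", "AA", "AB", "AC"]  # hypothetical protein sequences
stats_all = library_stats(toy_library, 10, rng)
stats_one = library_stats(toy_library, 1, rng)
```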
Figure 5
(A) The number of unique protein sequences generated after 15 rounds of error-prone PCR simulation with the results recorded separately by the degree of the starting sequence in its sequence-space graph. Standard error bars over 10,000 trials are shown. (B–D) The number of functional protein sequences generated after 15 rounds of error-prone PCR simulation. The fraction of the foldable sequence space that is functional is approximately 0.01, 0.1, and 0.3 for plots B, C, and D, respectively. The results are recorded separately by the degree of the starting sequence in its sequence-space graph, and standard deviation bars over 1000 random selections of functional sequences over 10,000 error-prone PCR trials are shown.
Figure 6
Evolvability of sequence pairs with one mutation and different energies. (A) For pairs of sequences with a mutation at only one position, we asked whether the sequence with the lower or higher computed energy was more evolvable. This evolvability is defined as the number of sequences remaining in the sequence-space graph when the variable position in the pair is constrained. This was done for all pairs of sequences found in the sequence-space graphs that had only a single mutation between them, and the results are separated by each core position in order by the residue number in the structure. (B) In BPTI the core positions are 4, 6, 10, 18, 20, 21, 22, 23, 24, 25, 33, 35, 36, 43, 44, 45, and 47. (C) In WW-domain the core positions are 9, 11, 15, 17, 20, 23, 27, 28, 29, 36, 37, and 38. (D) In B-domain the core positions are 8, 10, 12, 14, 31, 35, 39, 44, 48, 57, and 59.
Figure 7
The distribution of unique foldable sequences found in each evolutionary simulation as a function of the number of mutations from the starting sequence. (A) Fraction of unique foldable sequences found in each simulation of 15 rounds of error-prone PCR with a mutation probability of 0.01 per base-pair per generation. Each simulation started with one sequence and after each trial the distance of each unique foldable resulting sequence to this one sequence was computed. (B) Fraction of unique foldable sequences found in each genetic recombination simulation started with 2 randomly-chosen foldable sequences. These sequences were cut once and recombined to produce 2 sequences. For each of the two resulting sequences, the minimum distance from each resulting sequence to a starting sequence was computed. (C) Fraction of unique foldable sequences found in each genetic recombination simulation started with 3 randomly chosen foldable sequences. These sequences were cut twice at random and recombined such that 6 sequences were produced. These 6 sequences contained all possible combinations of the three starting sequences such that some fragment from each starting sequence was part of each resulting sequence. The minimum distance from each resulting sequence to the closest starting sequence was computed.
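The two recombination schemes in panels B and C can be sketched as follows. The cut positions are passed in explicitly so the example is deterministic, whereas the caption describes random cuts. Distance is taken here as Hamming distance on the sequence strings, which is an assumption because the caption does not specify the metric. Function names and toy parents are hypothetical.

```python
from itertools import permutations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def single_crossover(a, b, cut):
    """Cut two parents once at the same position and swap the tails."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def double_crossover_three(parents, cut1, cut2):
    """Cut three parents twice; build all 6 children in which each parent
    contributes exactly one of the three fragments."""
    return [p[:cut1] + q[cut1:cut2] + r[cut2:]
            for p, q, r in permutations(parents)]


def min_distance_to_parents(child, parents):
    """Distance from a child to its closest parent."""
    return min(hamming(child, p) for p in parents)


# Hypothetical parent sequences, for illustration only.
two_children = single_crossover("AAAA", "BBBB", 1)
six_children = double_crossover_three(["AAAAAA", "BBBBBB", "CCCCCC"], 2, 4)
closest = min_distance_to_parents("ABBB", ["AAAA", "BBBB"])
```

In a full simulation each child would also be checked against the set of foldable sequences before its distance is recorded.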
Figure 8
Genetic recombination results at each distance between parent sequences. Each genetic recombination trial started with two parent sequences computed to be foldable; the sequences were cut once at a random location and recombined. The number of sequences seen per trial was plotted at each DNA distance per residue between the parent sequences. Standard error bars over 10,000 trials are shown. (A) Only sequences that were computed to be foldable and were different from either parent sequence are included. (B) Only sequences that were computed to be foldable, were different from either parent sequence, and were in a different graph component from that of either parent sequence are included.
