Review
Chem Soc Rev. 2015 Mar 7;44(5):1172-239.
doi: 10.1039/c4cs00351a.

Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently

Andrew Currin et al. Chem Soc Rev.

Abstract

The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the 'search space' of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (Kd) and catalytic (kcat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving kcat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the 'best' amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes. 
Because known landscapes may be assessed and reasoned about as a whole, simultaneously, this offers opportunities for protein improvement not readily available to natural evolution on rapid timescales. Intelligent landscape navigation, informed by sequence-activity relationships and coupled to the emerging methods of synthetic biology, offers scope for the development of novel biocatalysts that are both highly active and robust.
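
To make the scale of the 'search space' concrete, the following is a minimal sketch that counts sequences, assuming the 20 standard amino acids and a fixed protein length. The function names are illustrative and do not come from the review.

```python
from math import comb


def sequence_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of a given length (20 amino acids by default)."""
    return alphabet_size ** length


def variants_within(length: int, max_mutations: int, alphabet_size: int = 20) -> int:
    """Number of sequences that differ from one parent at up to max_mutations positions."""
    return sum(
        comb(length, k) * (alphabet_size - 1) ** k
        for k in range(max_mutations + 1)
    )


if __name__ == "__main__":
    # A 100-residue protein: the full space versus the near neighbourhood of one parent.
    print(f"all sequences of length 100: {sequence_space_size(100):.3e}")
    for m in (1, 2, 3):
        print(f"parent plus up to {m} substitutions: {variants_within(100, m):,}")
```

Even the neighbourhood of a single parent grows quickly with the number of simultaneous substitutions, which is why exhaustive testing is not an option and some strategy for choosing which variants to make is needed.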

Figures

Fig. 1. Relationship between amino acid sequence, 3D structure (and dynamics) and biocatalytic activity. Implicitly, there is a host in which these manipulations take place (or they may be done entirely in vitro). This is not a major focus of the review. Typically, a directed evolution study concentrates on the relationships between protein sequence, structure and activity, and the usual means for assessing these are outlined (within the boxes). Many methods are available to connect and rationalise these relationships and some examples are shown (grey boxes). Thorough directed evolution studies require understanding of each of these parameters so that the changes in protein function can be rationalised, thereby to allow effective search of the sequence space. The key is to use emerging knowledge from multiple sources to navigate the search spaces that these represent. Although the same principles apply to multi-subunit proteins and protein complexes, most of what is written focuses on single-domain proteins that, like ribonuclease, can fold spontaneously into their tertiary structures without the involvement of other proteins, chaperones, etc.
Fig. 2. The essential components of an evolutionary system. At the outset, a starting individual or population is selected, and one or more fitness criteria that reflect the objective of the process are determined. Next, the ability to rank these fitnesses and to select for diversity is created (by breeding individuals with variant sequences, introduced typically by mutation and/or recombination) in a way that tends to favour fitter individuals. This is repeated iteratively until a desired criterion is met.
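
The loop described in this caption can be sketched in a few lines. This is a toy illustration only: the fitness function (number of positions matching a hypothetical target sequence), the mutation rate and the population sizes are all assumptions made for the example, and a real study would measure fitness experimentally.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_fitness(seq: str, target: str) -> int:
    """Hypothetical fitness: how many positions match an assumed target."""
    return sum(a == b for a, b in zip(seq, target))


def mutate(seq: str, rng: random.Random, rate: float = 0.05) -> str:
    """Replace each residue by a random amino acid with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else aa for aa in seq)


def evolve(start, target, pop_size=50, n_keep=10, max_generations=200, seed=0):
    """Rank, select the fittest, create diversity by mutation, and repeat."""
    rng = random.Random(seed)
    population = [start] * pop_size
    history = []  # best fitness seen in each generation
    for _ in range(max_generations):
        ranked = sorted(population, key=lambda s: toy_fitness(s, target), reverse=True)
        best = ranked[0]
        history.append(toy_fitness(best, target))
        if best == target:  # the "desired criterion" of this toy example
            break
        parents = ranked[:n_keep]  # selection favours fitter individuals
        offspring = [mutate(rng.choice(parents), rng) for _ in range(pop_size - n_keep)]
        population = parents + offspring  # parents are kept, so the best is never lost
    return best, history


if __name__ == "__main__":
    best, history = evolve("AAAAAAAAAA", "MKTAYIAKQR")
    print(best, history[0], history[-1], len(history))
```

Keeping the selected parents in the next generation is a design choice of this sketch; it guarantees that the best fitness never decreases between rounds.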
Fig. 3. A ‘mind map’ of the contents of this paper; to read this start at “twelve o'clock” and read clockwise.
Fig. 4. An example of the basic elements of a mixed computational and experimental programme in directed evolution. Implicit are the choice of objective function (e.g. a particular catalytic activity with a certain turnover number) and the starting sequences that might be used with an initial or ‘wild type’ activity from which one can evolve improved variants. The core experimental (blue) and computational (red) aspects are shown as seven steps of an iterative cycle involving the creation and analysis of appropriate protein sequences and their attendant activities. Additional facets that can contribute to the programme are also shown (connected using dotted lines).
Fig. 5. A fitness landscape and its navigation. The initial or wild-type activity denotes the starting point (initialisation) for a directed evolution study (red circle). Accumulation of mutations that increase activity is represented by four routes to different positions in the landscape. Route 1 successfully increases activity through a series of additive mutations, but becomes stuck in a local optimum. Due to the nature of rugged fitness landscapes some of the shorter paths to a maximum possible (global optimum) fitness (activity) can require movement into troughs before navigating a new higher peak (route 2). Alternatively, one can arrive at the global optimum using longer but typically less steep routes without deep valleys (equivalent over flat ground to neutral mutations – routes 3 and 4).
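
The behaviour of route 1 (climbing by additive improvements and becoming stuck) can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional landscape with made-up heights, whereas real sequence landscapes are high-dimensional; the names `LANDSCAPE` and `greedy_climb` are hypothetical.

```python
# Made-up activities along a single axis: a local peak at index 2, the global peak at index 6.
LANDSCAPE = [1, 3, 5, 4, 2, 6, 9, 7]


def greedy_climb(heights, start):
    """Move to the better adjacent position until no neighbour is higher."""
    pos = start
    path = [pos]
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(heights)]
        best = max(neighbours, key=lambda p: heights[p])
        if heights[best] <= heights[pos]:
            return path  # no uphill step left: a local (or global) optimum
        pos = best
        path.append(pos)


if __name__ == "__main__":
    print(greedy_climb(LANDSCAPE, 0))  # stops on the local peak
    print(greedy_climb(LANDSCAPE, 4))  # starts in the trough and reaches the global peak
```

A strictly uphill search started at index 0 never crosses the trough at index 4, which is the situation the caption describes for route 1; accepting neutral or temporarily worse steps is what routes 2 to 4 rely on.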
Fig. 6. A two-objective optimisation problem, illustrating the non-dominated or Pareto front. In this case we wish to maximise both objectives. Each individual symbol is a candidate solution (i.e. protein sequence), with the filled ones denoting an approximation to the Pareto front.
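
A minimal sketch of how the non-dominated set can be extracted when both objectives are to be maximised. The candidate scores are invented for illustration (for instance, activity and stability on arbitrary scales), and the quadratic scan is chosen for clarity rather than speed.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (objective 1, objective 2) scores for six variants.
    candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (4, 1), (1, 1)]
    print(pareto_front(candidates))
```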
Fig. 7. Some evolutionary trajectories of a peptide sequence undergoing mutation. Mutations in the peptide sequence can cause an increase in fitness (e.g. enzyme activity, green), loss of fitness (salmon pink) or no change in fitness (grey). Typically, improved fitness mutations are selected for and subjected to further modification and selection. Neutral mutations keep sequences ‘alive’ in the series, and these can often be required for further improvements in fitness, as shown in steps 2 and 3 of this trajectory.
Fig. 8. The ‘cycle of knowledge’ in modern directed evolution. Both structure-based design and a more empirical data-driven approach can contribute to the evolution of a protein with improved properties, in a series of iterative cycles.
Fig. 9. Overview of the different mutagenesis strategies commonly employed to create variant protein libraries. Random methods (pink background) can create the greatest diversity of sequences in an uncontrolled manner. Mutations during error-prone PCR (A) are typically introduced by a polymerase amplifying sequences imperfectly (by being used under non-optimal conditions). In contrast, directed mutagenesis methods (blue background) introduce mutations at defined positions and with a controlled outcome. Site-directed mutagenesis (B) introduces a mutation, encoded by oligonucleotides, onto a template gene sequence in a plasmid. However, gene synthesis (C) can encode mutations on the oligonucleotides used to synthesise the sequence de novo, hence multiple mutations can be introduced simultaneously. X = random mutation, N = controlled mutation, → = PCR primer.
Fig. 10. A Boston matrix of the different strategies for variant libraries. Methods are identified in terms of the randomness of the mutations they create and the number of residues that can be targeted.
Fig. 11. Examples of some of the common degenerate codons used in DE studies. A codon containing specific mixed bases is used to encode a particular set of amino acids, ranging from all twenty amino acids (NNN or NNK) to those with particular properties. Hence, choice of degenerate codons to use depends on the design and objective of the study. In the IUPAC terminology K = G/T, M = A/C, R = A/G, S = C/G, W = A/T, Y = C/T, B = C/G/T, D = A/G/T, H = A/C/T, V = A/C/G, N = A/C/G/T. (*Typically with low codon usage; suppressor mutation may be used to block it. **Typically with low codon usage, especially in yeast; suppressor mutation may be used to block it).
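
The mixed-base codes listed in this caption can be expanded mechanically. The sketch below uses the IUPAC codes as given above together with the standard genetic code to list which amino acids a degenerate codon encodes; the function names are illustrative, and organism-specific codon usage is not modelled.

```python
from itertools import product

# IUPAC mixed-base codes, as listed in the caption.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "GT", "M": "AC", "R": "AG", "S": "CG", "W": "AT", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand(degenerate: str):
    """List every concrete codon represented by a degenerate codon such as 'NNK'."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate.upper()))]


def encoded(degenerate: str):
    """Map each encoded amino acid (or '*') to the number of codons that give it."""
    counts = {}
    for codon in expand(degenerate):
        aa = CODON_TABLE[codon]
        counts[aa] = counts.get(aa, 0) + 1
    return counts


if __name__ == "__main__":
    for dc in ("NNN", "NNK", "YAT"):
        result = encoded(dc)
        print(dc, len(expand(dc)), "codons ->", "".join(sorted(result)))
```

Comparing NNN with NNK shows the usual motivation for the latter: half as many codons and fewer stop codons, while all twenty amino acids remain accessible.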
Fig. 12. The traditional recombination method for diversity creation. Recombination requires a sample of different variants of a gene (parents), which can be derived from a family of homologous genes or generated by random mutagenesis methods. The random fragmentation of these genes (using DNase I or other method) cleaves them into small constituent parts. Importantly, as the parental genes are all homologous, the fragments overlap in sequence thus allowing them to be reassembled by overlap extension PCR (OE-PCR) producing products that encode a random mixture of the parental genes. A key advantage of recombination methods is the improved ability to create combinatorial mutations. This is illustrated using two mutations (present in two different parental sequences) that when recombined separately produce no fitness improvement, but when combined together produce a variant with improved fitness.
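
The combinatorial advantage described in this caption can be shown with a toy example. It simplifies recombination to a single crossover between two protein sequences (real shuffling acts on DNA fragments and can give several crossovers), and the sequences, positions and fitness values are all invented for illustration.

```python
def crossover_children(p1: str, p2: str):
    """All products of a single crossover between two equal-length parents."""
    children = set()
    for cut in range(1, len(p1)):
        children.add(p1[:cut] + p2[cut:])
        children.add(p2[:cut] + p1[cut:])
    return children


def epistatic_fitness(seq: str) -> float:
    """Hypothetical fitness: improved only when both mutations are present together."""
    return 2.0 if seq[1] == "R" and seq[4] == "W" else 1.0


if __name__ == "__main__":
    wild_type = "MKTAYI"
    parent_1 = "MRTAYI"  # carries the first mutation only
    parent_2 = "MKTAWI"  # carries the second mutation only
    for child in sorted(crossover_children(parent_1, parent_2)):
        print(child, epistatic_fitness(child))
```

Neither parent is fitter than the wild type in this toy model, so a search that accepts only single improving mutations would discard both; recombination produces the double mutant directly.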
Fig. 13. GeneGenie and SpeedyGenes: synthetic biology tools for the purposes of directed evolution. The integration of computational design and accurate gene synthesis methodology provide a strong platform that can be utilised for directed evolution. As an example, the design, synthesis and screening of a small library of EGFP variants is shown. Mixed base codons are used to encode the green and blue variants of EGFP in a single library. (A) GeneGenie (www.gene-genie.org/) designs overlapping oligonucleotides for a given protein together with any specific mixed base codon (here YAT denoting C/T,A,T). (B) SpeedyGenes assembles the gene sequence using these oligonucleotides, accurately (using error correction) producing variant libraries with the desired mutations. (C) Direct expression (no pre-selection) of the library in E. coli yielded colonies with the desired mutations (green or blue fluorescence).
Fig. 14. The principle of genetic selection, here illustrated with a transporter gene knockout mutant that, in competition with others, does not take up toxic levels of an otherwise cytotoxic drug D.
Fig. 15. The principles of building and testing a machine learning model, illustrated here with a QSAR model. We start with paired inputs and outputs (here sequences and activities) and learn a nonlinear mapping between the two. Methods for doing this that we have found effective include genetic programming and random forests. In a second phase, the learned model is used to make predictions on an unseen validation and/or test set to establish that the model has generalized well.
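
The two phases in this caption (learn on paired sequences and activities, then check predictions on unseen data) can be sketched as follows. To stay self-contained, the sketch swaps in a 1-nearest-neighbour predictor based on Hamming distance as a simple stand-in; it is not genetic programming or a random forest, and the data set is synthetic (activity is assumed to be the number of G residues).

```python
import random
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def fit_nearest_neighbour(train):
    """'Learn' by memorising the training pairs; predict from the closest sequence."""
    def predict(seq):
        _, activity = min(train, key=lambda pair: hamming(pair[0], seq))
        return activity
    return predict


def rmse(model, data) -> float:
    """Root-mean-square error of the model on (sequence, activity) pairs."""
    return (sum((model(s) - y) ** 2 for s, y in data) / len(data)) ** 0.5


def split(data, test_fraction=0.25, seed=0):
    """Shuffle and hold out a fraction of the data that the model never sees."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Synthetic sequence-activity pairs: all 16 four-residue sequences over A and G.
    data = [("".join(s), float(s.count("G"))) for s in product("AG", repeat=4)]
    train, test = split(data)
    model = fit_nearest_neighbour(train)
    print("training error:", rmse(model, train))
    print("held-out error:", rmse(model, test))
```

The training error of such a memorising model is zero by construction, which is exactly why the held-out error is the quantity that indicates whether a model has generalised.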
Fig. 16. A standard representation of an energy diagram for enzyme catalysis. Substrate binding is thermodynamically favourable, but to effect the catalytic reaction thermal energy is used to take the reaction to the right, often shown as a barrier represented by one or more ‘transition states’. Changes in the Km and Kd (affecting substrate affinity) can be influenced most directly by mutagenesis of the residues at the active site whilst changes in the kcat occur primarily from mutagenesis of residues away from the active site (which can affect the fluctuations in enzyme structure required either for crossing the transition state ES or by tunnelling under the barrier). At all points there are multiple roughly iso-energetic conformational (sub)states. Figure based on elements of those in ref. 930 and 966.
Fig. 17. The residues that influence kcat tend to be distributed throughout an enzyme. The amino acid side chains of each of the 24 mutations obtained by the directed evolution of a Diels-Alderase (PDB: ; 4O5T) are highlighted. The active site pocket is shown in grey, while all mutated residues within 5 Å of the ligand (blue) are differentiated from those more than 5 Å away (red). This illustrates that the majority of mutations influencing kcat are not in close proximity to the active (substrate-binding) site. The figure was prepared using PyMol.
Andrew Currin
Neil Swainston
Philip J. Day
Douglas B. Kell

References

    1. Kell D. B. BioEssays. 2012;34:236–244.
    2. Phylogenetic analysis of DNA sequences, ed. M. M. Miyamoto and J. Cracraft, Oxford University Press, Oxford, 1991.
    3. Page R. D. M. and Holmes E. C., Molecular evolution: a phylogenetic approach, Blackwell Science, Oxford, 1998.
    4. Harms M. J., Thornton J. W. Nat. Rev. Genet. 2013;14:559–571.
    5. Gibson D. G., Benders G. A., Axelrod K. C., Zaveri J., Algire M. A., Moodie M., Montague M. G., Venter J. C. Proc. Natl. Acad. Sci. U. S. A. 2008;105:20404–20409.