Thoroughly sampling sequence space: large-scale protein design of structural ensembles

Stefan M Larson¹, Jeremy L England, John R Desjarlais, Vijay S Pande

PMID: 12441379
PMCID: PMC2373757
DOI: 10.1110/ps.0203902

Protein Sci. 2002 Dec;11(12):2804-13.

Affiliation

¹ Chemistry Department and Biophysics Program, Stanford University, California 94305, USA.

Abstract

Modeling the inherent flexibility of the protein backbone as part of computational protein design is necessary to capture the behavior of real proteins and is a prerequisite for the accurate exploration of protein sequence space. We present the results of a broad exploration of sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. A distributed computing architecture has allowed us to generate hundreds of thousands of diverse sequences for a set of 253 naturally occurring proteins, allowing exciting insights into the nature of protein sequence space. Designing to a structural ensemble produces a much greater diversity of sequences than previous studies have reported, and homology searches using profiles derived from the designed sequences against the Protein Data Bank show that the relevance and quality of the sequences are not diminished. The designed sequences have greater overall diversity than corresponding natural sequence alignments, and no direct correlations are seen between the diversity of natural sequence alignments and the diversity of the corresponding designed sequences. For structures in the same fold, the sequence entropies of the designed sequences cluster together tightly. This tight clustering of sequence entropies within a fold and the separation of sequence entropy distributions for different folds suggest that the diversity of designed sequences is primarily determined by a structure's overall fold, and that the designability principle postulated from studies of simple models holds in real proteins. This has important implications for experimental protein design and engineering, as well as providing insight into protein evolution.

Figures

**Fig. 1.**
Ten representative backbone traces from the structural ensemble used in designing sequences for 1abo, the SH3 domain from Abl tyrosine kinase. All structures are within 1 Å RMSD of each other.

**Fig. 2.**
Entropy distributions of designed and sterically allowed residues and sequences. (A) Residue entropies of all designed positions are plotted in black. In addition, the set of all sterically allowed rotamers at each position of each structure was calculated; the distribution of residue entropies for this set is plotted in gray. (B) The sequence entropy (mean residue entropy) for each structure was calculated. The distribution of sequence entropies for the designed sequences is plotted in black, with the sequence entropy from the allowed rotamers in gray.
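The caption defines sequence entropy as the mean residue entropy over a structure's positions. A minimal sketch of that calculation follows, assuming that residue entropy is the Shannon entropy of the amino acid frequencies observed at one position across the designed sequences, and that the natural logarithm is used; neither the logarithm base nor the function names below are stated in this record, so treat them as illustrative assumptions.

```python
import math
from collections import Counter


def residue_entropy(column):
    """Shannon entropy (natural log, an assumption) of the residues at one position."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def sequence_entropy(sequences):
    """Mean residue entropy over all positions of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    # zip(*sequences) yields one tuple of residues per position.
    return sum(residue_entropy(col) for col in zip(*sequences)) / length


# Toy data, not taken from the study: positions 1 and 2 are invariant,
# positions 3 and 4 each split evenly between two residues.
designed = ["ACDA", "ACEA", "ACDG", "ACEG"]
entropy = sequence_entropy(designed)
```

With this definition an invariant position contributes zero, and a position that uses all 20 amino acids equally often contributes the maximum, ln 20.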

**Fig. 3.**
Distribution of average amino acid identity of the designed sequences to the native target sequence for 253 structures. Identity to the native sequence was first calculated for the set of sequences designed using only a single fixed backbone as the target template (all residues: black dashed line; buried residues: gray dashed line). Using structural ensembles of 100 structural variants as target templates narrows the distribution and lowers the identity to the native sequence (all residues: black solid line; buried residues: gray solid line).
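The identity measure in this caption can be sketched as follows, assuming that designed and native sequences have the same length and are compared position by position without gaps. The `positions` argument is a hypothetical way to restrict the comparison to a subset such as buried residues; how buried residues were identified is not described in this record.

```python
def percent_identity(designed, native, positions=None):
    """Percentage of compared positions at which designed and native residues match."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if positions is None:
        positions = range(len(native))
    positions = list(positions)
    matches = sum(designed[i] == native[i] for i in positions)
    return 100.0 * matches / len(positions)


def average_identity(designed_sequences, native, positions=None):
    """Mean percent identity of a set of designed sequences to the native sequence."""
    values = [percent_identity(seq, native, positions) for seq in designed_sequences]
    return sum(values) / len(values)


# Toy data, not taken from the study.
native = "ACDE"
designed = ["ACDE", "ACGG", "TTTT"]
overall = average_identity(designed, native)                    # all positions
subset = average_identity(designed, native, positions=[0, 1])   # hypothetical buried set
```

Computing one such average per structure, once for single-backbone designs and once for ensemble designs, gives the values whose distributions the figure compares.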

**Fig. 4.**
Sequence entropy increases with the size of the structural ensemble used for design. The traces represent the sequence entropy of the designed sequences obtained when using increasing numbers of structural variants as targets for design. The black traces represent the two structures that produced sequence sets with the highest and lowest average sequence entropy. The gray traces are for 100 different structures randomly picked from the remaining 251 proteins.

**Fig. 5.**
Results of PSI-BLAST searches against the Protein Data Bank using sequence profiles generated from the designed sequences. Two hundred forty-one of the 253 structures (those that gave hits) are represented here, ranked along the x-axis by the E-value of the most significant hit obtained from that structure’s designed sequence profile. Dark columns represent sequence profiles that gave hits against true structural homologues (true positives). Light columns identify sequence profiles that produced hits to nonhomologs (false positives). A threshold of E < 1.0 gives an accuracy of 92% (176 of 186) for 74% (186 of 253) of all sequence profile searches. The gray points plot the average amino acid identity of each sequence profile to the native target sequence.

**Fig. 6.**
Sequence entropy distributions of designed sequences, grouped by structure into folds. The six folds are identified by the names corresponding to their PFAM sequence families. The frequencies for each fold are normalized to unity. The sequence entropy distribution for all 253 structures is also shown.

Cited by

Computationally designed libraries of fluorescent proteins evaluated by preservation and diversity of function.
Treynor TP, Vizcarra CL, Nedelcu D, Mayo SL. Proc Natl Acad Sci U S A. 2007 Jan 2;104(1):48-53. doi: 10.1073/pnas.0609647103. Epub 2006 Dec 19. PMID: 17179210.
Toward full-sequence de novo protein design with flexible templates for human beta-defensin-2.
Fung HK, Floudas CA, Taylor MS, Zhang L, Morikis D. Biophys J. 2008 Jan 15;94(2):584-99. doi: 10.1529/biophysj.107.110627. Epub 2007 Sep 7. PMID: 17827237.
Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design.
Smith CA, Kortemme T. PLoS One. 2011;6(7):e20451. doi: 10.1371/journal.pone.0020451. Epub 2011 Jul 18. PMID: 21789164.
Semirational Directed Evolution of Loop Regions in Aspergillus japonicus β-Fructofuranosidase for Improved Fructooligosaccharide Production.
Trollope KM, Görgens JF, Volschenk H. Appl Environ Microbiol. 2015 Oct;81(20):7319-29. doi: 10.1128/AEM.02134-15. Epub 2015 Aug 7. PMID: 26253664.
Use of designed sequences in protein structure recognition.
Kumar G, Mudgal R, Srinivasan N, Sandhya S. Biol Direct. 2018 May 9;13(1):8. doi: 10.1186/s13062-018-0209-6. PMID: 29776380.

References

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
2. Baldwin, E.P., Hajiseyedjavadi, O., Baase, W.A., and Matthews, B.W. 1993. The role of backbone flexibility in the accommodation of variants that repack the core of T4 lysozyme. Science 262: 1715–1718.
3. Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., and Sonnhammer, E.L. 2000. The Pfam protein families database. Nucleic Acids Res. 28: 263–266.
4. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
5. Bornscheuer, U.T. and Pohl, M. 2001. Improved biocatalysts by directed evolution and rational protein design. Curr. Opin. Chem. Biol. 5: 137–143.
