Protein Sci. 2002 Dec;11(12):2804-13.
doi: 10.1110/ps.0203902.

Thoroughly sampling sequence space: large-scale protein design of structural ensembles

Stefan M Larson et al. Protein Sci. 2002 Dec.

Abstract

Modeling the inherent flexibility of the protein backbone as part of computational protein design is necessary to capture the behavior of real proteins and is a prerequisite for the accurate exploration of protein sequence space. We present the results of a broad exploration of sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. A distributed computing architecture has allowed us to generate hundreds of thousands of diverse sequences for a set of 253 naturally occurring proteins, allowing exciting insights into the nature of protein sequence space. Designing to a structural ensemble produces a much greater diversity of sequences than previous studies have reported, and homology searches using profiles derived from the designed sequences against the Protein Data Bank show that the relevance and quality of the sequences are not diminished. The designed sequences have greater overall diversity than corresponding natural sequence alignments, and no direct correlations are seen between the diversity of natural sequence alignments and the diversity of the corresponding designed sequences. For structures in the same fold, the sequence entropies of the designed sequences cluster together tightly. This tight clustering of sequence entropies within a fold and the separation of sequence entropy distributions for different folds suggest that the diversity of designed sequences is primarily determined by a structure's overall fold, and that the designability principle postulated from studies of simple models holds in real proteins. This has important implications for experimental protein design and engineering, as well as providing insight into protein evolution.
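The abstract compares sequence entropies across folds, and the Fig. 2 caption defines sequence entropy as the mean residue entropy. The sketch below shows one way such a quantity could be computed from a set of equal-length designed sequences. It assumes Shannon entropy with the natural logarithm over observed amino acid frequencies at each position; the log base, any normalization, and the function names `residue_entropy` and `sequence_entropy` are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def residue_entropy(column):
    """Shannon entropy (natural log, an assumption) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def sequence_entropy(sequences):
    """Mean residue entropy over all positions of equal-length sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    # zip(*sequences) yields one tuple of residues per position
    return sum(residue_entropy(col) for col in zip(*sequences)) / length
```

A fully conserved position contributes zero, and a position split evenly between two residues contributes ln 2, so a larger value indicates a more diverse set of designed sequences.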

Figures

Fig. 1.
Ten representative backbone traces from the structural ensemble used in designing sequences for 1abo, the SH3 domain from Abl tyrosine kinase. All structures are within 1 Å RMSD of each other.
Fig. 2.
Entropy distributions of designed and sterically allowed residues and sequences. (A) Residue entropies of all designed positions are plotted in black. In addition, the set of all sterically allowed rotamers at each position of each structure was calculated; the distribution of residue entropies for this set is plotted in gray. (B) The sequence entropy (mean residue entropy) for each structure was calculated. The distribution of sequence entropies for the designed sequences is plotted in black, with the sequence entropy from the allowed rotamers in gray.
Fig. 3.
Distribution of average amino acid identity of the designed sequences to the native target sequence for 253 structures. Identity to the native target sequence was calculated first for the set of sequences designed using a single fixed backbone as the target template (all residues: black dashed line; buried residues: gray dashed line). Using structural ensembles of 100 structural variants as target templates narrows and lowers the distribution of identity to the native target sequence (all residues: black solid line; buried residues: gray solid line).
Fig. 4.
Sequence entropy increases with the size of the structural ensemble used for design. The traces represent the sequence entropy of the designed sequences obtained when using increasing numbers of structural variants as targets for design. The black traces represent the two structures that produced sequence sets with the highest and lowest average sequence entropy. The gray traces are for 100 different structures randomly picked from the remaining 251 proteins.
Fig. 5.
Results of PSI-BLAST searches against the Protein Data Bank using sequence profiles generated from the designed sequences. Two hundred forty-one of the 253 structures (those that gave hits) are represented here, ranked along the x-axis by the E-value of the most significant hit obtained from that structure’s designed sequence profile. Dark columns represent sequence profiles that gave hits against true structural homologues (true positives). Light columns identify sequence profiles that produced hits to nonhomologs (false positives). A threshold of E < 1.0 gives an accuracy of 92% (176 of 186) for 74% (186 of 253) of all sequence profile searches. The gray points plot the average amino acid identity of each sequence profile to the native target sequence.
Fig. 6.
Sequence entropy distributions of designed sequences, grouped by structure into folds. The six folds are identified by the names corresponding to their PFAM sequence families. The frequencies for each fold are normalized to unity. The sequence entropy distribution for all 253 structures is also shown.
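
The identity measure reported in Figs. 3 and 5 (average amino acid identity of designed sequences to the native target sequence) can be sketched as a simple position-by-position comparison. The code below is a minimal illustration that assumes designed and native sequences have the same length and need no alignment; the function names and the optional `positions` argument (for example, indices of buried residues) are hypothetical and not taken from the paper.

```python
def identity_to_native(designed, native, positions=None):
    """Fraction of compared positions where the designed residue matches the native one."""
    if len(designed) != len(native):
        raise ValueError("designed and native sequences must have equal length")
    # Compare every position unless a subset (e.g. buried residues) is given
    idx = range(len(native)) if positions is None else list(positions)
    if len(idx) == 0:
        raise ValueError("no positions to compare")
    matches = sum(designed[i] == native[i] for i in idx)
    return matches / len(idx)


def mean_identity(designs, native, positions=None):
    """Average identity to the native sequence over a set of designed sequences."""
    if not designs:
        raise ValueError("at least one designed sequence is required")
    return sum(identity_to_native(d, native, positions) for d in designs) / len(designs)
```

Passing a list of buried-residue indices as `positions` gives the restricted average that the Fig. 3 caption reports separately from the all-residue average.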
