PLoS One. 2015 Apr 22;10(4):e0123998.

doi: 10.1371/journal.pone.0123998. eCollection 2015.

Building a better fragment library for de novo protein structure prediction

Saulo H P de Oliveira¹, Jiye Shi², Charlotte M Deane¹

Affiliations

¹ Department of Statistics, Oxford University, Oxford, Oxfordshire, United Kingdom.
² Department of Informatics, UCB Pharma, Slough, United Kingdom; Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai, China.

PMID: 25901595
PMCID: PMC4406757
DOI: 10.1371/journal.pone.0123998

Saulo H P de Oliveira et al. PLoS One. 2015.

Abstract

Fragment-based approaches are the current standard for de novo protein structure prediction. These approaches rely on accurate and reliable fragment libraries to generate good structural models. In this work, we describe a novel method for structure fragment library generation and its application in fragment-based de novo protein structure prediction. The importance of correct testing procedures in assessing the quality of fragment libraries is demonstrated, in particular the exclusion of homologs to the target from the libraries to correctly simulate a de novo protein structure prediction scenario, something which surprisingly is not always done. We demonstrate that fragments presenting different predominant predicted secondary structures should be treated differently during the fragment library generation step and that exhaustive and random search strategies should both be used. This information was used to develop a novel method, Flib. On a validation set of 41 structurally diverse proteins, Flib libraries present both higher precision and higher coverage than two state-of-the-art methods, NNMake and HHFrag. Flib also achieves better precision and coverage on the set of 275 protein domains used in the two previous experiments of the Critical Assessment of Structure Prediction (CASP9 and CASP10). We compared Flib libraries against NNMake libraries in a structure prediction context. Of the 13 cases in which a correct answer was generated, Flib models were more accurate than NNMake models for 10. Flib is available for download at: http://www.stats.ox.ac.uk/research/proteins/resources.

Conflict of interest statement

Competing Interests: One of the authors [JS] is currently employed at UCB Pharma. This does not alter the authors' adherence to PLOS ONE policies on sharing data and materials.

Figures

**Fig 1. Schematics of Flib.**
Starting from a target sequence, we predict secondary structure (SS) and torsion angles for the target (green). We extract fragments from a template database using a combination of random and exhaustive approaches. Fragments are extracted for each target position. A library containing the top-3000 fragments per position is compiled using the SS score and the Ramachandran-specific sequence score (LIB3000). LIB3000 is then sorted according to the torsion angle score and the top-20 fragments per position are selected to comprise the final library. The final library (FLIB) is complemented by fragments that originate from an enrichment routine (yellow) and fragments that originate from protein threading hits (orange).
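
The two-stage selection described in this caption can be sketched as follows. This is a minimal illustration and not Flib's implementation: the `Fragment` record, its field names, the "higher is better" direction of every score, and the plain sum used to combine the SS and sequence scores are all assumptions. The caption only states that 3000 fragments per position are kept using the SS and sequence scores, and that these are then ranked by the torsion angle score to keep 20.

```python
from collections import defaultdict
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Fragment:
    position: int         # target position the fragment was extracted for
    template_id: str      # hypothetical identifier of the source template
    ss_score: float       # secondary structure score (assumed: higher is better)
    seq_score: float      # Ramachandran-specific sequence score (assumed: higher is better)
    torsion_score: float  # torsion angle score (assumed: higher is better)


def select_library(fragments, first_stage=3000, final=20):
    """Keep `first_stage` fragments per position by SS + sequence score,
    then keep the `final` best of those by torsion angle score."""
    by_position = defaultdict(list)
    for frag in fragments:
        by_position[frag.position].append(frag)

    library = {}
    for position, candidates in by_position.items():
        # Stage 1 (LIB3000 in the caption); summing the two scores is an assumption.
        stage1 = heapq.nlargest(
            first_stage, candidates, key=lambda f: f.ss_score + f.seq_score
        )
        # Stage 2: re-rank the survivors by torsion angle score only.
        library[position] = heapq.nlargest(
            final, stage1, key=lambda f: f.torsion_score
        )
    return library
```

The enrichment and threading steps that complement the final library are not covered by this sketch, because the caption does not describe how they work.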

**Fig 2. Comparison between Flib’s random extraction and exhaustive extraction methods.**
Analysis of the precision of fragment libraries generated by Flib using two different approaches for fragment extraction: random extraction (red), and exhaustive extraction (blue). We varied the RMSD to native structure cutoff to define a good fragment from 0.1 to 2.0 Angstroms (x-axis). The average precision on the 43 proteins in the test data set (left) and the average coverage (right) are shown for fragment libraries containing the top-1000 scoring fragments extracted exhaustively or at random. The precision indicates the proportion of good fragments in the generated libraries (y-axis).

**Fig 3. Effect of protein threading hits on fragment library quality.**
Analysis of the impact of fragments extracted from protein threading hits. Precision and coverage are shown for the fragment libraries generated by LIB20, Protein Threading Hits and Flib (a combination of the other two approaches). We varied the RMSD to native structure cutoff to define a good fragment from 0.1 to 2.0 Angstroms (x-axis). The average precision and coverage on the 43 proteins in the test data set are shown for each approach. The precision indicates the proportion of good fragments in the generated libraries (y-axis). The coverage indicates the proportion of residues of the target represented by at least one good fragment.
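
The two metrics defined in this caption can be computed as in the sketch below. It makes several assumptions that are not stated in the captions: a fragment is a hypothetical `(start, length, rmsd)` tuple with a 0-based start, a fragment counts as good when its RMSD is strictly below the cutoff, and a good fragment represents every residue it spans.

```python
def precision_and_coverage(fragments, target_length, cutoff):
    """Return (precision, coverage) for one target at one RMSD cutoff.

    fragments: iterable of (start, length, rmsd) tuples (hypothetical layout).
    """
    fragments = list(fragments)
    if not fragments or target_length <= 0:
        return 0.0, 0.0

    # Assumption: "good" means RMSD to native strictly below the cutoff.
    good = [(start, length) for start, length, rmsd in fragments if rmsd < cutoff]

    # Precision: proportion of good fragments in the library.
    precision = len(good) / len(fragments)

    # Coverage: proportion of residues spanned by at least one good fragment.
    covered = set()
    for start, length in good:
        covered.update(range(start, min(start + length, target_length)))
    coverage = len(covered) / target_length
    return precision, coverage


# Toy library with invented values, swept over the 0.1 to 2.0 Angstrom range.
toy_library = [(0, 9, 0.4), (0, 9, 1.8), (5, 9, 0.9), (20, 9, 2.5)]
for k in range(1, 21):
    cutoff = k / 10
    p, c = precision_and_coverage(toy_library, target_length=30, cutoff=cutoff)
    print(f"cutoff={cutoff:.1f}  precision={p:.2f}  coverage={c:.2f}")
```

Averaging these per-target values over a data set would give curves of the kind plotted in the figures.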

**Fig 4. Relationship between secondary structure class (SS-Class) and fragment quality.**
Boxplot of the RMSD to native structure (y-axis) of 200 fragments per target position (x-axis) for the protein 1E6K. The top-200 scoring fragments from its LIB3000 were selected and are displayed. This subset of LIB3000 was chosen to improve the performance of the data visualization. Four different SS classes are defined: *majority α-helical* (green), *majority β-strand* (red), *majority loop* (blue) and *other* (black). Positions for which fragments are *majority α-helical* or *majority β-strand* present significantly lower RMSDs to the native structure and a smaller spread compared to *majority loop* and *other* positions.
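
One way to assign the four SS classes named in this caption is sketched below. The caption does not define "majority", so the rule used here is an assumption: a class wins when strictly more than half of the fragment's predicted residues carry that state. The three-letter alphabet (`H` for helix, `E` for strand, `C` for loop) is also an assumption.

```python
from collections import Counter

# Assumed three-state secondary structure alphabet and the class each state maps to.
STATE_TO_CLASS = (
    ("H", "majority alpha-helical"),
    ("E", "majority beta-strand"),
    ("C", "majority loop"),
)


def ss_class(predicted_ss):
    """Classify a fragment from its predicted secondary structure string."""
    counts = Counter(predicted_ss)
    n = len(predicted_ss)
    for state, label in STATE_TO_CLASS:
        # Assumption: "majority" means strictly more than half of the residues.
        if counts[state] * 2 > n:
            return label
    return "other"
```

For example, `ss_class("HHHHHHCCC")` returns `"majority alpha-helical"`, while a fragment split evenly between states falls into `"other"`.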

**Fig 5. Comparison between HHFrag, NNMake and Flib.**
Precision (left) and coverage (right) of fragment libraries generated using NNMake (red), HHFrag (green) and Flib (blue). The precision and coverage of the fragment libraries are averaged over a set of 41 structurally diverse proteins. We varied the RMSD cutoff to define a good fragment (x-axis) and evaluated the precision (proportion of good fragments in the libraries) and coverage (proportion of protein residues represented by a good fragment) for each method.

**Fig 6. Comparison between HHFrag, NNMake and Flib.**
Precision of fragment libraries generated using NNMake (red), HHFrag (green), and Flib (blue) separated by SS class. The precision of the fragment libraries was averaged over a set of 41 structurally diverse proteins. We varied the cutoff to define a good fragment (x-axis) and evaluated the precision (proportion of good fragments in the libraries) for each method within four different SS classes: majority α-helical (top left), majority β-strand (top right), majority loop (bottom right) and other (bottom left).

**Fig 7. Effect of Homologs on fragment library quality.**
Precision (left) and coverage (right) of fragment libraries generated using three different methods: Rosetta’s NNMake (crosses), our method Flib (circles), and HHFrag (triangles). We varied the cutoff to define a good fragment (x-axis) and evaluated the precision (proportion of good fragments in the libraries) and coverage (proportion of protein residues represented by a good fragment) for each of the methods, both when homologs are included (red and orange) and when homologs are excluded (light and dark green). Homologs are always excluded from Flib (blue).

**Fig 8. TM-Score of the best decoy as generated by Flib + SAINT2 and by NNMake + SAINT2.**
For each approach, 1,000 decoys were generated and the best decoy (highest TM-Score when superimposed to the native structure) was chosen. Results are shown for the 41 proteins in our data set. We compared the TM-Score of the best decoy generated by Flib + SAINT2 (x-axis) against that of NNMake + SAINT2. Each point represents a target. Point color represents the target's SCOP class and the point size is proportional to the protein length. The dotted lines indicate the cutoff for defining an accurate model (TM-Score > 0.5). Flib libraries generated accurate models for 12 of the 41 cases in our PDB-representative set. NNMake libraries generated an accurate model for 8 of the 41 cases. On the 13 cases for which accurate models were generated, Flib libraries performed better in 10 cases. Flib outperforms NNMake in 31 of the 41 cases.
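
The counts reported in this caption could be derived from per-target best-decoy TM-Scores as in the sketch below. The function name, the dictionary layout, and the reading of "cases for which accurate models were generated" as targets where at least one of the two methods exceeds the cutoff are assumptions. The scores in the usage lines are invented toy values, not results from the paper.

```python
def summarize(flib_tm, nnmake_tm, cutoff=0.5):
    """Compare two methods given dicts of target id -> best-decoy TM-Score."""
    targets = sorted(flib_tm.keys() & nnmake_tm.keys())

    def accurate(scores, target):
        # The caption defines an accurate model as TM-Score > 0.5.
        return scores[target] > cutoff

    # Assumption: the "accurate" subset is the union over both methods.
    either = [t for t in targets if accurate(flib_tm, t) or accurate(nnmake_tm, t)]
    return {
        "targets": len(targets),
        "flib_accurate": sum(accurate(flib_tm, t) for t in targets),
        "nnmake_accurate": sum(accurate(nnmake_tm, t) for t in targets),
        "accurate_by_either": len(either),
        "flib_better_overall": sum(flib_tm[t] > nnmake_tm[t] for t in targets),
        "flib_better_among_accurate": sum(flib_tm[t] > nnmake_tm[t] for t in either),
    }


# Invented toy scores for three hypothetical targets.
toy_flib = {"a": 0.60, "b": 0.40, "c": 0.30}
toy_nnmake = {"a": 0.55, "b": 0.52, "c": 0.20}
summary = summarize(toy_flib, toy_nnmake)
```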

Cited by

Computational Methods for the Elucidation of Protein Structure and Interactions.
Edmunds NS, McGuffin LJ. Methods Mol Biol. 2021;2305:23-52. doi: 10.1007/978-1-0716-1406-8_2. PMID: 33950383. Review.
Enhancing fragment-based protein structure prediction by customising fragment cardinality according to local secondary structure.
Abbass J, Nebel JC. BMC Bioinformatics. 2020 May 1;21(1):170. doi: 10.1186/s12859-020-3491-0. PMID: 32357827. Free PMC article.
Improved fragment-based protein structure prediction by redesign of search heuristics.
Kandathil SM, Garza-Fabre M, Handl J, Lovell SC. Sci Rep. 2018 Sep 12;8(1):13694. doi: 10.1038/s41598-018-31891-8. PMID: 30209258. Free PMC article.
Toward a detailed understanding of search trajectories in fragment assembly approaches to protein structure prediction.
Kandathil SM, Handl J, Lovell SC. Proteins. 2016 Apr;84(4):411-26. doi: 10.1002/prot.24987. Epub 2016 Feb 23. PMID: 26799916. Free PMC article.
Assigning secondary structure in proteins using AI.
Antony JV, Madhu P, Balakrishnan JP, Yadav H. J Mol Model. 2021 Aug 17;27(9):252. doi: 10.1007/s00894-021-04825-x. PMID: 34402969.

Associated data

figshare/10.6084/M9.FIGSHARE.1328655
