BMC Bioinformatics. 2007 Jun 14:8:201.

doi: 10.1186/1471-2105-8-201.

Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information

Gianluca Pollastri¹, Alberto J M Martin, Catherine Mooney, Alessandro Vullo

Affiliation

¹ Complex and Adaptive Systems Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland. gianluca.pollastri@ucd.ie

PMID: 17570843
PMCID: PMC1913928
DOI: 10.1186/1471-2105-8-201

Gianluca Pollastri et al. BMC Bioinformatics. 2007.

Abstract

Background: Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio.

Results: Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available.
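
As a rough illustration of the "simple structural frequency profiles" mentioned in the Results, the sketch below counts, for each query position, how often each secondary structure class occurs among the aligned templates. It is not the authors' implementation: the three-class alphabet (H, E, C), the gap symbol `-`, the equal weighting of templates, and the function names `frequency_profile` and `template_baseline` are all assumptions made for the example. The majority vote stands in for a baseline that reads the structure directly off the templates.

```python
from collections import Counter

CLASSES = "HEC"  # assumed 3-class alphabet: helix, strand, coil
GAP = "-"        # assumed symbol where a template does not cover the query


def frequency_profile(template_ss, length):
    """Return one {class: frequency} dict per query position.

    template_ss: list of strings already aligned to the query (each of the
    query's length, with GAP where the template has no residue).
    Positions covered by no template get an all-zero row.
    """
    profile = []
    for i in range(length):
        counts = Counter(t[i] for t in template_ss if t[i] in CLASSES)
        total = sum(counts.values())
        profile.append({c: (counts[c] / total if total else 0.0) for c in CLASSES})
    return profile


def template_baseline(profile, fallback="C"):
    """Majority vote over templates; use fallback where nothing is covered."""
    out = []
    for row in profile:
        best = max(CLASSES, key=lambda c: row[c])
        out.append(best if row[best] > 0 else fallback)
    return "".join(out)


# Hypothetical toy alignment of three templates to a 5-residue query.
templates = ["HHHE-", "HHEE-", "HCEE-"]
profile = frequency_profile(templates, 5)
baseline = template_baseline(profile)
```

In a predictor of the kind described here, rows of such a profile would be supplied as extra inputs alongside the sequence information, whereas the baseline uses the profile alone.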

Conclusion: The predictive systems are publicly available at http://distill.ucd.ie.

Figures

**Figure 1**
Distribution of secondary structure prediction accuracy as a function of sequence similarity to the best hit in PSI-BLAST templates. The blue bars represent predictions using templates (maximal sequence similarity allowed is 95%), the red bars template-less predictions (Porter). See text for details.

**Figure 2**
Distribution of secondary structure prediction accuracy as a function of the length of the best hit in PSI-BLAST templates. Maximal 30% identity between template and query allowed. The blue bars represent predictions using templates, the red bars template-less predictions. See text for details.

**Figure 3**
An example of a prediction by Porter_H compared with Porter, DSSP assignments, and the best template. Best template sequence similarity is 22%. Porter_H correctly identifies the first helix (taken from the template; Porter predicts a strand), but departs from the template and correctly assigns the second strand (a helix in the template).

**Figure 4**
Distribution of secondary structure prediction accuracy as a function of quality of the best hit in PSI-BLAST templates. Quality measured as Resolution+Rfactor/20. The blue bars represent predictions using templates, the red bars template-less predictions (Porter). See text for details.

**Figure 5**
Distribution of 4-class (4%, 25% and 50% exposed thresholds) solvent accessibility prediction accuracy as a function of sequence similarity to the best hit in PSI-BLAST templates. The blue bars represent predictions using templates (maximal sequence similarity allowed is 95%), the red bars template-less predictions. See text for details.

**Figure 6**
Distribution of 4-class (4%, 25% and 50% exposed thresholds) solvent accessibility prediction accuracy as a function of quality of the best hit in PSI-BLAST templates. Quality measured as Resolution+Rfactor/20. The blue bars represent predictions using templates, the red bars template-less predictions. See text for details.

**Figure 7**
Distribution of best-hit (blue) and average (red) sequence similarity in the PSI-BLAST templates for the S2171 set. Hits above 95% sequence similarity excluded.
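
The class thresholds and the template quality measure quoted in the captions above can be made concrete with a short sketch. It assumes that relative solvent accessibility is given as a percentage, that a value exactly on a threshold falls into the more exposed class, and that "Resolution+Rfactor/20" is read with ordinary operator precedence, with resolution in angstroms and the R-factor in percent. None of these conventions is stated on this page, and the function names are hypothetical.

```python
from bisect import bisect_right

THRESHOLDS = (4.0, 25.0, 50.0)  # percent exposed, taken from the captions


def accessibility_class(rsa_percent):
    """Map relative solvent accessibility (%) to a class index 0..3.

    Assumption: a value equal to a threshold goes to the more exposed class.
    """
    return bisect_right(THRESHOLDS, rsa_percent)


def template_quality(resolution, r_factor_percent):
    """Resolution + Rfactor/20 under the assumed reading; lower is better."""
    return resolution + r_factor_percent / 20.0
```

For example, under these assumptions a residue with 30% relative accessibility falls in class 2, and a template with 2.0 resolution and an R-factor of 20% scores 3.0.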
