Bioinformatics. 2020 Jul 1;36(Suppl_1):i317-i325.
doi: 10.1093/bioinformatics/btaa336.

Boosting the accuracy of protein secondary structure prediction through nearest neighbor search and method hybridization

Spencer Krieger et al. Bioinformatics.

Abstract

Motivation: Protein secondary structure prediction is a fundamental precursor to many bioinformatics tasks. Nearly all state-of-the-art tools compute their secondary structure predictions without explicitly leveraging the vast number of proteins whose structure is known. Leveraging this additional information in a so-called template-based method has the potential to significantly boost prediction accuracy.

Method: We present a new hybrid approach to secondary structure prediction that gains the advantages of both template- and non-template-based methods. Our core template-based method is an algorithmic approach that uses metric-space nearest neighbor search over a template database of fixed-length amino acid words to determine estimated class-membership probabilities for each residue in the protein. These probabilities are then input to a dynamic programming algorithm that finds a physically valid maximum-likelihood prediction for the entire protein. Our hybrid approach exploits a novel accuracy estimator for our core method, which estimates the unknown true accuracy of its prediction, to discern when to switch between template- and non-template-based methods.
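The dynamic programming step can be illustrated with a minimal sketch. It assumes that "physically valid" means every maximal run of a class must reach a minimum length; the abstract does not state the actual constraints, so the `min_len` values, the class ordering (0 = α, 1 = β, 2 = γ) and the function name `constrained_labeling` are hypothetical and not taken from Nnessy.

```python
import math

def constrained_labeling(probs, min_len, eps=1e-9):
    """Maximum-likelihood labeling where each run of class c has length >= min_len[c].

    probs[i][c] is the estimated probability that residue i is in class c.
    The minimum run lengths are an illustrative assumption, not the paper's rules.
    Assumes at least two classes.
    """
    n, k = len(probs), len(min_len)
    if n == 0:
        return []
    # prefix[c][i] = sum of log-probabilities of class c over residues 0..i-1
    prefix = [[0.0] * (n + 1) for _ in range(k)]
    for c in range(k):
        for i in range(n):
            prefix[c][i + 1] = prefix[c][i] + math.log(max(probs[i][c], eps))

    neg_inf = float("-inf")
    # best[i][c] = best score for residues 0..i-1 whose final run has class c
    best = [[neg_inf] * k for _ in range(n + 1)]
    back = [[None] * k for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(k):
            # the final run covers residues start..i-1 and must be long enough
            for start in range(i - min_len[c] + 1):
                seg = prefix[c][i] - prefix[c][start]
                if start == 0:
                    cand, prev = seg, None
                else:
                    # the preceding run must belong to a different class
                    prev = max((p for p in range(k) if p != c),
                               key=lambda p: best[start][p])
                    cand = best[start][prev] + seg
                if cand > best[i][c]:
                    best[i][c], back[i][c] = cand, (start, prev)

    c = max(range(k), key=lambda p: best[n][p])
    if best[n][c] == neg_inf:
        raise ValueError("no labeling satisfies the minimum run lengths")
    labels = [0] * n
    i = n
    while i > 0:  # walk the back-pointers one run at a time
        start, prev = back[i][c]
        labels[start:i] = [c] * (i - start)
        i, c = start, prev
    return labels
```

With a minimum α run of 3, an isolated residue that favors α among γ neighbors is relabeled γ, because no valid α run containing it scores better than an all-γ labeling.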

Results: On challenging CASP benchmarks, the resulting hybrid approach boosts the state-of-the-art Q8 accuracy by more than 2-10%, and Q3 accuracy by more than 1-3%, yielding the most accurate method currently available for both 3- and 8-state secondary structure prediction.
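For readers unfamiliar with the metric, Q3 and Q8 accuracy are the fraction of residues whose predicted state matches the true state, under a 3-state or 8-state alphabet respectively. A minimal sketch follows; the function name `q_accuracy` and the one-letter state strings are illustrative assumptions.

```python
def q_accuracy(predicted, actual):
    """Fraction of residues whose predicted state equals the true state.

    Works for any state alphabet, so the same function gives Q3 or Q8
    depending on whether 3-state or 8-state strings are passed in.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty protein")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)
```

For example, `q_accuracy("HHHEECCC", "HHHEEECC")` scores seven matching residues out of eight.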

Availability and implementation: A preliminary implementation in a new tool we call Nnessy is available free for non-commercial use at http://nnessy.cs.arizona.edu.

Figures

Fig. 1.
Template database words overlapping a given query residue contribute to its class membership probability. Across the bottom is a portion of the amino acid sequence of an actual query protein. Stacked above the residue highlighted by the arrows in the query sequence are the 1-nearest-neighbor words from our template database found for the query words containing the highlighted residue. The known secondary structure class of each amino acid is indicated by its color: blue for α, green for β and black for γ. The example is for word length 23, and positional weights ω from our implementation (namely, uniform across a word, except the central weight is doubled). For this word length and positional weights, the α-, β- and γ-class membership probabilities for the highlighted residue are 0.00, 0.62 and 0.38, respectively.
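The weighted vote described in this caption can be sketched as follows. This is a simplified illustration, not the Nnessy implementation: it assumes each overlapping nearest-neighbor word votes for the class it carries at the position aligned with the query residue, with the vote scaled by the positional weight ω at that position, and it ignores any dependence on word distance that the full method may include. The names `positional_weights` and `class_membership` are hypothetical.

```python
def positional_weights(word_len):
    """Uniform weights across a word, with the central weight doubled."""
    weights = [1.0] * word_len
    weights[word_len // 2] = 2.0
    return weights

def class_membership(overlaps, weights, classes=("α", "β", "γ")):
    """Estimate class-membership probabilities for one query residue.

    overlaps: list of (offset, structure) pairs, one per query word that
    contains the residue. `offset` is the residue's position inside that
    word, and `structure` is the known class string of the word's nearest
    neighbor in the template database.
    """
    totals = {c: 0.0 for c in classes}
    for offset, structure in overlaps:
        totals[structure[offset]] += weights[offset]
    z = sum(totals.values())
    if z == 0:
        raise ValueError("no overlapping words for this residue")
    return {c: t / z for c, t in totals.items()}
```

With a toy word length of 5 (weights 1, 1, 2, 1, 1), a residue covered by five words whose neighbors carry β at offsets 0, 1 and 2 and γ at offsets 3 and 4 receives β weight 4 and γ weight 2, giving probabilities of 0, 2/3 and 1/3 for α, β and γ.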
Fig. 2.
Average accuracy on PDB2019, using a fraction of the full template database. The horizontal axis is the percentage of the full template database used, for a random subset of the database. The vertical axis is the accuracy on PDB2019. The solid curves in the plot give the Q3 and Q8 accuracy of Nnessy. Dashed lines represent the accuracy of other methods on PDB2019; their intersection point with a curve gives the fractional size of the template database at which Nnessy meets their accuracy. Only tools whose accuracies intersect each curve are shown. By reducing its template database size, Nnessy can be further sped up, and still exceed the accuracy of state-of-the-art tools on such datasets.
Fig. 3.
Comparison of residue accuracy to overlapping word distance. Each blue circle represents a residue from the evaluation datasets, with the average distance of its overlapping nearest-neighbor words given by the horizontal axis, and the average accuracy of the 100 residues with closest average word distance shown on the vertical axis. The dashed lines show the fitted accuracy estimator, which gives the estimated probability that the predicted state of a residue is correct, as a function of its average word distance.
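The vertical coordinate of each plotted point can be reproduced with a short sketch. It assumes the window of closest residues includes the residue itself and breaks ties arbitrarily, neither of which the caption specifies; the function name `local_accuracy` is hypothetical, and the fitted estimator curve itself is not reproduced because its functional form is not given here.

```python
def local_accuracy(distances, correct, k=100):
    """Average correctness of the k residues closest in average word distance.

    distances[j]: average distance of the nearest-neighbor words overlapping residue j.
    correct[j]:   1 if residue j was predicted correctly, else 0.
    Quadratic-time reference version, meant for illustration only.
    """
    n = len(distances)
    k = min(k, n)
    result = []
    for d in distances:
        # indices of the k residues whose distance is closest to d
        nearest = sorted(range(n), key=lambda j: abs(distances[j] - d))[:k]
        result.append(sum(correct[j] for j in nearest) / k)
    return result
```

A production version would sort the residues by distance once and slide a window over them rather than re-sorting for every residue.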
Fig. 4.
Comparison of true and estimated accuracy for our accuracy estimator. The blue circles represent proteins from all datasets, with the estimated accuracy of the Nnessy prediction for the protein shown on the horizontal axis, and the true accuracy of the prediction on the vertical axis. Along the dotted line, estimated accuracy is equal to true accuracy. The dashed line is at the threshold used by our hybrid method; circles to its right are Nnessy predictions chosen by the hybrid method. The red discs show the average true accuracy and average estimated accuracy of proteins in the four CASP datasets; green discs show the same for the six PDB datasets. The estimated accuracy of a protein is from an estimator fitted on evaluation datasets that do not include that protein. The closer a circle is to the dotted line, the more accurate the accuracy estimator.
Fig. 5.
Visualization of the hybrid method. Each value along the horizontal axis corresponds to a single protein from CASP datasets. These CASP proteins are sorted along this axis by their Q8 accuracy for the Nnessy-PORTER hybrid. At the rank of each such protein in this sorted order, a blue circle and a green triangle are plotted, with the vertical axis giving the Q8 accuracy of their Nnessy and PORTER prediction, respectively. The solid black curve goes through the prediction that is selected by the hybrid method. Circles or triangles above this curve correspond to proteins for which the hybrid selection has suboptimal accuracy, while all those below are proteins for which the hybrid method is optimal.
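The selection rule behind the hybrid method can be sketched in a few lines: keep the template-based prediction when its estimated accuracy clears a threshold, and otherwise fall back to the non-template-based tool. The threshold value, the `Prediction` type, the function name `hybrid_select` and the choice of `>=` over `>` are all assumptions made for illustration; the actual threshold is not stated here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prediction:
    method: str   # name of the tool that produced the prediction
    states: str   # one predicted state per residue

def hybrid_select(template_pred, estimated_accuracy, fallback_pred, threshold):
    """Pick the template-based prediction only when its estimated accuracy is high enough.

    `threshold` is a hypothetical tuning parameter; the value used in the
    paper's hybrid method is not given in this text.
    """
    if estimated_accuracy >= threshold:
        return template_pred
    return fallback_pred
```

For instance, with a hypothetical threshold of 0.8, a protein whose Nnessy prediction has estimated accuracy 0.93 keeps the Nnessy prediction, while one with estimated accuracy 0.55 is handed to PORTER.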
