Biomolecules. 2021 Nov 3;11(11):1627.
doi: 10.3390/biom11111627.

Discovering the Ultimate Limits of Protein Secondary Structure Prediction

Chia-Tzu Ho et al. Biomolecules.

Abstract

Secondary structure prediction (SSP) of proteins is an important structural biology technique with many applications. About 300 algorithms have been published in the past seven decades, with fierce competition in accuracy. In the first 60 years, the accuracy of three-state SSP rose from ~56% to 81%; since then, it has stayed at 81-86%. In the 1990s, the theoretical limit of three-state SSP accuracy was estimated to be 88%. Thus, SSP is now generally considered either no longer challenging or too challenging to improve. However, we found that the limit of three-state SSP might have been underestimated. Moreover, there is still much room for improving segment-based and eight-state SSPs, but the limits of these emerging topics have not been determined. This work performs large-scale sequence and structural analyses to estimate SSP accuracy limits and assess state-of-the-art SSP methods. The limit of three-state SSP is re-estimated to be ~92%, 4-5% higher than previously expected, indicating that SSP is still challenging. The estimated limit of eight-state SSP is 84-87%. Several proposals for improving future SSP algorithms are made based on our results. We hope that these findings will help move forward the development of SSP and all its applications.

Keywords: protein secondary structure prediction; protein sequence; protein sequence-based predictions; protein structure; structural biology.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure A1
A general schema of the methodology. Sequence and structural alignments between protein structural homologs were performed to estimate the upper limits of SSP accuracy, while random pairing between protein structures was conducted to establish the lower bound of SSP accuracy. Note that only aligned residues were considered in estimating SSP limits. The effects of disordered regions, which were likely present in the unaligned segments, might thus have been ignored.
Figure 1
Secondary structural consistency between homologs identified by PSI-BLAST from PDB. (a) Secondary structural consistency at different sequence identity levels. The horizontal axis indicates the sequence identity level where, for instance, 30 means that the sequence identities of homologous structures fall between ≥30% and <35%. These levels were defined according to Yang et al. [1]. The consistency dropped as the identity decreased, and our 10-repeat experimental results agreed well with [1] (the solid purple curve of Q3 versus the red). Yang et al. only analyzed high-resolution (<3 Å) X-ray structures. Following the same filtering criteria, our results (dotted purple curves) were closer to theirs. For easy observation, we enlarged the vertical axis of Q3 for identities ≥50. The secondary structural consistency reveals the structural difference between homologs, which is the ultimate factor limiting SSP accuracy. For instance, at the 30% identity level, the averaged consistency is 87.8%, meaning a 12.2% difference in secondary structure, which indicates that the SSP accuracy of such homologs can be at most ~88%. Previous studies measured the secondary structure consistency by sequence alignments [1,62]. We also performed structural alignments (black lines). (b) Secondary structural consistency at different sequence identity thresholds. The horizontal axis indicates the threshold where, for instance, 30 means that the sequence identities of homologous structures are ≥30%. Previous studies estimated the limit of SSP on the basis of three-state SSEs. We also made estimations based on the eight-state SSEs.
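The consistency measure described in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes that the two homologs' secondary structures are given as DSSP-style strings and that an alignment has already produced a list of equivalent residue index pairs. The function name `consistency`, the input layout, and the eight-to-three state reduction (H, G, I to helix; E, B to strand; everything else to coil) are illustrative assumptions; the reduction is one common convention and may differ from the one used in the paper.

```python
# Assumed 8-to-3 state reduction (a common convention, not confirmed by the source).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "-": "C", "C": "C",  # everything else is coil
}


def consistency(ss_a, ss_b, aligned_pairs, three_state=False):
    """Percentage of aligned residue pairs that share a secondary structure state.

    ss_a, ss_b    : secondary structure strings of the two homologs
    aligned_pairs : list of (i, j) index pairs judged equivalent by an alignment
    three_state   : if True, reduce eight states to three before comparing (Q3);
                    otherwise compare the eight-state labels directly (Q8)
    Only aligned residues are counted; unaligned residues are ignored.
    """
    if not aligned_pairs:
        raise ValueError("no aligned residues to compare")
    matches = 0
    for i, j in aligned_pairs:
        a, b = ss_a[i], ss_b[j]
        if three_state:
            a, b = DSSP_TO_3[a], DSSP_TO_3[b]
        if a == b:
            matches += 1
    return 100.0 * matches / len(aligned_pairs)


# Toy example with made-up strings and a gapless alignment.
ss_a = "HHHHEEEE--"
ss_b = "HHGGEEBE-T"
pairs = [(k, k) for k in range(len(ss_a))]
q8 = consistency(ss_a, ss_b, pairs)
q3 = consistency(ss_a, ss_b, pairs, three_state=True)
```

In this toy case the eight-state consistency is lower than the three-state one because G/H, B/E, and T/- differences disappear after reduction, which mirrors why Q8 limits sit below Q3 limits.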
Figure 2
Secondary structural consistency measured by sequence alignments between structural homologs verified by SCOP. (a) Three-state secondary structural consistency (Q3) at various sequence identity levels. When the structural homologs determined by SCOP were used to repeat the experiment of previous studies, the Q3 data of homologs sharing ≥30% identity were very similar to previous results. The purple curve of PSI-BLAST and PDB is taken from Figure 1a for comparison. The Q3 differed significantly at low identities, with SCOP-determined homologs scoring much higher than PSI-BLAST-identified ones. Note that the difference was mainly caused by how homologs were defined (manual curation vs. programmatic searches) rather than by the source of data (SCOP vs. PDB). The Q3 was measured by PSI-BLAST, classic BLAST, Water, and Stretcher. The classic BLAST was omitted from this figure for clarity because its curve lay close to that of PSI-BLAST (see Table S3 for full data, with the classic BLAST included). (b) Eight-state secondary structural consistency (Q8) at various sequence identity levels. (c) Q3 at various sequence identity cutoffs. The horizontal axis indicates the cutoff where, for instance, 30 and 90 mean that the identities of homologous structures are lower than 30% and 90%, respectively. Since protein datasets are typically prepared as nonredundant sets with identities lower than given cutoffs in practical research, all subsequent figures display only the consistency computed under given identity cutoffs instead of thresholds. (d) Q8 at various sequence identity cutoffs.
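The captions of Figures 1 and 2 use three ways of grouping homolog pairs by sequence identity: levels (a bin such as ≥30% and <35%), thresholds (≥30%), and cutoffs (<30%). The following Python sketch shows how the same set of pairs yields different averages under each grouping. The function name, the 5% bin width (inferred only from the "≥30% and <35%" example), and the toy numbers are assumptions for illustration, not data from the paper.

```python
def mean_consistency(pairs, kind, value, width=5.0):
    """Average consistency over homolog pairs selected by sequence identity.

    pairs : iterable of (identity_percent, consistency_percent) tuples
    kind  : "level"     keeps pairs with value <= identity < value + width
            "threshold" keeps pairs with identity >= value
            "cutoff"    keeps pairs with identity <  value
    Returns None when no pair falls into the selected group.
    """
    if kind == "level":
        kept = [c for ident, c in pairs if value <= ident < value + width]
    elif kind == "threshold":
        kept = [c for ident, c in pairs if ident >= value]
    elif kind == "cutoff":
        kept = [c for ident, c in pairs if ident < value]
    else:
        raise ValueError(f"unknown grouping: {kind}")
    return sum(kept) / len(kept) if kept else None


# Made-up (identity, consistency) values, purely illustrative.
toy_pairs = [(25.0, 80.0), (32.0, 88.0), (34.9, 86.0), (60.0, 95.0)]
at_level_30 = mean_consistency(toy_pairs, "level", 30)
above_threshold_30 = mean_consistency(toy_pairs, "threshold", 30)
below_cutoff_30 = mean_consistency(toy_pairs, "cutoff", 30)
```

A cutoff keeps only the more distant pairs, so it tends to give the lowest average of the three groupings, which matches the practical setting of nonredundant datasets described in the caption.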
Figure 3
Secondary structural consistency measured by structural alignments between structural homologs determined by SCOP. (a) Three-state secondary structural consistency (Q3) at various sequence identity levels. (b) Eight-state secondary structural consistency (Q8) at various sequence identity levels. (c) Q3 at various sequence identity cutoffs. (d) Q8 at various sequence identity cutoffs. To make comparisons easier in these plots, the results of Water, which gave the highest Q3 and Q8 measured by sequence alignments in Figure 2, are shown again for reference. The structure alignment methods applied in this experiment were FAST, TM-align, and SARST. All the Q3/8 values obtained with them are higher than those obtained with sequence alignments, especially at low identities. Compared with Q3, the differences between structural and sequence alignments were much more significant in Q8. Structural alignments were better than sequence alignments at detecting the residue-residue equivalences between structural homologs, even when the homologs had distant evolutionary relationships. Thus, structural alignments should be more suitable than sequence alignments for estimating the limit of SSP.
Figure 4
Secondary structural consistency for structural homologs of different structural classes or sizes. (a) The secondary structural consistency for homologs belonging to different structural classes and sharing <90% identities. (b) The consistency for homologs having different sizes and sharing <90% identities. (c) The consistency for homologs belonging to different structural classes and sharing <30% identities. (d) The consistency for homologs having different sizes and sharing <30% identities. In these plots, the secondary structural consistencies computed by sequence alignments (method: PSI-BLAST) and structural alignments (method: FAST) are drawn as blue and black bars, respectively; because the latter are all higher than the former, only black caps are visible. PSI-BLAST represents sequence alignment methods because most SSP algorithms depend on it. FAST represents structure alignment methods because it reported the highest Q3/8 between SCOP homologs (Figure 3).
Figure 5
Distribution of secondary structural consistency between random protein pairs. One million rounds of random pairing were performed to compute the distribution of secondary structural consistencies between proteins to mark the lower bound of SSP accuracy. Two proteins were randomly selected from the SCOP-2.07 dataset in each round, and the secondary structural consistency between them was measured with the Q and SOV scores. The average SOV3 of these random pairs was lower than the average Q3 (SOV3 = 32% versus Q3 = 35%), different from the results of the first paper of SOV (SOV3 = 37% versus Q3 = 35%) [62], because the algorithm of SOV applied in this study was an updated version (v’99) [66], the value of which is lower than the first defined SOV (v’94; see Table I of [66] for comparisons).
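The random-pairing experiment in the Figure 5 caption can be sketched in Python as below. The caption does not state how residues of two unrelated proteins were matched, so this sketch simply compares the two secondary structure strings position by position over the length of the shorter one; that pairing rule, the function name, and the toy inputs are assumptions, and only a Q-style score is computed (SOV is not implemented here).

```python
import random


def random_pair_mean_q(ss_strings, rounds, seed=0):
    """Mean Q-style consistency (%) over randomly drawn pairs of proteins.

    ss_strings : list of non-empty secondary structure strings (at least two)
    rounds     : number of random pairs to draw
    seed       : seed for a private random generator, for reproducibility
    Assumption: residues are paired by position over the shorter string.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        a, b = rng.sample(ss_strings, 2)          # two distinct entries
        n = min(len(a), len(b))
        same = sum(x == y for x, y in zip(a, b))  # zip stops at the shorter string
        scores.append(100.0 * same / n)
    return sum(scores) / len(scores)


# Toy dataset of made-up three-state strings.
toy_set = ["HHHHCCEEEE", "CCHHHHHHCC", "EEEECCEEEECC", "CCCCCCCC"]
baseline = random_pair_mean_q(toy_set, rounds=1000)
```

With a real dataset, the resulting mean plays the role of a lower bound: a predictor scoring near it would be doing no better than comparing unrelated proteins.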
Figure 6
Accuracies of state-of-the-art SSP methods evaluated with datasets sharing <90% sequence identities. This test utilized the independent query sets prepared by Yang et al. for SSP evaluations [1] and the SCOP720 query set. The source of reference sequences for PSSM generation was UniRef90-2015. The blue and black bars indicate the SSP upper limits estimated at a <90% sequence identity cutoff by sequence and structural alignments, respectively. The sequence alignment algorithm used to make those estimates was PSI-BLAST, the PSSM generator of all the tested SSP methods. These results reveal that, when the homology of the SSP query and reference sequences was <90% identity, the accuracy of current SSP methods has not reached the limit estimated by sequence alignments of homologs sharing <90% sequence identities.
Figure 7
The accuracy of the current SSP methodology applied with restricted homology between developmental and operational datasets. To avoid the side effects of sequence redundancy, which might cause information leakage and overfitting for a predictor, computational biology studies typically restrict protein datasets with a sequence homology cutoff. We speculated that the SSP limits estimated from homologs with different homology cutoffs (blue curves, adopted from the result of PSI-BLAST in Figure 2c and Table S3) would mark the upper bound of accuracy for SSP methods trained and operated with datasets in which the homology of proteins is restricted. The protein materials used here were prepared so that the sequence identities within and between the query and PSSM reference sets were lower than the given cutoffs. As expected, as the homology decreased, the accuracy of SSP models trained and operated with such proteins decreased. This experiment was repeated 10 times by random sampling.
Figure 8
Accuracy of state-of-the-art SSP methods for proteins of different (a) structural classes or (b) sizes. The upper limits of SSP accuracy estimated by sequence and structural alignments at the 90% identity cutoff of homologs are indicated by blue and black bars, respectively. No method exceeded the estimated limits in any class or size group, and most methods met the greatest challenge in predicting all-beta proteins. SSP accuracy decreased slightly as the protein size increased. Since the same tendencies were observed in three- and eight-state predictions, this figure only displays Q3 and SOV3. See Figure S1 for Q8 and SOV8 results.
Figure 9
Secondary structural consistency measured by PSI-BLAST and BLAST with different word sizes. (a) Secondary structural consistencies between SCOP family homologs measured by PSI-BLAST at different identity levels. As illustrated by the color codes, word sizes 3 and 2 were applied. The default word size of PSI-BLAST was 3, but changing it to 2 increased the measured secondary structural consistency between homologs, especially at low identities. (b) Secondary structural consistencies between SCOP family homologs measured by the classic BLAST at different identity levels. The BLAST accepted word sizes 3 and 2 in the database searching mode (blastall) and word size 1 in the pairwise alignment mode (bl2seq). A small word size notably increased the measured secondary structural consistency. (c), (d) Secondary structural consistencies between SCOP family homologs measured by (PSI-)BLAST at different identity cutoffs. Results obtained according to identity cutoffs make the effects of word size more observable.
Figure 10
Summary of the estimated limits of SSP accuracy. Both the lower limits (light blue strips) and upper limits of three-/eight-state SSP accuracies are illustrated. The practical upper limits of accuracies (blue strips to the right) were estimated by PSI-BLAST sequence alignments with the word size 2 (see Figure 9 and Table S3), and the theoretical limits (black strips to the right) were estimated by structural alignments (Figure 3 and Table S3).

References

    1. Yang Y., Gao J., Wang J., Heffernan R., Hanson J., Paliwal K., Zhou Y. Sixty-five years of the long march in protein secondary structure prediction: The final stretch? Brief. Bioinform. 2018;19:482–494. doi: 10.1093/bib/bbw129.
    2. Li B., Krishnan V.G., Mort M.E., Xin F., Kamati K.K., Cooper D.N., Mooney S.D., Radivojac P. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009;25:2744–2750. doi: 10.1093/bioinformatics/btp528.
    3. Folkman L., Yang Y., Li Z., Stantic B., Sattar A., Mort M., Cooper D.N., Liu Y., Zhou Y. DDIG-in: Detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels. Bioinformatics. 2015;31:1599–1606. doi: 10.1093/bioinformatics/btu862.
    4. Zhao H., Yang Y., Lin H., Zhang X., Mort M., Cooper D.N., Liu Y., Zhou Y. DDIG-in: Discriminating between disease-associated and neutral non-frameshifting micro-indels. Genome Biol. 2013;14:R23. doi: 10.1186/gb-2013-14-3-r23.
    5. Do C.B., Mahabhashyam M.S., Brudno M., Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–340. doi: 10.1101/gr.2821705.