Protein Sci. 2002 Jul;11(7):1702-13.
doi: 10.1110/ps.4820102.

In search for more accurate alignments in the twilight zone

Lukasz Jaroszewski et al. Protein Sci. 2002 Jul.

Abstract

A major bottleneck in comparative modeling is the alignment quality; this is especially true for proteins whose distant relationships could be reliably recognized only by recent advances in fold recognition. The best algorithms excel in recognizing distant homologs but often produce incorrect alignments for over 50% of protein pairs in large fold-prediction benchmarks. The alignments obtained by sequence-sequence or sequence-structure matching algorithms differ significantly from the structural alignments. To study this problem, we developed a simplified method to explicitly enumerate all possible alignments for a pair of proteins. This allowed us to estimate the number of significantly different alignments for a given scoring method that score better than the structural alignment. Using several examples of distantly related proteins, we show that for standard sequence-sequence alignment methods, the number of significantly different alignments is usually large, often about 10^10 alternatives. This number decreases when the alignment method is improved, but it is still too large for the brute force enumeration approach. More effective strategies were needed, so we evaluated and compared two well-known approaches for searching the space of suboptimal alignments. We combined their best features and produced a hybrid method, which yielded alignments that surpassed the original alignments for about 50% of protein pairs with minimal computational effort.
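To give a feel for why brute force enumeration is impractical, the following minimal sketch counts every possible global alignment of two sequences as a lattice path built from three moves (aligned pair, gap in one sequence, gap in the other). This is a standard combinatorial count, not the simplified enumeration method of the paper, and it counts all alignments rather than only those that score better than the structural alignment. The function name `count_alignments` is hypothetical.

```python
def count_alignments(m: int, n: int) -> int:
    """Count global alignments of two sequences of lengths m and n.

    Assumption: an alignment is a path from (0, 0) to (m, n) using three
    moves: align a residue pair, insert a gap in the second sequence, or
    insert a gap in the first sequence. Paths that differ only in the order
    of adjacent gaps are counted separately.
    """
    # prev[j] is the count for cell (i - 1, j); border cells have one path.
    prev = [1] * (n + 1)
    for _ in range(1, m + 1):
        curr = [1] * (n + 1)
        for j in range(1, n + 1):
            # Sum over the three moves that can end at this cell.
            curr[j] = prev[j] + curr[j - 1] + prev[j - 1]
        prev = curr
    return prev[n]


if __name__ == "__main__":
    for length in (5, 20, 50, 100):
        total = count_alignments(length, length)
        print(f"length {length}: about 10^{len(str(total)) - 1} alignments")
```

The count grows exponentially with sequence length, which is consistent with the abstract's point that search strategies over suboptimal alignments are needed instead of exhaustive enumeration.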

Figures

Fig. 1.
LGA (Zemla et al. 1999) structural alignments of the models submitted by the predictors with the real structures of the two CASP4 (Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction) targets. The discrepancies between these alignments reflect the discrepancies between the alignments used for homology modeling. Real structure (the case of 100% correct prediction) would be the diagonal straight line on this plot. (A) T0117 (AF185268) is deoxyribonucleoside kinase from Drosophila melanogaster. (B) T0109 (P45340) is oligoribonuclease from Haemophilus influenzae.
Fig. 2.
The distribution of discrepancies between the different alignments as a function of sequence identity. Alignment discrepancy is measured as the percentage of differently aligned residues in the shorter of two alignments. The discrepancies have been calculated for a comprehensive benchmark of protein pairs consisting of 742 protein pairs selected from the Structural Classification of Proteins (SCOP) database. (A) PSI-BLAST (Altschul et al. 1997) alignments versus FASTA (Pearson and Lipman 1988) alignments. (B) PSI-BLAST alignments versus CE (Shindyalov and Bourne 1998) structural alignments.
Fig. 3.
Fold and Function Assignment System (FFAS) similarity matrix (Rychlewski et al. 2000) calculated for 1r69 and 1lccA sequences and presented as a surface plot (blue colors mean higher similarity; red colors mean lower similarity). (A similarity matrix is a matrix describing a similarity score assigned to each pair of potentially aligned residues. Here, the X-axis corresponds to the query sequence and the Y-axis corresponds to the target sequence.) The picture illustrates an obvious discrepancy between the C-terminal fragments of the best-scoring FFAS alignment (shown as a black path on the A1 matrix surface) and the CE structural alignment (shown in blue). The best suboptimal alignment (shown in pink) overlaps with 90% of the structural alignment. Root mean square deviation values of the FFAS alignment, the best suboptimal alignment, and the structural alignment are 3.6, 2.6, and 2.1 Å, respectively. All three alignments correctly assign the second and third helices of 1r69 to the first and second helices of 1lccA, but the lowest-scoring FFAS alignment incorrectly embraces the C-terminal part of the last helix from 1lccA. 1lccA is the N-terminal domain of the Lac repressor (LacR) from Escherichia coli. 1r69 is the DNA-binding domain of the C1 repressor from E. coli-derived Phage 434. Both proteins belong to the same structural superfamily in the SCOP database.
Fig. 4.
The subsets of suboptimal alignments as explored with the parametric method (circles) and the iterative method (dots). In addition, CE structural alignment is shown (black line).
Fig. 5.
Applying the suboptimal alignment calculations. This graph illustrates the discrepancies between the original FFAS alignments and the CE structural alignments. The best suboptimal alignment is also shown in the graph. (A) 1bbt is foot-and-mouth disease virus protein; 1smv is sesbania mosaic virus coat protein. (B) 1bdm is malate dehydrogenase; 1dih is dihydrodipicolinate reductase.
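The alignment discrepancy defined in the Fig. 2 caption (the percentage of differently aligned residues in the shorter of two alignments) can be sketched as follows. This is one possible reading of that definition, not the authors' code: it assumes each alignment is given as a collection of `(i, j)` index pairs of aligned residues, and the function name `alignment_discrepancy` is hypothetical.

```python
def alignment_discrepancy(aln_a, aln_b):
    """Percentage of residue pairs in the shorter alignment absent from the other.

    Assumption: each alignment is an iterable of (i, j) tuples, where i and j
    are residue indices in the first and second protein.
    """
    pairs_a, pairs_b = set(aln_a), set(aln_b)
    # Normalize by the alignment with fewer aligned pairs.
    shorter, longer = sorted((pairs_a, pairs_b), key=len)
    if not shorter:
        raise ValueError("both alignments must contain at least one aligned pair")
    differing = len(shorter - longer)
    return 100.0 * differing / len(shorter)


if __name__ == "__main__":
    sequence_based = [(0, 0), (1, 1), (2, 3)]
    structure_based = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(alignment_discrepancy(sequence_based, structure_based))
```

Under this reading, identical alignments give 0 and alignments that share no aligned pair give 100.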

References

    1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
    2. Berman, H.M., Bhat, T.N., Bourne, P.E., Feng, Z., Gilliland, G., Weissig, H., and Westbrook, J. 2000. The Protein Data Bank and the challenge of structural genomics. Nat. Struct. Biol. 7: 957–959.
    3. Burley, S.K., Almo, S.C., Bonanno, J.B., Capel, M., Chance, M.R., Gaasterland, T., Lin, D., Sali, A., Studier, F.W., and Swaminathan, S. 1999. Structural genomics: Beyond the human genome project. Nat. Genet. 23: 151–157.
    4. CASP4. Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction. 2000. Asilomar, Pacific Grove, CA.
    5. Chiche, L., Gregoret, L.M., Cohen, F.E., and Kollman, P.A. 1990. Protein model structure evaluation using the solvation free energy of folding. Proc. Natl. Acad. Sci. 87: 3240–3243.
