In search for more accurate alignments in the twilight zone

Protein Sci. 2002 Jul;11(7):1702-13.

doi: 10.1110/ps.4820102

Lukasz Jaroszewski¹, Weizhong Li, Adam Godzik

Affiliation

¹ Program in Bioinformatics and Biological Complexity, The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA.

PMID: 12070323
PMCID: PMC2373660
DOI: 10.1110/ps.4820102

Abstract

A major bottleneck in comparative modeling is the alignment quality; this is especially true for proteins whose distant relationships could be reliably recognized only by recent advances in fold recognition. The best algorithms excel in recognizing distant homologs but often produce incorrect alignments for over 50% of protein pairs in large fold-prediction benchmarks. The alignments obtained by sequence-sequence or sequence-structure matching algorithms differ significantly from the structural alignments. To study this problem, we developed a simplified method to explicitly enumerate all possible alignments for a pair of proteins. This allowed us to estimate, for a given scoring method, the number of significantly different alignments that score better than the structural alignment. Using several examples of distantly related proteins, we show that for standard sequence-sequence alignment methods, the number of significantly different alignments is usually large, often about 10¹⁰ alternatives. This number decreases when the alignment method is improved, but it is still too large for the brute force enumeration approach. More effective strategies were needed, so we evaluated and compared two well-known approaches for searching the space of suboptimal alignments. We combined their best features and produced a hybrid method, which yielded alignments that surpassed the original alignments for about 50% of protein pairs with minimal computational effort.
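To give a feel for why brute-force enumeration is impractical, the following sketch counts every possible global alignment path for two sequences using the standard three-way recurrence (match, gap in one sequence, gap in the other). This is not the authors' simplified enumeration method, which counts only significantly different alignments that outscore the structural alignment; the function name `count_alignments` is hypothetical and the counting convention (every distinct ordering of gaps is a separate alignment) is an assumption.

```python
def count_alignments(n: int, m: int) -> int:
    """Count global alignment paths for sequences of length n and m.

    Assumption: each distinct sequence of match/gap steps counts as a
    separate alignment, which gives the Delannoy numbers.
    """
    # table[i][j] = number of paths from cell (0, 0) to cell (i, j);
    # the first row and column have exactly one path (gaps only).
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j - 1]  # residue i aligned to residue j
                + table[i - 1][j]    # residue i aligned to a gap
                + table[i][j - 1]    # residue j aligned to a gap
            )
    return table[n][m]


if __name__ == "__main__":
    for length in (5, 10, 20):
        print(length, count_alignments(length, length))
```

Under this convention the count already exceeds 10¹⁰ for two sequences of only 20 residues each, so any practical search has to restrict itself to a small, well-chosen subset of suboptimal alignments.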

Figures

**Fig. 1.**
LGA (Zemla et al. 1999) structural alignments of the models submitted by the predictors with the real structures of two CASP4 (Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction) targets. The discrepancies between these alignments reflect the discrepancies between the alignments used for homology modeling. The real structure (the case of a 100% correct prediction) would appear as a straight diagonal line on this plot. (A) T0117 (AF185268) is deoxyribonucleoside kinase from *Drosophila melanogaster*. (B) T0109 (P45340) is oligoribonuclease from *Haemophilus influenzae*.

**Fig. 2.**
The distribution of discrepancies between different alignments as a function of sequence identity. Alignment discrepancy is measured as the percentage of differently aligned residues in the shorter of the two alignments. The discrepancies were calculated for a comprehensive benchmark of 742 protein pairs selected from the Structural Classification of Proteins (SCOP) database. (A) PSI-BLAST (Altschul et al. 1997) alignments versus FASTA (Pearson and Lipman 1988) alignments. (B) PSI-BLAST alignments versus CE (Shindyalov and Bourne 1998) structural alignments.
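The discrepancy measure described in this caption can be sketched as follows. The representation of an alignment as a collection of `(i, j)` residue index pairs and the function name `alignment_discrepancy` are assumptions made for illustration; the caption does not specify how ties in alignment length or unaligned residues are handled.

```python
def alignment_discrepancy(a, b) -> float:
    """Percentage of aligned pairs in the shorter alignment that are
    not aligned the same way in the other alignment.

    a, b: iterables of (i, j) tuples, where residue i of the first
    protein is aligned to residue j of the second (assumed encoding).
    """
    set_a, set_b = set(a), set(b)
    # Use the alignment with fewer aligned pairs as the reference.
    if len(set_a) <= len(set_b):
        shorter, other = set_a, set_b
    else:
        shorter, other = set_b, set_a
    if not shorter:
        raise ValueError("alignments must contain at least one aligned pair")
    differing = len(shorter - other)
    return 100.0 * differing / len(shorter)


if __name__ == "__main__":
    sequence_based = [(0, 0), (1, 1), (2, 2), (3, 3)]
    structure_based = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5)]
    print(alignment_discrepancy(sequence_based, structure_based))
```

In the toy usage above, two of the four pairs in the shorter alignment are shifted relative to the other alignment, so half of its residues count as differently aligned.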

**Fig. 3.**
Fold and Function Assignment System (FFAS) similarity matrix (Rychlewski et al. 2000) calculated for the *1r69* and *1lccA* sequences and presented as a surface plot (blue colors mean higher similarity; red colors mean lower similarity). (A similarity matrix is a matrix describing a similarity score assigned to each pair of potentially aligned residues. Here, the X-axis corresponds to the query sequence and the Y-axis corresponds to the target sequence.) The picture illustrates an obvious discrepancy between the C-terminal fragments of the best-scoring FFAS alignment (shown as a black path on the A1 matrix surface) and the CE structural alignment (shown in blue). The best suboptimal alignment (shown in pink) overlaps with 90% of the structural alignment. Root mean square deviation values of the FFAS alignment, the best suboptimal alignment, and the structural alignment are 3.6, 2.6, and 2.1 Å, respectively. All three alignments correctly assign the second and third helices of *1r69* to the first and second helices of *1lccA*, but the lowest-scoring FFAS alignment incorrectly embraces the C-terminal part of the last helix from *1lccA*. *1lccA* is the N-terminal domain of the Lac repressor (*LacR*) from *Escherichia coli*. *1r69* is the DNA-binding domain of the C1 repressor from *E. coli*-derived *Phage 434*. Both proteins belong to the same structural superfamily in the SCOP database.

**Fig. 4.**
The subsets of suboptimal alignments as explored with the parametric method (circles) and the iterative method (dots). The CE structural alignment is also shown (black line).
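One common way to read "parametric" exploration is to rerun an optimal alignment under different parameter values and keep every distinct result. The sketch below does this for a plain Needleman-Wunsch alignment over a residue similarity matrix while varying a linear gap penalty. It is an illustration under those assumptions only: the function names `global_align` and `parametric_alignments`, the linear gap model, and the choice of the gap penalty as the varied parameter are all hypothetical and are not taken from the paper.

```python
def global_align(sim, gap):
    """Needleman-Wunsch over sim[i][j] with a linear gap penalty.

    Returns the aligned residue pairs as a tuple of (i, j) indices.
    """
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # align i with j
                score[i - 1][j] - gap,                    # gap in second sequence
                score[i][j - 1] - gap,                    # gap in first sequence
            )
    # Trace back from the bottom-right corner, preferring matches on ties.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return tuple(reversed(pairs))


def parametric_alignments(sim, gaps):
    """Map each distinct alignment to the first gap penalty producing it."""
    found = {}
    for gap in gaps:
        found.setdefault(global_align(sim, gap), gap)
    return found


if __name__ == "__main__":
    toy_sim = [[1, 3], [3, 1]]  # hypothetical 2x2 similarity matrix
    for alignment, gap in parametric_alignments(toy_sim, [0.1, 1.0]).items():
        print(gap, alignment)
```

Each distinct alignment found this way is a candidate that can be compared against a structural alignment; a cheap gap penalty and an expensive one pull the path through different regions of the similarity matrix.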

**Fig. 5.**
Applying the suboptimal alignment calculations. This graph illustrates the discrepancies between the original FFAS alignments and the CE structural alignments. The best suboptimal alignment is also shown in the graph. (A) *1bbt* is foot-and-mouth disease virus protein; *1smv* is sesbania mosaic virus coat protein. (B) *1bdm* is malate dehydrogenase; *1dih* is dihydrodipicolinate reductase.

Cited by

Improving the quality of protein structure models by selecting from alignment alternatives.
Sommer I, Toppo S, Sander O, Lengauer T, Tosatto SC. BMC Bioinformatics. 2006 Jul 27;7:364. doi: 10.1186/1471-2105-7-364. PMID: 16872519. Free PMC article.
Distance matrix-based approach to protein structure prediction.
Kloczkowski A, Jernigan RL, Wu Z, Song G, Yang L, Kolinski A, Pokarowski P. J Struct Funct Genomics. 2009 Mar;10(1):67-81. doi: 10.1007/s10969-009-9062-2. Epub 2009 Feb 18. PMID: 19224393. Free PMC article.
All are not equal: a benchmark of different homology modeling programs.
Wallner B, Elofsson A. Protein Sci. 2005 May;14(5):1315-27. doi: 10.1110/ps.041253405. PMID: 15840834. Free PMC article.
ProbCons: Probabilistic consistency-based multiple sequence alignment.
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. Genome Res. 2005 Feb;15(2):330-40. doi: 10.1101/gr.2821705. PMID: 15687296. Free PMC article.
Spatiotemporal control of spindle midzone formation by PRC1 in human cells.
Zhu C, Lau E, Schwarzenbacher R, Bossy-Wetzel E, Jiang W. Proc Natl Acad Sci U S A. 2006 Apr 18;103(16):6196-201. doi: 10.1073/pnas.0506926103. Epub 2006 Apr 7. PMID: 16603632. Free PMC article.

References

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25 3389–3402.
2. Berman, H.M., Bhat, T.N., Bourne, P.E., Feng, Z., Gilliland, G., Weissig, H., and Westbrook, J. 2000. The Protein Data Bank and the challenge of structural genomics. Nat. Struct. Biol. 7 957–959.
3. Burley, S.K., Almo, S.C., Bonanno, J.B., Capel, M., Chance, M.R., Gaasterland, T., Lin, D., Sali, A., Studier, F.W., and Swaminathan, S. 1999. Structural genomics: Beyond the human genome project. Nat. Genet. 23 151–157.
4. CASP4. Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction. 2000. Asilomar, Pacific Grove, CA.
5. Chiche, L., Gregoret, L.M., Cohen, F.E., and Kollman, P.A. 1990. Protein model structure evaluation using the solvation free energy of folding. Proc. Natl. Acad. Sci. 87 3240–3243.

Grants and funding

GM 60049/GM/NIGMS NIH HHS/United States
