Mol Biol Evol. 2010 Aug;27(8):1759-67.

doi: 10.1093/molbev/msq066. Epub 2010 Mar 5.

An alignment confidence score capturing robustness to guide tree uncertainty

Osnat Penn¹, Eyal Privman, Giddy Landan, Dan Graur, Tal Pupko

PMID: 20207713
PMCID: PMC2908709
DOI: 10.1093/molbev/msq066

Osnat Penn et al. Mol Biol Evol. 2010 Aug.

Affiliation

¹ Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel.

Abstract

Multiple sequence alignment (MSA) is the basis for a wide range of comparative sequence analyses from molecular phylogenetics to 3D structure prediction. Sophisticated algorithms have been developed for sequence alignment, but in practice, many errors can be expected and extensive portions of the MSA are unreliable. Hence, it is imperative to understand and characterize the various sources of errors in MSAs and to quantify site-specific alignment confidence. In this paper, we show that uncertainties in the guide tree used by progressive alignment methods are a major source of alignment uncertainty. We use this insight to develop a novel method for quantifying the robustness of each alignment column to guide tree uncertainty. We build on the widely used bootstrap method for perturbing the phylogenetic tree. Specifically, we generate a collection of trees and use each as a guide tree in the alignment algorithm, thus producing a set of MSAs. We next test the consistency of every column of the MSA obtained from the unperturbed guide tree with respect to the set of MSAs. We name this measure the "GUIDe tree based AligNment ConfidencE" (GUIDANCE) score. Using the BAliBASE (Benchmark Alignment dataBASE) benchmark as well as simulation studies, we show that GUIDANCE scores accurately identify errors in MSAs. Additionally, we compare our results with the previously published Heads-or-Tails (HoT) score and show that the GUIDANCE score is a better predictor of unreliably aligned regions.
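The column-consistency test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the base MSA and the perturbed MSAs have already been produced by some aligner, that each MSA is a list of equal-length gapped strings with `-` as the gap character, and that all MSAs list the same sequences in the same row order. The function names `msa_columns` and `column_consistency` are hypothetical, and scoring a column as the fraction of perturbed MSAs that reproduce it exactly is one simple reading of "consistency", chosen here for illustration.

```python
from typing import List, Optional, Tuple

# A column is encoded by the index of the residue each sequence contributes
# (position in the ungapped sequence), or None where the sequence has a gap.
Column = Tuple[Optional[int], ...]


def msa_columns(msa: List[str]) -> List[Column]:
    """Encode every column of a gapped alignment in an alignment-independent way."""
    counters = [0] * len(msa)  # next residue index for each row
    columns: List[Column] = []
    for col in zip(*msa):
        encoded = []
        for row, ch in enumerate(col):
            if ch == "-":
                encoded.append(None)
            else:
                encoded.append(counters[row])
                counters[row] += 1
        columns.append(tuple(encoded))
    return columns


def column_consistency(base: List[str], perturbed: List[List[str]]) -> List[float]:
    """Fraction of perturbed MSAs that contain each base column unchanged.

    Illustrative assumption: a column counts as reproduced only if the
    identical set of residues (and gaps) appears as one column.
    """
    if not perturbed:
        raise ValueError("at least one perturbed MSA is required")
    perturbed_sets = [set(msa_columns(m)) for m in perturbed]
    return [
        sum(col in cols for cols in perturbed_sets) / len(perturbed_sets)
        for col in msa_columns(base)
    ]


if __name__ == "__main__":
    base = ["AC-GT", "ACTGT"]
    perturbed = [["ACG-T", "ACTGT"], ["AC-GT", "ACTGT"]]
    print(column_consistency(base, perturbed))
```

Encoding residues by their index in the ungapped sequence makes columns comparable across alignments of different lengths, which is why the sketch does not compare column positions directly.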

Figures

**FIG. 1.**
The “GUIDANCE” measure. A base MSA is produced by any progressive alignment method. Bootstrap NJ trees are reconstructed and given as guide trees to the progressive alignment program, producing a set of perturbed MSAs. Sum-of-pairs scores are then calculated by comparing each perturbed MSA with the base MSA and are color coded on each residue in the alignment.
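The sum-of-pairs comparison in this caption can be sketched in Python as follows. This is a hedged illustration, not the published procedure: the function names (`aligned_pairs`, `pair_scores`, `residue_scores`) are hypothetical, MSAs are assumed to be lists of equal-length gapped strings with `-` as the gap character and a shared row order, and averaging pair scores to obtain a per-residue value is an assumption made here because the caption does not spell out the aggregation.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# (row_i, residue_i, row_j, residue_j) with row_i < row_j; residue indices
# count positions in the ungapped sequences.
Pair = Tuple[int, int, int, int]


def aligned_pairs(msa: List[str]) -> Set[Pair]:
    """Collect every pair of residues that share a column in the alignment."""
    counters = [0] * len(msa)
    pairs: Set[Pair] = set()
    for col in zip(*msa):
        present = []
        for row, ch in enumerate(col):
            if ch != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        for (ri, xi), (rj, xj) in combinations(present, 2):
            pairs.add((ri, xi, rj, xj))
    return pairs


def pair_scores(base: List[str], perturbed: List[List[str]]) -> Dict[Pair, float]:
    """For each residue pair of the base MSA, the fraction of perturbed MSAs keeping it."""
    if not perturbed:
        raise ValueError("at least one perturbed MSA is required")
    perturbed_pairs = [aligned_pairs(m) for m in perturbed]
    return {
        pair: sum(pair in s for s in perturbed_pairs) / len(perturbed_pairs)
        for pair in aligned_pairs(base)
    }


def residue_scores(base: List[str], perturbed: List[List[str]]) -> Dict[Tuple[int, int], float]:
    """Average the pair scores over all pairs a residue takes part in (assumed aggregation)."""
    totals: Dict[Tuple[int, int], float] = {}
    counts: Dict[Tuple[int, int], int] = {}
    for (ri, xi, rj, xj), score in pair_scores(base, perturbed).items():
        for key in ((ri, xi), (rj, xj)):
            totals[key] = totals.get(key, 0.0) + score
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}
```

Residues that face only gaps in the base MSA take part in no pairs, so they receive no score in this sketch; how such residues should be colored is left open here.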

**FIG. 2.**
Agreement between MSAs built based on perturbed bootstrap trees and the base MSA for MAFFT and ClustalW alignments of *Drosophila melanogaster* chemoreceptor sequences. Box plots summarize medians, quartiles, and range of (A) column scores and (B) sum-of-pairs scores.

**FIG. 3.**
Accuracy of GUIDANCE scores in identifying alignment errors. ROC curves for HoT scores (red) and GUIDANCE scores (blue) of aligned residue pairs relative to the BAliBASE benchmark (A) and the simulation benchmark (B).
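A ROC curve of the kind shown in this figure can be computed from per-pair confidence scores and benchmark labels. The Python sketch below is a generic ROC construction, not the authors' evaluation code. It assumes that a "positive" is a residue pair that is correctly aligned according to the benchmark and that a pair is predicted reliable when its score is at or above a threshold; the function names `roc_points` and `auc` are hypothetical.

```python
from typing import List, Tuple


def roc_points(scores: List[float], correct: List[bool]) -> List[Tuple[float, float]]:
    """Return (false-positive rate, true-positive rate) points, sweeping the threshold downward."""
    positives = sum(correct)
    negatives = len(correct) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both correct and incorrect pairs are required")
    ranked = sorted(zip(scores, correct), key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all pairs tied at this score so ties yield a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

With this construction, a score that ranks every correct pair above every incorrect pair gives an area of 1, and comparing two scoring methods amounts to comparing their curves on the same labeled pairs.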

**FIG. 4.**
An example from the simulation benchmark. Distribution of GUIDANCE column scores (blue) compared with HoT scores (red) and the actual alignment accuracy (green) in the first 260 columns of a typical simulated alignment.

**FIG. 5.**
Venn diagram of alignment error detection by the GUIDANCE and HoT scores. A total of 1,914,804 incorrectly aligned residue pairs in the BAliBASE benchmark were classified as detected by either method if their confidence score was less than 1. GUIDANCE detected 95.9% of the errors, whereas HoT detected less than 87%, and the HoT-detected errors are nearly a subset of the GUIDANCE-detected errors.

**FIG. 6.**
Comparison with Gblocks. The false-positive and true-positive rates of Gblocks “stringent” (red) and “relaxed” (green) parameter sets in comparison with a ROC curve for GUIDANCE column scores (blue) for the simulation benchmark.

**FIG. 7.**
Color-coded GUIDANCE scores for *Drosophila melanogaster* chemoreceptor sequences. A portion of the MSA is presented (columns 757–875 of 32 sequences). Confidently aligned residues are colored in shades of magenta and pink, whereas uncertain residues are colored in shades of blue. GUIDANCE column scores are plotted below the alignment.

Cited by

Early bioenergetic evolution.
Sousa FL, Thiergart T, Landan G, Nelson-Sathi S, Pereira IA, Allen JF, Lane N, Martin WF. Philos Trans R Soc Lond B Biol Sci. 2013 Jun 10;368(1622):20130088. doi: 10.1098/rstb.2013.0088. Print 2013 Jul 19. PMID: 23754820. Free PMC article. Review.
Improving multiple sequence alignment by using better guide trees.
Zhan Q, Ye Y, Lam TW, Yiu SM, Wang Y, Ting HF. BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S4. doi: 10.1186/1471-2105-16-S5-S4. Epub 2015 Mar 18. PMID: 25859903. Free PMC article.
Alignment-Integrated Reconstruction of Ancestral Sequences Improves Accuracy.
Aadland K, Kolaczkowski B. Genome Biol Evol. 2020 Sep 1;12(9):1549-1565. doi: 10.1093/gbe/evaa164. PMID: 32785673. Free PMC article.
Molecular evolution of juvenile hormone esterase-like proteins in a socially exchanged fluid.
LeBoeuf AC, Cohanim AB, Stoffel C, Brent CS, Waridel P, Privman E, Keller L, Benton R. Sci Rep. 2018 Dec 13;8(1):17830. doi: 10.1038/s41598-018-36048-1. PMID: 30546082. Free PMC article.
Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD.
Grigorjew A, Gynter A, Dias FHC, Buchfink B, Drost HG, Tomescu AI. Genome Biol. 2023 Jul 17;24(1):168. doi: 10.1186/s13059-023-03008-6. PMID: 37461051. Free PMC article.
