Mol Biol Evol. 2010 Aug;27(8):1759-67.
doi: 10.1093/molbev/msq066. Epub 2010 Mar 5.

An alignment confidence score capturing robustness to guide tree uncertainty

Osnat Penn et al. Mol Biol Evol. 2010 Aug.

Abstract

Multiple sequence alignment (MSA) is the basis for a wide range of comparative sequence analyses from molecular phylogenetics to 3D structure prediction. Sophisticated algorithms have been developed for sequence alignment, but in practice, many errors can be expected and extensive portions of the MSA are unreliable. Hence, it is imperative to understand and characterize the various sources of errors in MSAs and to quantify site-specific alignment confidence. In this paper, we show that uncertainties in the guide tree used by progressive alignment methods are a major source of alignment uncertainty. We use this insight to develop a novel method for quantifying the robustness of each alignment column to guide tree uncertainty. We build on the widely used bootstrap method for perturbing the phylogenetic tree. Specifically, we generate a collection of trees and use each as a guide tree in the alignment algorithm, thus producing a set of MSAs. We next test the consistency of every column of the MSA obtained from the unperturbed guide tree with respect to the set of MSAs. We name this measure the "GUIDe tree based AligNment ConfidencE" (GUIDANCE) score. Using the BAliBASE (Benchmark Alignment dataBASE) benchmark as well as simulation studies, we show that GUIDANCE scores accurately identify errors in MSAs. Additionally, we compare our results with the previously published Heads-or-Tails score and show that the GUIDANCE score is a better predictor of unreliably aligned regions.
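The column consistency test described in the abstract can be illustrated with a minimal sketch. It assumes one simple reading of "consistency": a base column counts as supported by a perturbed MSA only if that MSA contains a column aligning exactly the same residues. The function names and the toy alignments are hypothetical, and the exact definition used in the paper may differ.

```python
def column_signatures(msa):
    """Describe each column by the residue index of every sequence (None for a gap)."""
    counters = [0] * len(msa)  # next residue index for each sequence
    signatures = []
    for col in range(len(msa[0])):
        signature = []
        for s, row in enumerate(msa):
            if row[col] == "-":
                signature.append(None)
            else:
                signature.append(counters[s])
                counters[s] += 1
        signatures.append(tuple(signature))
    return signatures


def column_scores(base_msa, perturbed_msas):
    """Fraction of perturbed MSAs that reproduce each base column exactly."""
    base = column_signatures(base_msa)
    perturbed = [set(column_signatures(m)) for m in perturbed_msas]
    return [sum(sig in p for p in perturbed) / len(perturbed) for sig in base]


# Toy data (not from the paper): a base MSA and two perturbed MSAs.
base = ["AC-G", "A-CG"]
perturbed = [["AC-G", "A-CG"], ["ACG", "ACG"]]
print(column_scores(base, perturbed))
```

In a real run the perturbed MSAs would come from re-running the aligner with each bootstrap guide tree; that step is left out here because it depends on the alignment program.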

Figures

FIG. 1.
The “GUIDANCE” measure. A base MSA is produced by any progressive alignment method. Bootstrap NJ trees are reconstructed and given as guide trees to the progressive alignment program, producing a set of perturbed MSAs. Sum-of-pairs scores are then calculated by comparing each perturbed MSA with the base MSA and are color coded on each residue in the alignment.
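A per-residue score of the kind that Fig. 1 describes could be derived as in the sketch below. It assumes that a residue pair aligned in the base MSA is scored by the fraction of perturbed MSAs that also align it, and that a residue's score is the mean over its pairs. The averaging step, the function names, and the toy alignments are assumptions for illustration, not the published definitions.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Set of residue pairs ((seq, pos), (seq, pos)) that share a column."""
    counters = [0] * len(msa)  # next residue index for each sequence
    pairs = set()
    for col in range(len(msa[0])):
        residues = []
        for s, row in enumerate(msa):
            if row[col] != "-":
                residues.append((s, counters[s]))
                counters[s] += 1
        # Residues are listed in sequence order, so each pair has one canonical form.
        pairs.update(combinations(residues, 2))
    return pairs


def residue_pair_scores(base_msa, perturbed_msas):
    """Fraction of perturbed MSAs in which each base residue pair is still aligned."""
    perturbed = [aligned_pairs(m) for m in perturbed_msas]
    return {
        pair: sum(pair in p for p in perturbed) / len(perturbed)
        for pair in aligned_pairs(base_msa)
    }


def residue_scores(pair_scores):
    """Mean pair score of every residue that takes part in at least one pair."""
    collected = {}
    for (a, b), score in pair_scores.items():
        collected.setdefault(a, []).append(score)
        collected.setdefault(b, []).append(score)
    return {res: sum(v) / len(v) for res, v in collected.items()}


# Toy data (not from the paper).
base = ["ACG", "A-G", "AC-"]
perturbed = [["ACG", "A-G", "AC-"], ["ACG", "AG-", "AC-"]]
scores = residue_scores(residue_pair_scores(base, perturbed))
```

The resulting dictionary maps (sequence index, residue index) to a value between 0 and 1, which is what a color scale over the alignment would need.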
FIG. 2.
Agreement between MSAs built based on perturbed bootstrap trees and the base MSA for MAFFT and ClustalW alignments of Drosophila melanogaster chemoreceptor sequences. Box plots summarize medians, quartiles, and range of (A) column scores and (B) sum-of-pairs scores.
FIG. 3.
Accuracy of GUIDANCE scores in identifying alignment errors. ROC curves for HoT scores (red) and GUIDANCE scores (blue) of aligned residue pairs relative to the BAliBASE benchmark (A) and the simulation benchmark (B).
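For readers unfamiliar with how ROC curves such as those in Fig. 3 are built, here is a minimal sketch. It assumes each aligned residue pair has a confidence score and a benchmark label saying whether the pair is correct, and it treats "correct" as the positive class. That choice of positive class, the function names, and the toy values are assumptions; the paper may frame the classification differently.

```python
def roc_points(scores, is_correct):
    """(false-positive rate, true-positive rate) at every distinct score threshold.

    A pair is predicted correct when its score is >= the threshold.
    Requires at least one correct and one incorrect pair.
    """
    positives = sum(is_correct)
    negatives = len(is_correct) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, c in zip(scores, is_correct) if s >= threshold and c)
        fp = sum(1 for s, c in zip(scores, is_correct) if s >= threshold and not c)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by increasing x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data (not from the paper): four residue pairs with scores and labels.
toy_scores = [1.0, 0.9, 0.4, 0.2]
toy_labels = [True, True, False, False]
curve = roc_points(toy_scores, toy_labels)
print(curve, area_under_curve(curve))
```

A curve that rises toward the top-left corner means low-scoring pairs are mostly the incorrect ones, which is the property a confidence score needs.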
FIG. 4.
An example from the simulation benchmark. Distribution of GUIDANCE column scores (blue) compared with HoT scores (red) and the actual alignment accuracy (green) in the first 260 columns of a typical simulated alignment.
FIG. 5.
Venn diagram of alignment error detection by the GUIDANCE and HoT scores. A total of 1,914,804 incorrectly aligned residue pairs in the BAliBASE benchmark were classified as detected by either method if their confidence score was less than 1. GUIDANCE detected 95.9% of the errors, whereas HoT detected less than 87%, and the HoT-detected errors are nearly a subset of the GUIDANCE-detected errors.
FIG. 6.
Comparison with Gblocks. The false-positive and true-positive rates of Gblocks “stringent” (red) and “relaxed” (green) parameter sets in comparison with a ROC curve for GUIDANCE column scores (blue) for the simulation benchmark.
FIG. 7.
Color-coded GUIDANCE scores for Drosophila melanogaster chemoreceptor sequences. A portion of the MSA is presented (columns 757–875 of 32 sequences). Confidently aligned residues are colored in shades of magenta and pink, whereas uncertain residues are colored in shades of blue. GUIDANCE column scores are plotted below the alignment.
