Genome Biol. 2010;11(4):R37.
doi: 10.1186/gb-2010-11-4-r37. Epub 2010 Apr 6.

Phylogenetic assessment of alignments reveals neglected tree signal in gaps

Christophe Dessimoz et al. Genome Biol. 2010.

Abstract

Background: The alignment of biological sequences is of chief importance to most evolutionary and comparative genomics studies, yet the two main approaches used to assess alignment accuracy have flaws: reference alignments are derived from the biased sample of proteins with known structure, and simulated data lack realism.

Results: Here, we introduce tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference. We show that (i) the current belief that consistency-based alignments outperform scoring matrix-based alignments is misguided; (ii) gaps carry substantial phylogenetic signal, but are poorly exploited by most alignment and tree building programs; (iii) even so, excluding gaps and variable regions is detrimental; (iv) disagreement among alignment programs says little about the accuracy of resulting trees.

Conclusions: This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.

Figures

Figure 1
Schematic of the phylogeny-based tests of alignment accuracy. Both tests are based on large-scale genomic data: (a) The species-tree discordance test samples sets of orthologs inferred by OMA among species with a well-accepted phylogeny (Additional file 1, Figure S1). Each sample is aligned by the different packages. The resulting alignments are evaluated by reconstructing trees from them, and comparing with the reference topology. All else being equal, trees from better alignment packages show higher average congruence with the reference topology. (b) The minimum duplication test follows a similar idea, but differs from the first test in two ways. First, it samples sets of homologs rather than the more specific orthologs. Second, the evaluation is based on a parsimony argument rather than knowledge about the phylogeny of the species: all else being equal, alignments yielding trees with fewer duplication nodes on average are more accurate.
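The two scoring ideas in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes trees are given as nested tuples of leaf names, that congruence is measured as the fraction of non-trivial reference splits recovered by the inferred tree (the caption does not say which congruence measure was used), and that duplication nodes are detected with a simple species-overlap rule on a rooted gene tree. The function names and the `species_of` callback are hypothetical.

```python
def splits(tree):
    """Return the non-trivial bipartitions (splits) of a nested-tuple tree."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) > 1 and len(other) > 1:
            found.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return found


def split_congruence(inferred, reference):
    """Fraction of reference splits that also occur in the inferred tree."""
    ref = splits(reference)
    if not ref:
        return 1.0
    return len(ref & splits(inferred)) / len(ref)


def count_duplications(gene_tree, species_of):
    """Count internal nodes whose child subtrees share at least one species.

    Assumption: the gene tree is rooted, and species overlap between
    children is taken as evidence of a duplication.
    """
    count = 0

    def visit(node):
        nonlocal count
        if isinstance(node, str):
            return {species_of(node)}
        seen = set()
        overlap = False
        for child in node:
            child_species = visit(child)
            if seen & child_species:
                overlap = True
            seen |= child_species
        if overlap:
            count += 1
        return seen

    visit(gene_tree)
    return count


reference = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
congruence = split_congruence(inferred, reference)

gene_tree = ((("hs_1", "mm_1"), ("hs_2", "mm_2")), "dr_1")
duplications = count_duplications(gene_tree, lambda name: name.split("_")[0])
```

Under these assumptions, averaging `split_congruence` over many ortholog samples would rank alignment packages in the first test, and averaging `count_duplications` over many homolog samples would rank them in the second.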
Figure 2
Comparison of alignment methods. Assessment of various alignment methods under default parameters using (a) the species-tree discordance and (b) the minimum duplication tests, on eukaryotic data. Consistency-based alignment methods do not improve over scoring matrix-based methods. The relative performance of alignment programs is more variable for nucleotide data than for amino-acid data. On amino-acid data, Mafft-FFT-NS-2, DiAlign TX and Prank were never outperformed; on nucleotide data, Mafft L-INS-i (right column) was never outperformed (see also Additional file 1, Figure S6). Average compute times (per alignment) are plotted as triangles (amino acids) and circles (nucleotides). Error bars correspond to ± 1 s.d. A significant difference from the best alignment program is denoted with a minus symbol at the base of the relevant bars (two-sided Wilcoxon test, P < 0.01).
Figure 3
Phylogenetic signal of gaps. (a) Assessment of gap accuracy under default parameters using the species-tree discordance test with parsimony trees on presence/absence patterns of gap characters in amino-acid alignments. By taking gap information into account, this test demonstrates that the gap placement of Prank is significantly better than that of the other alignment methods. This cannot be observed either with standard tree building methods (Figure 2) or with structure-based benchmarks. Error bars correspond to ± 1 s.d. A significant difference from Prank is denoted with a minus symbol at the base of the relevant bars (two-sided Wilcoxon test, P < 0.01). (b) Accuracy of maximum likelihood (ML) trees on amino-acid substitution patterns versus parsimony on binary gap presence/absence characters, on fungal data. The phylogenetic signal of gaps inferred by Prank increases with divergence. For distant sequences, the proportion of correctly inferred splits from gaps alone is close to that from amino-acid substitutions by ML. Thus, tree building methods could capture up to twice as much phylogenetic signal from the same data. Moreover, note that the crude approach used here to infer the gap trees likely understates the potential of gap patterns.
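As an illustration of what "binary gap presence/absence characters" can look like, here is a minimal Python sketch. It assumes a deliberately simple per-column coding (1 for a gap, 0 for a residue), keeps only parsimony-informative columns, and collapses runs of adjacent identical patterns into one character. The authors' exact coding scheme is not given in the caption and may differ; the function name `gap_characters` is hypothetical.

```python
def gap_characters(alignment, gap="-"):
    """Encode an alignment as binary gap presence/absence characters.

    alignment: dict mapping sequence name -> aligned sequence string.
    Returns a dict mapping sequence name -> string of '0'/'1' characters.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    kept = []
    previous = None
    for column in zip(*rows):
        pattern = tuple(1 if residue == gap else 0 for residue in column)
        n_gap = sum(pattern)
        # Parsimony-informative: each state occurs in at least two sequences.
        informative = min(n_gap, len(pattern) - n_gap) >= 2
        # Assumption: adjacent identical patterns stem from one indel event.
        if informative and pattern != previous:
            kept.append(pattern)
        previous = pattern

    return {
        name: "".join(str(pattern[i]) for pattern in kept)
        for i, name in enumerate(names)
    }


alignment = {
    "s1": "AC--GT",
    "s2": "AC--GT",
    "s3": "ACTTGT",
    "s4": "A-TTGT",
    "s5": "ACTTG-",
}
encoded = gap_characters(alignment)
```

In this toy alignment only the two-column gap shared by `s1` and `s2` survives as a character; the single-sequence gaps in `s4` and `s5` are uninformative for parsimony. The resulting 0/1 matrix could then be passed to any parsimony program that accepts binary characters.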
Figure 4
Effect of excluding gaps and variable regions. The plot shows the effect of filtering on the minimum duplication test with back-translated, fungal amino-acid alignments. Removing gapped sites tends to worsen the accuracy of the induced maximum likelihood trees. Removing variable regions in addition to gapped sites (Gblocks, default settings) drastically reduces the accuracy of reconstructed trees. Error bars correspond to ± 1 s.d. A significant difference between results from original and curated alignments is denoted with a minus symbol at the base of the relevant bars (two-sided Wilcoxon test, P < 0.01).
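For readers unfamiliar with the filtering step being evaluated, the following Python sketch shows the simpler of the two filters: dropping every alignment column that contains a gap. It is an illustrative assumption of how such a filter can be written, not the procedure used in the study, and it does not reproduce what Gblocks does. The function name `remove_gapped_sites` is hypothetical.

```python
def remove_gapped_sites(alignment, gap="-"):
    """Return a copy of the alignment without any column containing a gap.

    alignment: dict mapping sequence name -> aligned sequence string.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    # Keep only the columns in which no sequence has a gap.
    kept_columns = [column for column in zip(*rows) if gap not in column]

    return {
        name: "".join(column[i] for column in kept_columns)
        for i, name in enumerate(names)
    }


filtered = remove_gapped_sites({"a": "AC-GT", "b": "ACTG-", "c": "ACTGT"})
```

Here two of five columns are discarded, which shows how quickly this filter shrinks an alignment. That loss of sites is consistent with the caption's observation that removing gapped sites tends to worsen tree accuracy.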
