Phylogenetic assessment of alignments reveals neglected tree signal in gaps

Christophe Dessimoz¹, Manuel Gil

PMID: 20370897
PMCID: PMC2884540
DOI: 10.1186/gb-2010-11-4-r37

Genome Biol. 2010;11(4):R37. doi: 10.1186/gb-2010-11-4-r37. Epub 2010 Apr 6.

Affiliation

¹ Department of Computer Science, ETH Zurich, Universitaetstr. 6, 8092 Zürich, Switzerland. cdessimoz@inf.ethz.ch

Abstract

Background: The alignment of biological sequences is of chief importance to most evolutionary and comparative genomics studies, yet the two main approaches used to assess alignment accuracy have flaws: reference alignments are derived from the biased sample of proteins with known structure, and simulated data lack realism.

Results: Here, we introduce tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference. We show that (i) the current belief that consistency-based alignments outperform scoring matrix-based alignments is misguided; (ii) gaps carry substantial phylogenetic signal, but are poorly exploited by most alignment and tree building programs; (iii) even so, excluding gaps and variable regions is detrimental; (iv) disagreement among alignment programs says little about the accuracy of resulting trees.

Conclusions: This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.

Figures

**Figure 1**
**Schematic of the phylogeny-based tests of alignment accuracy**. Both tests are based on large-scale genomic data: **(a)** The species-tree discordance test samples sets of orthologs inferred by OMA among species with a well-accepted phylogeny (Additional file 1, Figure S1). Each sample is aligned by the different packages. The resulting alignments are evaluated by reconstructing trees from them, and comparing with the reference topology. All else being equal, trees from better alignment packages show higher average congruence with the reference topology. **(b)** The minimum duplication test follows a similar idea, but differs from the first test in two ways. First, it samples sets of homologs rather than the more specific orthologs. Second, the evaluation is based on a parsimony argument rather than knowledge about the phylogeny of the species: all else being equal, alignments yielding trees with fewer duplication nodes on average are more accurate.
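
The two scoring ideas in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: trees are written as nested tuples of leaf names, congruence is taken to be the fraction of reference splits recovered by the inferred tree, and duplication nodes are counted on a rooted gene tree with a species-overlap heuristic. The function names, the tree encoding, the congruence measure and the duplication heuristic are all assumptions made for this example.

```python
def leaf_set(tree):
    """Return the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Collect the bipartitions induced by the internal branches of a tree."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Always store the side that does not contain the anchor leaf.
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def split_congruence(inferred, reference):
    """Fraction of reference splits also present in the inferred tree."""
    ref = nontrivial_splits(reference)
    if not ref:
        raise ValueError("reference tree has no non-trivial splits")
    return len(nontrivial_splits(inferred) & ref) / len(ref)


def count_duplications(gene_tree, species_of):
    """Count nodes whose child subtrees share a species (assumed heuristic)."""
    count = 0

    def visit(node):
        nonlocal count
        if isinstance(node, str):
            return {species_of[node]}
        seen, overlap = set(), False
        for child in node:
            child_species = visit(child)
            if seen & child_species:
                overlap = True
            seen |= child_species
        if overlap:
            count += 1
        return seen

    visit(gene_tree)
    return count


reference = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(split_congruence(inferred, reference))  # 0.5

species_of = {"h1": "human", "h2": "human",
              "m1": "mouse", "m2": "mouse", "y1": "yeast"}
print(count_duplications(((("h1", "m1"), ("h2", "m2")), "y1"), species_of))  # 1
print(count_duplications(((("h1", "h2"), ("m1", "m2")), "y1"), species_of))  # 2
```

In this toy case the first gene tree explains the data with a single duplication and the second needs two, so under the parsimony argument of test (b) an alignment that yields the first tree would be preferred.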

**Figure 2**
**Comparison of alignment methods**. Assessment of various alignment methods under default parameters using **(a)** the species-tree discordance and **(b)** the minimum duplication tests, on eukaryotic data. Consistency-based alignment methods do not improve over scoring matrix-based methods. The relative performance between alignment programs is more variable for nucleotide data than for amino-acid data. On amino-acid data, Mafft-FFT-NS-2, DiAlign TX and Prank were never outperformed; on nucleotide data, Mafft L-INS-i (right column) was never outperformed (see also Additional file 1, Figure S6). Average compute times (per alignment) are plotted as triangles (amino-acids) and circles (nucleotides). Error bars correspond to ± 1 s.d. A significant difference from the best alignment program is denoted with a minus symbol at the base of the relevant bars (Wilcoxon double-sided test, P < 0.01).

**Figure 3**
**Phylogenetic signal of gaps**. **(a)** Assessment of gap accuracy under default parameters using the species-tree discordance test with parsimony trees on presence/absence patterns of gap characters in amino-acid alignments. By taking into account gap information, this test demonstrates that the gap placement of Prank is significantly better than that of other alignment methods. This cannot be observed either using standard tree building methods (Figure 2), or using structure-based benchmarks. Error bars correspond to ± 1 s.d. A significant difference from Prank is denoted with a minus symbol at the base of the relevant bars (Wilcoxon double-sided test, P < 0.01). **(b)** Accuracy of maximum likelihood (ML) trees on amino-acid substitution patterns versus parsimony on binary gap presence/absence characters, on fungal data. The phylogenetic signal of gaps inferred by Prank increases with divergence. For distant sequences, the proportion of correctly inferred splits from gaps alone is close to that from amino-acid substitutions by ML. Thus, tree building methods could capture up to twice as much phylogenetic signal from the same data. Moreover, note that the crude approach used here to infer the gap trees likely understates the potential of gap patterns.
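
To make "binary gap presence/absence characters" concrete, the sketch below recodes an alignment column by column: 1 where a sequence has a gap and 0 where it has a residue. Only parsimony-informative columns are kept, meaning at least two sequences in each state. The caption does not say how the authors coded gaps, so the per-column scheme, the informativeness filter and the name `gap_characters` are assumptions for illustration only.

```python
def gap_characters(alignment):
    """Recode an alignment (dict: name -> aligned sequence) as 0/1 gap characters."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    informative = []
    for i in range(length):
        column = [1 if alignment[name][i] == "-" else 0 for name in names]
        # Keep columns where both states occur in at least two sequences.
        if 2 <= sum(column) <= len(names) - 2:
            informative.append(column)

    return {
        name: "".join(str(column[k]) for column in informative)
        for k, name in enumerate(names)
    }


alignment = {
    "s1": "AC-GT",
    "s2": "AC-GT",
    "s3": "ACTG-",
    "s4": "ACTGT",
}
print(gap_characters(alignment))
# {'s1': '1', 's2': '1', 's3': '0', 's4': '0'}
```

The resulting 0/1 matrix is the kind of input a parsimony program could take to build a tree from gap patterns alone. Here the single retained character groups s1 with s2.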

**Figure 4**
**Effect of excluding gaps and variable regions**. The plot shows the effect of filtering on the minimum duplication test with back-translated, fungal amino-acid alignments. Removing gapped sites tends to worsen the accuracy of the induced maximum likelihood trees. Removing variable regions in addition to gapped sites (Gblocks, default settings) drastically reduces the accuracy of reconstructed trees. Error bars correspond to ± 1 s.d. A significant difference between results from original and curated alignments is denoted with a minus symbol at the base of the relevant bars (Wilcoxon double-sided test, P < 0.01).
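
The simpler of the two filters, removing gapped sites, can be illustrated as follows. This sketch drops every column in which at least one sequence has a gap. It does not reproduce Gblocks, whose block-selection rules are more involved, and the function name `remove_gapped_sites` is hypothetical.

```python
def remove_gapped_sites(alignment):
    """Drop every alignment column that contains at least one gap character."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    kept = [
        i for i in range(length)
        if all(alignment[name][i] != "-" for name in names)
    ]
    return {name: "".join(alignment[name][i] for i in kept) for name in names}


alignment = {"s1": "AC-GT", "s2": "ACTG-", "s3": "ACTGA"}
print(remove_gapped_sites(alignment))
# {'s1': 'ACG', 's2': 'ACG', 's3': 'ACG'}
```

In this toy example two of the five columns are discarded, along with the only variable site, which hints at how such filtering can throw away signal.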
