Syst Biol. 2007 Apr;56(2):206-21.
doi: 10.1080/10635150701294741.

Is multiple-sequence alignment required for accurate inference of phylogeny?

Michael Höhl et al. Syst Biol. 2007 Apr.

Abstract

The process of inferring phylogenetic trees from molecular sequences almost always starts with a multiple alignment of these sequences but can also be based on methods that do not involve multiple sequence alignment. Very little is known about the accuracy with which such alignment-free methods recover the correct phylogeny or about the potential for increasing their accuracy. We conducted a large-scale comparison of ten alignment-free methods, among them one new approach that does not calculate distances and a faster variant of our pattern-based approach; all distance-based alignment-free methods are freely available from http://www.bioinformatics.org.au (as Python package decaf+py). We show that most methods exhibit a higher overall reconstruction accuracy in the presence of high among-site rate variation. Under all conditions that we considered, variants of the pattern-based approach were significantly better than the other alignment-free methods. The new pattern-based variant achieved a speed-up of an order of magnitude in the distance calculation step, accompanied by a small loss of tree reconstruction accuracy. A method of Bayesian inference from k-mers did not improve on classical alignment-free (and distance-based) methods but may still offer other advantages due to its Bayesian nature. We found the optimal word length k of word-based methods to be stable across various data sets, and we provide parameter ranges for two different alphabets. The influence of these alphabets was analyzed to reveal a trade-off in reconstruction accuracy between long and short branches. We have mapped the phylogenetic accuracy for many alignment-free methods, among them several recently introduced ones, and increased our understanding of their behavior in response to biologically important parameters. In all experiments, the pattern-based approach emerged as superior, at the expense of higher resource consumption. 
Nonetheless, no alignment-free method that we examined recovers the correct phylogeny as accurately as does an approach based on maximum-likelihood distance estimates of multiply aligned sequences.
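To make the idea of a word-based (k-mer) alignment-free distance concrete, the following is a minimal sketch, not the decaf+py implementation and not necessarily any of the ten methods compared in the study. It assumes a squared Euclidean distance between k-mer count vectors as the measure; the function names `kmer_counts`, `squared_euclidean`, and `distance_matrix` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two k-mer count vectors.

    Assumption: this stands in for a generic word-based distance;
    the study evaluates several different measures.
    """
    words = set(a) | set(b)
    # Counter returns 0 for words that are absent from one sequence.
    return sum((a[w] - b[w]) ** 2 for w in words)


def distance_matrix(seqs, k):
    """Pairwise distances for a dict mapping sequence names to sequences."""
    counts = {name: kmer_counts(s, k) for name, s in seqs.items()}
    return {
        (x, y): squared_euclidean(counts[x], counts[y])
        for x, y in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    toy = {"s1": "AAAB", "s2": "AABB", "s3": "BBBB"}
    for pair, d in distance_matrix(toy, k=2).items():
        print(pair, d)
```

A matrix of this kind could then be passed to a distance-based tree-building method such as neighbor joining. No alignment is computed at any step, which is the defining property of the methods discussed in the abstract.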


Figures

Figure 1
RF distance landscape for method B-bin. Average RF distance (y-axis) of method B-bin on three reference sets (top to bottom: set 2, set 4, and set 6) of two synthetic data sets (a, c, e: control; b, d, f: ASRV). Each subfigure shows the behavior as a function of word length k (x-axis) for two alphabets (AA: original amino acids, CE: chemical equivalence classes). Points are joined for ease of visual inspection only.
Figure 2
Average RF distance for six methods. Average RF distance (y-axis) for six selected methods on all seven reference sets (x-axis) of two synthetic data sets (a: control; b: ASRV). For each data set, we show (1) the ML distance estimate based on correct alignments, (2) the best pattern-based variant, (3 and 4) the best word-based method and the best method not based on words, (5) the best composition distance, and (6) the W-metric. The numbers in the inserted legends refer to the far left-hand column of Table 1 (Figure 2a) and Table 2 (Figure 2b), respectively.
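The figures report accuracy as Robinson-Foulds (RF) distance between an inferred tree and a reference tree. As a rough illustration of that metric, the sketch below counts the non-trivial bipartitions found in one tree but not the other. It assumes trees are given as nested Python tuples of leaf names and are treated as unrooted; this representation and the names `leaf_set`, `splits`, and `rf_distance` are hypothetical, and the source does not state whether its averages use a normalized variant.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions of an unrooted tree.

    Each bipartition is stored as the side that does NOT contain a
    fixed anchor leaf, so the same split is always encoded identically.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits that separate zero or one leaf from the rest.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Unnormalized RF distance: size of the symmetric difference of splits."""
    if leaf_set(t1) != leaf_set(t2):
        raise ValueError("trees must share the same leaf set")
    return len(splits(t1) ^ splits(t2))


if __name__ == "__main__":
    reference = ((("A", "B"), "C"), ("D", "E"))
    inferred = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(reference, inferred))
```

A distance of 0 means the two trees have the same unrooted topology; larger values mean more splits disagree, so lower average RF distance in the figures corresponds to higher reconstruction accuracy.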
