Evol Bioinform Online. 2007 Feb 25;2:359-75.

Pattern-based phylogenetic distance estimation and tree reconstruction

Michael Höhl et al. Evol Bioinform Online.

Abstract

We have developed an alignment-free method that calculates phylogenetic distances using a maximum-likelihood approach for a model of sequence change on patterns that are discovered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and to conduct a comprehensive comparison of existing alignment-free methods (freely available as Python package decaf+py at http://www.bioinformatics.org.au), we have created a data set of reference trees covering a wide range of phylogenetic distances. Amino acid sequences were evolved along the trees and input to the tested methods; from their calculated distances we inferred trees whose topologies we compared to the reference trees. We find our pattern-based method statistically superior to all other tested alignment-free methods. We also demonstrate the general advantage of alignment-free methods over an approach based on automated alignments when sequences violate the assumption of collinearity. Similarly, we compare methods on empirical data from an existing alignment benchmark set that we used to derive reference distances and trees. Our pattern-based approach yields distances that show a linear relationship to reference distances over a substantially longer range than other alignment-free methods. The pattern-based approach outperforms the other alignment-free methods, and its phylogenetic accuracy is statistically indistinguishable from that of alignment-based distances.

Keywords: alignment-free methods; distance estimation; pattern discovery; phylogenetics.
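
The word-based alignment-free methods compared in this work score a pair of sequences from their k-word content rather than from an alignment. As a minimal sketch of that general idea, and not of the pattern-based method or the decaf+py interface, the following Python computes a squared Euclidean distance between overlapping k-word count vectors. The function names are hypothetical, and the choice of this particular measure is an assumption made for illustration only.

```python
from collections import Counter


def word_counts(seq, k):
    """Count the overlapping words of length k that occur in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean_distance(seq_a, seq_b, k):
    """Sum of squared differences between the k-word counts of two sequences.

    Illustrative only: the names and the exact measure are assumptions,
    not the API of any published package.
    """
    counts_a = word_counts(seq_a, k)
    counts_b = word_counts(seq_b, k)
    # A Counter returns 0 for words it has not seen, so the union is safe.
    words = set(counts_a) | set(counts_b)
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in words)
```

A matrix of such pairwise distances can then be passed to a distance-based tree reconstruction method, which is the general workflow the abstract describes.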

Figures

Figure 1
Average Robinson-Foulds distance (Y-axis) for two methods (a,c,e: dE; b,d,f: dC) on three reference sets (top to bottom: small, medium and large phylogenetic distances). Each subfigure shows the behaviour as a function of word length k (X-axis) under two alphabets (AA: original amino acids, CE: chemical equivalence classes). Points are joined for ease of visual inspection only. The expectation of a tree reconstruction method based on random choice is 2/3, or about 0.67.
Figure 2
Pairwise phylogenetic reference distances (X-axis) plotted against corresponding calculated distances (Y-axis). Methods and parameters are as follows: a) dPB with L = 4, W = 16, CE, b) dPB with L = 4, W = 16, AA, c) dPBMC with L = 4, W = 16, CE, d) dE with k = 4, AA, e) dS with k = 4, AA, f) dF with k = 4, AA, g) dP with k = 4, AA, h) dC with k = 3, AA, i) dC with k = 5, CE, j) dW, k) dLZ with AA, l) dACS with AA.
Figure 3
Average Robinson-Foulds distance (Y-axis) for two methods (a,c,e: dS; b,d,f: dF) on three reference sets (top to bottom: small, medium and large phylogenetic distances). Each subfigure shows the behaviour as a function of word length k (X-axis) under two alphabets (AA: original amino acids, CE: chemical equivalence classes). Points are joined for ease of visual inspection only. The expectation of a tree reconstruction method based on random choice is 2/3, or about 0.67.
Figure 4
Average Robinson-Foulds distance (Y-axis) for method dP on three reference sets (top to bottom: small, medium and large phylogenetic distances). Each subfigure shows the behaviour as a function of word length k (X-axis) under two alphabets (AA: original amino acids, CE: chemical equivalence classes). Points are joined for ease of visual inspection only. The expectation of a tree reconstruction method based on random choice is 2/3, or about 0.67.
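
The captions above report tree accuracy as an average Robinson-Foulds distance, which counts the splits (bipartitions of the leaf set) found in one tree but not the other. The sketch below is one way to compute it, under stated assumptions: trees are encoded as nested tuples of leaf names, and the distance is normalized by the total number of non-trivial splits in both trees. Both the encoding and the function names are hypothetical and are not taken from the paper. For four-taxon trees, averaging over the three possible unrooted topologies gives 2/3, which is one reading of the random-choice expectation quoted in the captions.

```python
def splits(tree):
    """Return the non-trivial splits of a tree given as nested tuples of leaf names."""
    found = set()

    def visit(node):
        # A leaf is a string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(visit(child) for child in node))

    all_leaves = visit(tree)

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(collect(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))
        return below

    collect(tree)
    return found


def robinson_foulds(tree_a, tree_b, normalized=True):
    """Size of the symmetric difference of the two split sets, optionally scaled to [0, 1]."""
    splits_a, splits_b = splits(tree_a), splits(tree_b)
    distance = len(splits_a ^ splits_b)
    if not normalized:
        return distance
    max_distance = len(splits_a) + len(splits_b)
    return distance / max_distance if max_distance else 0.0


# The three unrooted topologies on four taxa (assumed example data).
reference = (("A", "B"), ("C", "D"))
topologies = [reference, (("A", "C"), ("B", "D")), (("A", "D"), ("B", "C"))]
average = sum(robinson_foulds(reference, t) for t in topologies) / len(topologies)
```

Here `average` is the mean normalized distance between the reference and a topology picked uniformly at random from the three candidates.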
