Fast algorithms for computing sequence distances by exhaustive substring composition

Alberto Apostolico¹, Olgert Denas

Affiliations

PMID: 18957094
PMCID: PMC2615014
DOI: 10.1186/1748-7188-3-13

Fast algorithms for computing sequence distances by exhaustive substring composition

Alberto Apostolico et al. Algorithms Mol Biol. 2008.

. 2008 Oct 28:3:13.

doi: 10.1186/1748-7188-3-13.

Authors

Alberto Apostolico¹, Olgert Denas

Affiliation

¹ Academia Nazionale dei Lincei, Rome, Italy. axa@cc.gatech.edu

PMID: 18957094
PMCID: PMC2615014
DOI: 10.1186/1748-7188-3-13

Abstract

The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.

PubMed Disclaimer

Figures

**Figure 1**
Phylogenetic trees derived for small samples under various compositional distances.

**Figure 2**
**Phylogenetic trees derived for small samples under various compositional distances.** With a sample of "closer" species, some differences appear in the tree already for small values of k (top row): all fixed k-mers with k = 5, same as for k = 6 or 7 (left); maximal k-mers up to a maximum K = 5 (center); and all k-mers up to a maximum K = 5, same as for k = 6 or 7 (right). This is repeated with K = 15 in the second row, with K = 45 in the third row except for the middle entry set at K = 6 and K = 7, respectively, since trees then stabilize for higher K.

See this image and copyright information in PMC

References

1. von Mises R. Probability, Statistics and Truth MacMillan. 1939.
1. Brillouin L. Science and Information Theory. Academic Press; 1971.
1. Kolmogorov A. Three approaches to the quantitative definition of information. Problemi Pederachi Inf. 1965;1
1. Lempel A, Ziv J. On the complexity of finite sequences. IEEE Transactions on Information Theory. 1976;22:75–81. doi: 10.1109/TIT.1976.1055501. - DOI
1. Cover TM, Thomas JA. Elements of Information Theory. Wiley-Interscience; 1991.

LinkOut - more resources

Full Text Sources
Molecular Biology Databases
- BacDive

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Fast algorithms for computing sequence distances by exhaustive substring composition

Affiliation

Fast algorithms for computing sequence distances by exhaustive substring composition

Authors

Affiliation

Abstract

Figures

References

LinkOut - more resources

Full Text Sources

Molecular Biology Databases