Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2010 Sep 1;149(3):120-6.
doi: 10.1016/j.jbiotec.2010.05.006. Epub 2010 Jun 2.

Efficient tools for comparative substring analysis

Affiliations
Comparative Study

Efficient tools for comparative substring analysis

Alberto Apostolico et al. J Biotechnol. .

Abstract

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve allk-mers for anyk up to any preset m aximum length K (including K=infinity). Remarkably, although there may be Theta(L(2)) distinct strings that occur in a given sequence of length L (and Theta(KL) of length k< or =K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K). A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. This paper.

PubMed Disclaimer

Publication types

LinkOut - more resources