Comparative Study

. 2006 Dec 18;7 Suppl 5(Suppl 5):S21.

doi: 10.1186/1471-2105-7-S5-S21.

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Sylvain Forêt¹, Miriam R Kantorovitz, Conrad J Burden

Affiliations

PMID: 17254306
PMCID: PMC1764478
DOI: 10.1186/1471-2105-7-S5-S21

Comparative Study

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Sylvain Forêt et al. BMC Bioinformatics. 2006.

. 2006 Dec 18;7 Suppl 5(Suppl 5):S21.

doi: 10.1186/1471-2105-7-S5-S21.

Authors

Sylvain Forêt¹, Miriam R Kantorovitz, Conrad J Burden

Affiliation

¹ Mathematical Sciences Institute, Australian National University, Canberra, ACT 0200, Australia. sylvain.foret@rsbs.anu.edu.au

PMID: 17254306
PMCID: PMC1764478
DOI: 10.1186/1471-2105-7-S5-S21

Abstract

Background: The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D2, has been used for the clustering of EST sequences. Sequence comparison based on D2 is extremely fast, its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D2, and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied.

Results: We have computed the D2 optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D2 to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D2 statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA.

Conclusion: We have characterized the distribution of the D2 statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate that expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D2-based methods are most appropriate for large genomic sequences.

PubMed Disclaimer

Figures

**Figure 1**
**Kolmogorov-Smirnov p values for non-uniform** **compared with normal**. (adapted from [13]) The letter distribution is *f_A*= *f_T=*, *f_G*= *f_C*= , and the diagonal red line in each table is k = 2n + const.

formula image — **Figure 1**
**Kolmogorov-Smirnov p values for non-uniform** **compared with normal**. (adapted from [13]) The letter distribution is *f_A*= *f_T=*, *f_G*= *f_C*= , and the diagonal red line in each table is k = 2n + const.

**Figure 2**
**Effect of the number of mismatches on the accuracy of** . The curves show the accuracy of the statistics according to the word length and for various mismatches per word. In this simulation, sequence sizes of 600 letters were used. The dotted line shows the results obtained with BLAST.

See this image and copyright information in PMC

References

1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402. doi: 10.1093/nar/25.17.3389. - DOI - PMC - PubMed
1. Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990;183:63–98. - PubMed
1. Kent WJ. BLAT-the BLAST-like alignment tool. Genome Res. 2002;12:656–64. 10.1101/gr.229202. Article published online before March 2002. - PMC - PubMed
1. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998;8:967–74. - PMC - PubMed
1. Vinga S, Almeida J. Alignment-free sequence comparison-a review. Bioinformatics. 2003;19:513–23. doi: 10.1093/bioinformatics/btg005. - DOI - PubMed

Publication types

Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Affiliation

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Research Materials