Alignment-free sequence comparison (II): theoretical power of comparison statistics

Lin Wan¹, Gesine Reinert, Fengzhu Sun, Michael S Waterman

PMID: 20973742
PMCID: PMC3123933
DOI: 10.1089/cmb.2010.0056

J Comput Biol. 2010 Nov;17(11):1467-90. doi: 10.1089/cmb.2010.0056. Epub 2010 Oct 25.

Affiliation

¹ Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089-2910, USA.

Abstract

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models: the first assumes that the two sequences share a common motif, whereas the second is a pattern transfer model. The null model is that the two sequences are composed of independent and identically distributed letters and are independent of each other. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
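
The three count statistics named in the abstract can be illustrated with a short sketch. The abstract does not state the formulas, so the code below assumes the definitions commonly given for these statistics: D2 sums the products of raw k-tuple counts, D2* divides each product of centralized counts by sqrt(n̄ m̄) p_w, and D2S divides it by the root of the summed squared centralized counts. The function names and the `letter_probs` argument (letter probabilities treated as known) are hypothetical choices for illustration; this is not the authors' downloadable program.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (word) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(seq_a, seq_b, k, letter_probs):
    """Return (D2, D2*, D2S) under an assumed i.i.d. letter model.

    letter_probs maps each letter to its probability; treating these
    as known is an assumption of this sketch.
    """
    n_bar = len(seq_a) - k + 1  # number of k-tuples in seq_a
    m_bar = len(seq_b) - k + 1  # number of k-tuples in seq_b
    x, y = kmer_counts(seq_a, k), kmer_counts(seq_b, k)

    d2 = d2_star = d2_s = 0.0
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_w = prod(letter_probs[c] for c in letters)  # word probability
        x_c = x[word] - n_bar * p_w  # centralized count in seq_a
        y_c = y[word] - m_bar * p_w  # centralized count in seq_b
        d2 += x[word] * y[word]
        d2_star += x_c * y_c / (sqrt(n_bar * m_bar) * p_w)
        norm = sqrt(x_c ** 2 + y_c ** 2)
        if norm > 0:  # skip words whose centralized counts are both zero
            d2_s += x_c * y_c / norm
    return d2, d2_star, d2_s
```

Enumerating all words costs time proportional to the alphabet size raised to the power k, which is acceptable for the small word sizes (k = 2 to 5) used in the figures but would need a sparse formulation for larger k.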

Figures

**FIG. 1.**
The values of C(λ) and C^*(λ) (upper panels) and the power of D2 and D2* (lower panels) for detecting the relationships between sequence pairs related through alternative model I for different values of word size k = 2, 3, 4, 5 and sequence length n. The parameters were set at *p_A* = *p_T* = 1/6, *p_C* = *p_G* = 1/3, λ = 0.99, and type I error α = 0.05.

**FIG. 2.**
The values of B(λ) and B^*(λ) for λ = 0.93, 0.99 and k = 2, 3, 4, 5. Dashed lines refer to B and solid lines to B^*; triangles refer to λ = 0.93 and circles to λ = 0.99. B(0.99), dashed line with circle points; B(0.93), dashed line with triangle points; B^*(0.99), solid line with circle points; B^*(0.93), solid line with triangle points.

**FIG. 3.**
The sequence LOGO of motif “MA0003”.

**FIG. 4.**
The values of as a function of motif density λ and word length k, λ = 0.9 to 1.0 by step 0.01, and
