J Comput Biol. 2010 Nov;17(11):1467-90.
doi: 10.1089/cmb.2010.0056. Epub 2010 Oct 25.

Alignment-free sequence comparison (II): theoretical power of comparison statistics

Lin Wan et al. J Comput Biol. 2010 Nov.

Abstract

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models: the first assumes that the two sequences share a common motif, whereas the second is a pattern transfer model. The null model is that the two sequences are independent of each other and each is composed of independent and identically distributed letters. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.
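
As a rough illustration of the three statistics named in the abstract, the following minimal Python sketch computes D2, D2* and D2S for two sequences under an i.i.d. letter model. The abstract does not state the formulas, so the definitions used here are an assumption based on how these statistics are commonly written in the alignment-free literature: D2 sums the products of k-tuple counts, D2* centralizes each count by its expected value and divides by sqrt(n̄ m̄) p_w, and D2S divides each centralized product by the square root of the sum of squared centralized counts. The function names `kmer_counts` and `d2_statistics` are hypothetical and are not taken from the authors' program.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistics(x, y, k, probs):
    """Return (D2, D2*, D2S) for sequences x and y.

    probs maps each letter to its probability under the i.i.d. null model.
    The formulas are the assumed standard forms, not quoted from the paper.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    n_bar = len(x) - k + 1  # number of k-tuple positions in x
    m_bar = len(y) - k + 1  # number of k-tuple positions in y
    d2 = d2_star = d2_s = 0.0
    for letters in product(sorted(probs), repeat=k):
        word = "".join(letters)
        p_w = prod(probs[a] for a in letters)  # word probability under i.i.d.
        xw, yw = cx[word], cy[word]
        d2 += xw * yw
        # Centralized counts: observed count minus expected count.
        xt = xw - n_bar * p_w
        yt = yw - m_bar * p_w
        d2_star += xt * yt / (sqrt(n_bar * m_bar) * p_w)
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:  # skip words whose centralized counts are both zero
            d2_s += xt * yt / denom
    return d2, d2_star, d2_s


if __name__ == "__main__":
    # Letter probabilities as in the FIG. 1 caption.
    probs = {"A": 1 / 6, "T": 1 / 6, "C": 1 / 3, "G": 1 / 3}
    print(d2_statistics("ACGTACGTGGCC", "CCGGACGTTTAA", 2, probs))
```

The sketch enumerates all |alphabet|^k words, which is fine for the small word sizes (k = 2 to 5) discussed here but would need a sparse approach for larger k.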

Figures

FIG. 1.
The values of C(λ) and C*(λ) (upper panels) and the power of D2 and formula image (lower panels) for detecting the relationships between sequence pairs related through alternative model I for different values of word size k = 2, 3, 4, 5 and sequence length n. The parameters were set at pA = pT = 1/6, pC = pG = 1/3, λ = 0.99, and type I error α = 0.05.
FIG. 2.
The values of B(λ) and B*(λ) for λ = 0.93, 0.99 and k = 2, 3, 4, 5. Dashed lines refer to B and solid lines to B*; triangles refer to λ = 0.93 and circles to λ = 0.99. B(0.99), dashed line with circle points; B(0.93), dashed line with triangle points; B*(0.99), solid line with circle points; B*(0.93), solid line with triangle points.
FIG. 3.
The sequence LOGO of motif “MA0003”.
FIG. 4.
The values of formula image as a function of motif density λ and word length k, λ = 0.9 to 1.0 by step 0.01, and formula image
