Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Feb;29(2):155-168.
doi: 10.1089/cmb.2021.0431. Epub 2022 Feb 1.

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Affiliations

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Antonio Blanca et al. J Comput Biol. 2022 Feb.

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.

Keywords: Jaccard similarity; MinHash; confidence intervals; k-mers; mutation process; sketching.

PubMed Disclaimer

Figures

FIG. 1.
FIG. 1.
An example of the simple mutation process, with L=36 and k=5. There are five nucleotides that are mutated (marked with an x). For example, the mutation at position 10 mutates the k-spans K6,,K10 (marked in red). Note that an isolated nucleotide mutation (e.g., at position 10) can affect up to k k-spans (e.g., K6,,K10), but nearby nucleotide mutations can affect the same k-span (e.g., mutation of nucleotides at positions 23 and 27 both affect K23.) There are two islands (marked in blue) and three oceans, and Nmut=21. For example, K19,,K34 is an island, and K35 is an ocean.
FIG. 2.
FIG. 2.
Estimates of sequence divergence as done by mimimap2 (δ^) and by us (r^1). Reads are simulated from a random 10 kbp sequence introducing mutations at the given r1 rate. For each r1 value, 100 reads are used. As in Li (2018), we use k=15 and, using a random hash function, identify as seeds the k-mer minimizers, one for every window of 25 k-mers. In the case when δ^ is undefined, we set δ^=1.
FIG. 3.
FIG. 3.
Box and whisker plot of Cber scores for 5000 replicates of random strings of length 10,000nt, with mutations introduced at a rate of r1=0.1. The solid black line corresponds to the empirical median of Cber, while the dashed top line corresponds to ECber+z0.05VarCber and the bottom dot-dashed line corresponds to ECberz0.05VarCber, both computed from Theorem 11.

References

    1. Bankevich, A., Nurk, S., Antipov, D., et al. . 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477. - PMC - PubMed
    1. Broder, A.Z. 1997. On the resemblance and containment of documents, pp. 21–29. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, Salerno, Italy.
    1. Brown, C.T., Olm, M.R., Thomas, B.C., and Banfield, J.F.. 2016. Measurement of bacterial replication rates in microbial communities. Nat. Biotechnol. 34, 1256. - PMC - PubMed
    1. Brown, L.D., Cai, T.T., and DasGupta, A.. 2001. Interval estimation for a binomial proportion. Stat. Sci. 16, 101–117.
    1. Burden, C.J., Leopardi, P., and Forêt, S.. 2014. The distribution of word matches between markovian sequences with periodic boundary conditions. J. Comput. Biol. 21, 41–63. - PMC - PubMed

Publication types

LinkOut - more resources