The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Antonio Blanca¹, Robert S Harris², David Koslicki^{1

2

3}, Paul Medvedev^{1

3

4}

Affiliations

¹ Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA.
² Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
³ Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA.
⁴ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.

PMID: 35108101
PMCID: PMC11978275
DOI: 10.1089/cmb.2021.0431

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Antonio Blanca et al. J Comput Biol. 2022 Feb.

. 2022 Feb;29(2):155-168.

doi: 10.1089/cmb.2021.0431. Epub 2022 Feb 1.

Authors

Antonio Blanca¹, Robert S Harris², David Koslicki^{1

2

3}, Paul Medvedev^{1

3

4}

Affiliations

¹ Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA.
² Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
³ Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA.
⁴ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.

PMID: 35108101
PMCID: PMC11978275
DOI: 10.1089/cmb.2021.0431

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.

Keywords: Jaccard similarity; MinHash; confidence intervals; k-mers; mutation process; sketching.

PubMed Disclaimer

Figures

**FIG. 1.**
An example of the simple mutation process, with $L = 36$ and $k = 5$ . There are five nucleotides that are mutated (marked with an x). For example, the mutation at position 10 mutates the k-spans $K_{6}, \dots, K_{10}$ (*marked in red*). Note that an isolated nucleotide mutation (e.g., at position 10) can affect up to *k k*-spans (e.g., $K_{6}, \dots, K_{10}$ ), but nearby nucleotide mutations can affect the same k-span (e.g., mutation of nucleotides at positions 23 and 27 both affect $K_{23}$ .) There are two islands (*marked in blue*) and three oceans, and $N_{m u t} = 21$ . For example, $K_{19}, \dots, K_{34}$ is an island, and $K_{35}$ is an ocean.

**FIG. 2.**
Estimates of sequence divergence as done by mimimap2 ( $\hat{δ}$ ) and by us ( ${\hat{r}}_{1}$ ). Reads are simulated from a random 10 kbp sequence introducing mutations at the given r₁ rate. For each r₁ value, 100 reads are used. As in Li (2018), we use $k = 15$ and, using a random hash function, identify as seeds the k-mer minimizers, one for every window of 25 k-mers. In the case when $\hat{δ}$ is undefined, we set $\hat{δ} = 1$ .

**FIG. 3.**
Box and whisker plot of $C_{b e r}$ scores for 5000 replicates of random strings of length 10,000nt, with mutations introduced at a rate of $r_{1} = 0.1$ . The *solid black line* corresponds to the empirical median of $C_{b e r}$ , while the *dashed top line* corresponds to $E [C_{b e r}] + z_{0.05} \sqrt{V a r (C_{b e r})}$ and the *bottom dot-dashed line* corresponds to $E [C_{b e r}] - z_{0.05} \sqrt{V a r (C_{b e r})}$ , both computed from Theorem 11.

See this image and copyright information in PMC

References

1. Bankevich, A., Nurk, S., Antipov, D., et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477. - PMC - PubMed
1. Broder, A.Z. 1997. On the resemblance and containment of documents, pp. 21–29. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, Salerno, Italy.
1. Brown, C.T., Olm, M.R., Thomas, B.C., and Banfield, J.F.. 2016. Measurement of bacterial replication rates in microbial communities. Nat. Biotechnol. 34, 1256. - PMC - PubMed
1. Brown, L.D., Cai, T.T., and DasGupta, A.. 2001. Interval estimation for a binomial proportion. Stat. Sci. 16, 101–117.
1. Burden, C.J., Leopardi, P., and Forêt, S.. 2014. The distribution of word matches between markovian sequences with periodic boundary conditions. J. Comput. Biol. 21, 41–63. - PMC - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
- Atypon
- PubMed Central
Other Literature Sources
- The Lens - Patent Citations Database
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Affiliations

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources

Miscellaneous