. 2021 Jul 12;37(Suppl_1):i187-i195.

doi: 10.1093/bioinformatics/btab313.

Sequence-specific minimizers via polar sets

Hongyu Zheng¹, Carl Kingsford¹, Guillaume Marçais¹

Affiliations

PMID: 34252928
PMCID: PMC8686682
DOI: 10.1093/bioinformatics/btab313

Sequence-specific minimizers via polar sets

Hongyu Zheng et al. Bioinformatics. 2021.

. 2021 Jul 12;37(Suppl_1):i187-i195.

doi: 10.1093/bioinformatics/btab313.

Authors

Hongyu Zheng¹, Carl Kingsford¹, Guillaume Marçais¹

Affiliation

¹ Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

PMID: 34252928
PMCID: PMC8686682
DOI: 10.1093/bioinformatics/btab313

Abstract

Motivation: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences.

Results: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers.

Availability and implementation: A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

**Fig. 1.**
Comparing universal hitting sets, perfect seeds (compatible minimizers become perfect minimizers) and polar sets. Each block indicates a k-mer, and each segment indicates a window of length 5 (w = 5). To provide a better contrast with universal hitting sets, we show polar sets with slackness s = 0 (see Definition 2)

**Fig. 2.**
Examples for our argument for the polar set density bound with w = 5. Left top: Legend for a context, and when it is charged. Right top: Case for a singleton polar k-mer without links. In this case, $L (S, A) = 0$ . Bottom: Case for three linked polar k-mers. Whatever the ordering between the three polar k-mers, two of the four contexts marked in blue will be charged. The link energy $L (S, A) = 1 / 3$ : A and B are l = 3 bases away with energy $2 l / (w + 1) - 1 = 0$ , B and C are l = 4 bases away with energy $2 l / (w + 1) - 1 = 1 / 3$

**Fig. 3.**
Examples of layered polar sets, with three layers. Without layered polar sets, the k-mers from layer 2 and 3 could not be selected as in the polar set because of self-collision. The whole sequence is covered in this case (every window contains a polar k-mer from one layer). Layer 1 is the one with highest priority and our layered heuristics construct it first

**Fig. 4.**
(A) Energy surplus and deficit for short (w = 10) and for long (w = 100) windows, computed on the human reference sequence hg38. The difference between the two lines is the difference between the upper and lower bound of Theorem 1. It is very small and the bounds are very good estimates in practice. (B) Density factor for the proposed methods, for short and long windows, computed on hg38. The bottom orange dashed line is the theoretical minimum density (perfect minimizers)

See this image and copyright information in PMC

References

1. Almutairy M., Torng E. (2018) Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. PLoS One, 13, e0189960. - PMC - PubMed
1. Blackburn S.R. (2015) Non-overlapping codes. IEEE Trans. Inf. Theory, 61, 4890–4894.
1. Chikhi R. et al. (2016) Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32, i201–i208. - PMC - PubMed
1. DeBlasio D. et al. (2019) Practical universal k-mer sets for minimizer schemes. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’19, ACM, New York, NY, USA. pp. 167–176.
1. Deorowicz S. et al. (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31, 1569–1576. - PubMed

Publication types

Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

R01GM122935/NH/NIH HHS/United States

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Sequence-specific minimizers via polar sets

Affiliation

Sequence-specific minimizers via polar sets

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous