Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Jul 12;37(Suppl_1):i187-i195.
doi: 10.1093/bioinformatics/btab313.

Sequence-specific minimizers via polar sets

Affiliations

Sequence-specific minimizers via polar sets

Hongyu Zheng et al. Bioinformatics. .

Abstract

Motivation: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences.

Results: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers.

Availability and implementation: A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Comparing universal hitting sets, perfect seeds (compatible minimizers become perfect minimizers) and polar sets. Each block indicates a k-mer, and each segment indicates a window of length 5 (w =5). To provide a better contrast with universal hitting sets, we show polar sets with slackness s =0 (see Definition 2)
Fig. 2.
Fig. 2.
Examples for our argument for the polar set density bound with w =5. Left top: Legend for a context, and when it is charged. Right top: Case for a singleton polar k-mer without links. In this case, L(S,A)=0. Bottom: Case for three linked polar k-mers. Whatever the ordering between the three polar k-mers, two of the four contexts marked in blue will be charged. The link energy L(S,A)=1/3: A and B are l =3 bases away with energy 2l/(w+1)1=0, B and C are l =4 bases away with energy 2l/(w+1)1=1/3
Fig. 3.
Fig. 3.
Examples of layered polar sets, with three layers. Without layered polar sets, the k-mers from layer 2 and 3 could not be selected as in the polar set because of self-collision. The whole sequence is covered in this case (every window contains a polar k-mer from one layer). Layer 1 is the one with highest priority and our layered heuristics construct it first
Fig. 4.
Fig. 4.
(A) Energy surplus and deficit for short (w =10) and for long (w =100) windows, computed on the human reference sequence hg38. The difference between the two lines is the difference between the upper and lower bound of Theorem 1. It is very small and the bounds are very good estimates in practice. (B) Density factor for the proposed methods, for short and long windows, computed on hg38. The bottom orange dashed line is the theoretical minimum density (perfect minimizers)

References

    1. Almutairy M., Torng E. (2018) Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. PLoS One, 13, e0189960. - PMC - PubMed
    1. Blackburn S.R. (2015) Non-overlapping codes. IEEE Trans. Inf. Theory, 61, 4890–4894.
    1. Chikhi R. et al. (2016) Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32, i201–i208. - PMC - PubMed
    1. DeBlasio D. et al. (2019) Practical universal k-mer sets for minimizer schemes. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’19, ACM, New York, NY, USA. pp. 167–176.
    1. Deorowicz S. et al. (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31, 1569–1576. - PubMed

Publication types