Weighted minimizer sampling improves long read mapping

Chirag Jain¹, Arang Rhie¹, Haowen Zhang², Claudia Chu², Brian P Walenz¹, Sergey Koren¹, Adam M Phillippy¹

Affiliations

¹ National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
² College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.

PMID: 32657365
PMCID: PMC7355284
DOI: 10.1093/bioinformatics/btaa435

Weighted minimizer sampling improves long read mapping

Chirag Jain et al. Bioinformatics. 2020.

. 2020 Jul 1;36(Suppl_1):i111-i118.

doi: 10.1093/bioinformatics/btaa435.

Authors

Chirag Jain¹, Arang Rhie¹, Haowen Zhang², Claudia Chu², Brian P Walenz¹, Sergey Koren¹, Adam M Phillippy¹

Affiliations

¹ National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
² College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.

PMID: 32657365
PMCID: PMC7355284
DOI: 10.1093/bioinformatics/btaa435

Abstract

Motivation: In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.

Results: We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.

Availability and implementation: Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.

Published by Oxford University Press 2020.

PubMed Disclaimer

Figures

**Fig. 1.**
Visualization of the minimizer sampling technique (Roberts *et al.*, 2004). The sequence being sampled is shown as a black line. Assuming a pre-defined order of k-mers (e.g. lexicographical), the sampling selects each of the smallest k-mer(s) (shown as red arrows) in consecutive sliding windows (shown as blue intervals). The length of each window, i.e. w is three in the above example

**Fig. 2.**
Visualizing a single iteration of Algorithm 1. As the window slides to the right, the new k-mer is inserted at the back of the double-ended queue Q, and a minimizer is selected from front end of the queue. The above figure also uses two equal k-mers (order=5) in different colours to highlight how ties are handled in Algorithm 1

**Fig. 3.**
Visualization of tie breaking in the standard minimizer and robust winnowing algorithms. We use a low-complexity sequence ‘CACACA…’ to illustrate the difference between the two approaches. The sequence comprised two 3-mers, ‘CAC’ and ‘ACA’, which are ordered lexicographically in the above example. In the left plot, we break ties by sampling all equally minimum k-mers per window, where as we follow the robust winnowing tie breaking in the right plot. Note that robust winnowing samples half the minimizers as compared to the standard approach in this repetitive sequence example

**Fig. 4.**
Step histogram plot showing frequency of minimizers sampled using a complete human chromosome X as the reference. Frequency of minimizers was computed in each consecutive ‘super-window’ of length 1 Mbp across the reference sequence. We compare three sampling algorithms using window length parameter w = 10. ‘Standard’ method refers to the classic minimizer sampling algorithm from Roberts *et al.* (2004), without any masking or modification. Minimap2 uses the standard algorithm, but masks the most frequently occurring minimizers (top 0.02%) in the reference (count $\geq 160$ for this reference). Winnowmap uses weighted minimizer sampling. Both ‘Standard’ and Winnowmap methods maintain at least one minimizer per window in their index and achieve near-uniform density. The masking heuristic in Minimap2 reduces minimizer density throughout the chromosome X, and a significant drop is observed in its long repetitive centromere (58–61 Mbp). Interestingly, we also observe a slight increase in Winnowmap’s minimizer density in this region

**Fig. 5.**
Benefit of the proposed minimizer sampling optimizations to long read mapping performance. We compare three methods: ‘Standard’ (default minimizer sampling, no masking), ‘Optimization 1’ (robust winnowing) and ‘Optimization 1 + 2’ (robust winnowing and weighted minimizer sampling). All three methods select at least one minimizer in each window. The left plot indicates total mapping time whereas the right plot indicates total memory usage, using datasets D1st. The x-axis is log-scaled in both plots. The combination of both optimizations is crucial in Winnowmap to achieve good efficiency

See this image and copyright information in PMC

References

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. - PMC - PubMed
1. Baharav,T.Z. et al. (2020) Spectral Jaccard Similarity: A new approach to estimating pairwise sequence alignments. bioRxiv. doi: 10.1101/800581. - PMC - PubMed
1. Berlin K. et al. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol., 33, 623–630. - PubMed
1. Broder A.Z. (1997) On the resemblance and containment of documents. In Proceedings Compression and Complexity of Sequences 1997 (Cat. No. 97TB100171), pp. 21–29. IEEE.
1. Chikhi R. et al. (2015) On the representation of de Bruijn graphs. J. Comput. Biol., 22, 336–352. - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Weighted minimizer sampling improves long read mapping

Affiliations

Weighted minimizer sampling improves long read mapping

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Miscellaneous