A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets

Barış Ekim¹,², Bonnie Berger¹,², Yaron Orenstein³

Affiliations

¹ Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
² Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
³ School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel.

PMID: 38835399
PMCID: PMC11148856
DOI: 10.1007/978-3-030-45257-5_3

Barış Ekim et al. Res Comput Mol Biol. 2020 May:12074:37-53. doi: 10.1007/978-3-030-45257-5_3. Epub 2020 Apr 21.

Abstract

As the volume of next-generation sequencing data increases, there is an urgent need for algorithms that process the data efficiently. Universal hitting sets (UHSs) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, in the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of $k$-mers that hits every sequence of length $L$, so its $k$-mers can serve as indices to $L$-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances due to their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of $k$ (e.g., $k > 13$). To address this bottleneck, we present two algorithmic innovations that significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize the calculation of $k$-mer hitting numbers and decrease its memory usage; and (ii) we build upon techniques from randomized Set Cover to select universal $k$-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles $k > 13$. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA improves runtime and memory usage by orders of magnitude over the current best algorithms. We expect our newly practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.
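
To make the UHS definition concrete, the following is a minimal brute-force sketch, not PASHA itself. It assumes the DNA alphabet `ACGT` and reads "hits" as "the sequence contains at least one $k$-mer from the set as a contiguous substring". The function names `hits` and `is_universal_hitting_set` are hypothetical and chosen only for illustration. Because it enumerates all $4^L$ sequences, it is usable only for toy values of $k$ and $L$.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; the abstract does not state it explicitly


def hits(kmers, seq, k):
    """Return True if seq contains at least one k-mer from kmers."""
    return any(seq[i:i + k] in kmers for i in range(len(seq) - k + 1))


def is_universal_hitting_set(kmers, k, L):
    """Brute-force check that kmers hits every sequence of length L.

    Enumerates all 4**L sequences, so this is exponential in L and
    only meant to illustrate the definition on tiny instances.
    """
    kmers = set(kmers)
    return all(
        hits(kmers, "".join(chars), k)
        for chars in product(ALPHABET, repeat=L)
    )
```

For example, with $k = 2$ and $L = 3$ the set of all sixteen 2-mers is trivially universal. Removing `AA` breaks universality, because the sequence `AAA` contains no other 2-mer. Removing `AC` does not, because no 3-long sequence consists of two overlapping copies of `AC`.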

Keywords: Parallelization; Randomization; Universal hitting sets.

Figures

**Fig. 1.**
Runtimes (left) and UHS sizes (divided by 10⁴, right) for values of $k = 10$ (A, B), 11 (C, D), 12 (E, F), and 13 (G, H) and $20 \leq L \leq 200$ for the different methods. Note that the y-axes for runtimes are in logarithmic scale.

**Fig. 2.**
Runtimes (A) and UHS sizes (divided by 10⁶) (B) for $14 \leq k \leq 16$ and $L = 100$ for PASHA. Note that the y-axis for runtime is in logarithmic scale.

**Fig. 3.**
Mean approximate expected density (A), and density on the human reference genome (B) for different methods, for $5 \leq k \leq 16$ and $L = 100$. Error bars represent one standard deviation from the mean across 10 random sequences of length 10⁶. Density is the fraction of selected $k$-mer positions over the number of $k$-mers in the sequence.
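
The density definition in the Fig. 3 caption can be illustrated with a short sketch. The caption does not say how positions are selected, so the rule used here is an assumption: in every window of length $L$, pick the smallest $k$-mer under an ordering that ranks UHS $k$-mers first, then lexicographic order, then leftmost position. The names `selected_positions` and `density` are hypothetical.

```python
def selected_positions(seq, k, L, uhs):
    """Positions picked by an assumed per-window minimum rule.

    Assumption (not stated in the caption): within each L-long window,
    k-mers in `uhs` rank before all others; ties are broken
    lexicographically and then by leftmost position.
    """
    w = L - k + 1  # number of k-mers in one L-long window
    chosen = set()
    for start in range(len(seq) - L + 1):
        best = min(
            range(start, start + w),
            key=lambda i: (seq[i:i + k] not in uhs, seq[i:i + k], i),
        )
        chosen.add(best)
    return chosen


def density(seq, k, L, uhs):
    """Fraction of selected k-mer positions over all k-mers in seq."""
    n_kmers = len(seq) - k + 1
    return len(selected_positions(seq, k, L, uhs)) / n_kmers
```

With `seq = "ACAC"`, `k = 1`, `L = 2`, and `uhs = {"A"}`, the three windows select positions 0, 2, and 2, so two of the four 1-mer positions are selected and the density is 0.5.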

Grants and funding

R01 GM081871/GM/NIGMS NIH HHS/United States
