Data structures based on k-mers for querying large collections of sequencing data sets

Camille Marchet¹, Christina Boucher², Simon J Puglisi³, Paul Medvedev^{4

5

6}, Mikaël Salson¹, Rayan Chikhi⁷

Affiliations

¹ Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France.
² Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA.
³ Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland.
⁴ Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁵ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁶ Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁷ Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France.

PMID: 33328168
PMCID: PMC7849385
DOI: 10.1101/gr.260604.119

Review

Data structures based on k-mers for querying large collections of sequencing data sets

Camille Marchet et al. Genome Res. 2021 Jan.

. 2021 Jan;31(1):1-12.

doi: 10.1101/gr.260604.119. Epub 2020 Dec 16.

Authors

Camille Marchet¹, Christina Boucher², Simon J Puglisi³, Paul Medvedev^{4

5

6}, Mikaël Salson¹, Rayan Chikhi⁷

Affiliations

¹ Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France.
² Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA.
³ Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland.
⁴ Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁵ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁶ Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁷ Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France.

PMID: 33328168
PMCID: PMC7849385
DOI: 10.1101/gr.260604.119

Abstract

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.

PubMed Disclaimer

Figures

**Figure 1.**
Timeline, main relationships, and application highlights for the methods covered in this survey.

**Figure 2.**
Overview of set of k-mer sets building blocks. We classified strategies in color-aggregative approaches and k-mer aggregative approaches (second column). The *top* row of the figure indicates the general categories of components of each method: the type of k-mer set; the way multiple sets are combined together; and an optional compression scheme. Each next row describes one of the surveyed methods. The cells in this figure are methodological choice, potentially common across methods; hence many cells are joined.

See this image and copyright information in PMC

References

1. Almeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, Pollard KS, Sakharova E, Parks DH, Hugenholtz P, et al. 2020. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol 10.1038/s41587-020-0603-3 - DOI - PMC - PubMed
1. Almodaresi F, Pandey P, Patro R. 2017. Rainbowfish: a succinct colored de Bruijn graph representation. In Proceedings of the Seventeenth International Workshop on Algorithms in Bioinformatics, Boston Dagstuhl Publishing, Saarbrücken/Wadern, Germany.
1. Almodaresi F, Sarkar H, Srivastava A, Patro R. 2018. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 34: i169–i177. 10.1093/bioinformatics/bty292 - DOI - PMC - PubMed
1. Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. 2019. An efficient, scalable and exact representation of high-dimensional color information enabled via de Bruijn graph search. In Proceedings of the International Conference on Research in Computational Molecular Biology, Washington, pp. 1–18. Springer, New York. - PMC - PubMed
1. Bender MA, Farach-Colton M, Johnson R, Kraner R, Kuszmaul BC, Medjedovic D, Montes P, Shetty P, Spillane RP, Zadok E. 2012. Don't thrash: how to cache your hash on flash. PVLDB 5: 1627–1637.

Publication types

Actions
Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions

Grants and funding

R01 AI141810/AI/NIAID NIH HHS/United States

LinkOut - more resources

Full Text Sources
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Data structures based on k-mers for querying large collections of sequencing data sets

Affiliations

Data structures based on k-mers for querying large collections of sequencing data sets

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Research Materials