KMC 2: fast and resource-frugal k-mer counting

Sebastian Deorowicz¹, Marek Kokot¹, Szymon Grabowski¹, Agnieszka Debudaj-Grabysz¹

Affiliation

¹ Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice and Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland.

PMID: 25609798
DOI: 10.1093/bioinformatics/btv022

Sebastian Deorowicz et al. Bioinformatics. 2015 May 15;31(10):1569-76. doi: 10.1093/bioinformatics/btv022. Epub 2015 Jan 20.

Abstract

Motivation: Building the histogram of occurrences of every k-symbol-long substring of nucleotide data, known as k-mer counting, is a standard step in many bioinformatics applications. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.
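
To make the task concrete, the following is a minimal in-memory sketch of k-mer counting, not the method of the paper. The function names `count_kmers` and `occurrence_histogram` are hypothetical, and the sketch assumes short reads that fit in RAM, ignores reverse complements, and skips k-mers containing symbols other than A, C, G and T.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-symbol substring across all reads (naive, in-memory)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: k-mers with ambiguous symbols (e.g. N) are skipped.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def occurrence_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers having it."""
    return Counter(counts.values())

reads = ["ACGTACGT", "CGTAC"]
counts = count_kmers(reads, 3)
hist = occurrence_histogram(counts)
```

A dictionary of this kind grows with the number of distinct k-mers, which is why memory use becomes the limiting factor on large NGS datasets.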

Results: We present a novel method for k-mer counting that, on large datasets, is about twice as fast as the strongest competitors (Jellyfish 2, KMC 1) while using about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k, x)-mers significantly reduces the I/O, and a highly parallel overall architecture achieves unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 min on a 6-core Intel i7 PC with a solid-state disk.
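
The minimizer idea that the method builds on can be sketched as follows. This is an illustration under stated assumptions only: it uses plain lexicographic minimizers rather than the signatures of KMC 2 (whose selection rules are not given in the abstract), does not model (k, x)-mers, and the names `minimizer` and `super_kmers` are hypothetical. Consecutive k-mers that share a minimizer are merged into one longer substring, so a disk-based counter could write that substring once to the bin named by the minimizer instead of writing each k-mer separately.

```python
def minimizer(kmer, m):
    """Return the lexicographically smallest m-symbol substring of a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def super_kmers(read, k, m):
    """Split a read into runs of consecutive k-mers sharing a minimizer.

    Returns a list of (minimizer, substring) pairs; each substring covers
    every k-mer of its run.
    """
    runs = []
    current, start = None, 0
    for i in range(len(read) - k + 1):
        mz = minimizer(read[i:i + k], m)
        if current is None:
            current, start = mz, i
        elif mz != current:
            # The previous run ends with the k-mer starting at i - 1.
            runs.append((current, read[start:i - 1 + k]))
            current, start = mz, i
    if current is not None:
        runs.append((current, read[start:]))
    return runs

runs = super_kmers("GTACGTTA", k=5, m=2)
```

In this toy read the first three 5-mers share the minimizer `AC`, so they are stored as the single 7-symbol string `GTACGTT` rather than as three separate 5-mers.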
