. 2013 May 16:14:160.

doi: 10.1186/1471-2105-14-160.

Disk-based k-mer counting on a PC

Sebastian Deorowicz¹, Agnieszka Debudaj-Grabysz, Szymon Grabowski

Affiliations

PMID: 23679007
PMCID: PMC3680041
DOI: 10.1186/1471-2105-14-160

Disk-based k-mer counting on a PC

Sebastian Deorowicz et al. BMC Bioinformatics. 2013.

. 2013 May 16:14:160.

doi: 10.1186/1471-2105-14-160.

Authors

Sebastian Deorowicz¹, Agnieszka Debudaj-Grabysz, Szymon Grabowski

Affiliation

¹ Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland. sebastian.deorowicz@polsl.pl

PMID: 23679007
PMCID: PMC3680041
DOI: 10.1186/1471-2105-14-160

Abstract

Background: The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.

Results: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.

Conclusions: By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.

PubMed Disclaimer

Figures

**Figure 1**
A scheme of the parallel KMC algorithm.

**Figure 2**
**An example of 2-stage** k-mer prefix removal, for one k-mer starting with symbol A and one starting with non-A. In total, 9 starting characters are effectively removed before storing the k-mer on a temporary disk. The part of length p₁ stands for the ID of the bin the given k-mer is inserted into, and the part of length p₂ is discarded to reduce the temporary storage (a way to recover later the removed part of length p₂ is not shown here, for clarity; see text).

**Figure 3**
**Algorithm of splitting** k-mers in reads to bins according to their prefix. A full bin is compacted by sorting on p₂-symbol long prefix and removing this prefix.

**Figure 4**
Algorithm of collecting compacted bin chunks from all splitting threads.

**Figure 5**
**Algorithm of sorting** k-mers within bins. Fast radix sort procedure is used here.

**Figure 6**
**The KMC processing time (red lines) and used disk space (blue lines) as a function of k for (a) C. elegans dataset, (b) HG02057 dataset.** The solid lines are for classic counters, the dashed lines for Quake-compatible counters.

**Figure 7**
**HG02057 (gzipped) dataset,** ***k=40*** **, classic counters.** The KMC processing time (red line) and used disk space (blue line) as a function of allowed RAM memory.

**Figure 8**
**HG02057 (gzipped) dataset, classic counters.** Influence of the no. of HDDs on the processing time: 3 HDDs (green line), 2 HDDs (blue line), and 1 HDD (red line).

See this image and copyright information in PMC

References

1. Compeau PE, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nature Biotechnol. 2011;29(11):987–991. doi: 10.1038/nbt.2023. - DOI - PMC - PubMed
1. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797. doi: 10.1093/nar/gkh340. - DOI - PMC - PubMed
1. Kurtz S, Narechania A, Stein J, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008;9(1):517. doi: 10.1186/1471-2164-9-517. - DOI - PMC - PubMed
1. Kelley DR, Schatz MC, Salzberg SL. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010;11(11):R116. doi: 10.1186/gb-2010-11-11-r116. - DOI - PMC - PubMed
1. Miller JR, Delcher AL, Koren S, Venter E, Walenz B, Brownley A, Johnson J, Li K, Mobarry CM, Sutton GG. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24(24):2818–2824. doi: 10.1093/bioinformatics/btn548. - DOI - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Disk-based k-mer counting on a PC

Affiliation

Disk-based k-mer counting on a PC

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources

Miscellaneous