Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 May 16:14:160.
doi: 10.1186/1471-2105-14-160.

Disk-based k-mer counting on a PC

Affiliations

Disk-based k-mer counting on a PC

Sebastian Deorowicz et al. BMC Bioinformatics. .

Abstract

Background: The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.

Results: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.

Conclusions: By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.

PubMed Disclaimer

Figures

Figure 1
Figure 1
A scheme of the parallel KMC algorithm.
Figure 2
Figure 2
An example of 2-stage k-mer prefix removal, for one k-mer starting with symbol A and one starting with non-A. In total, 9 starting characters are effectively removed before storing the k-mer on a temporary disk. The part of length p1 stands for the ID of the bin the given k-mer is inserted into, and the part of length p2 is discarded to reduce the temporary storage (a way to recover later the removed part of length p2 is not shown here, for clarity; see text).
Figure 3
Figure 3
Algorithm of splitting k-mers in reads to bins according to their prefix. A full bin is compacted by sorting on p2-symbol long prefix and removing this prefix.
Figure 4
Figure 4
Algorithm of collecting compacted bin chunks from all splitting threads.
Figure 5
Figure 5
Algorithm of sorting k-mers within bins. Fast radix sort procedure is used here.
Figure 6
Figure 6
The KMC processing time (red lines) and used disk space (blue lines) as a function of k for (a) C. elegans dataset, (b) HG02057 dataset. The solid lines are for classic counters, the dashed lines for Quake-compatible counters.
Figure 7
Figure 7
HG02057 (gzipped) dataset, k=40 , classic counters. The KMC processing time (red line) and used disk space (blue line) as a function of allowed RAM memory.
Figure 8
Figure 8
HG02057 (gzipped) dataset, classic counters. Influence of the no. of HDDs on the processing time: 3 HDDs (green line), 2 HDDs (blue line), and 1 HDD (red line).

References

    1. Compeau PE, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nature Biotechnol. 2011;29(11):987–991. doi: 10.1038/nbt.2023. - DOI - PMC - PubMed
    1. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797. doi: 10.1093/nar/gkh340. - DOI - PMC - PubMed
    1. Kurtz S, Narechania A, Stein J, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008;9(1):517. doi: 10.1186/1471-2164-9-517. - DOI - PMC - PubMed
    1. Kelley DR, Schatz MC, Salzberg SL. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010;11(11):R116. doi: 10.1186/gb-2010-11-11-r116. - DOI - PMC - PubMed
    1. Miller JR, Delcher AL, Koren S, Venter E, Walenz B, Brownley A, Johnson J, Li K, Mobarry CM, Sutton GG. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24(24):2818–2824. doi: 10.1093/bioinformatics/btn548. - DOI - PMC - PubMed

Publication types

LinkOut - more resources