Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Mar 15;27(6):764-70.
doi: 10.1093/bioinformatics/btr011. Epub 2011 Jan 7.

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Affiliations

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Guillaume Marçais et al. Bioinformatics. .

Abstract

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.

Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Memory usage for various levels of sequencing coverage on reads generated during the Turkey genome project when counting 22-mers. Except for Tallymer (which is inherently single threaded), all programs were run using 32 threads. The memory usage for the serial and 32-thread versions of Jellyfish is almost identical (results are shown using 32 threads).
Fig. 2.
Fig. 2.
Computation time versus sequencing coverage on reads generated during the Turkey genome project. Except as noted all programs were run using 32 threads.
Fig. 3.
Fig. 3.
A trace of Jellyfish's CPU usage and IO throughput on counting 22-mers on coverage 5× of the Turkey reads with 32 threads. CPU usage is split into ‘system’ (corresponding to all system calls for memory allocation, read/write from disk, etc.) and ‘user’ (the program). The ‘percent activity’ is a global activity measure over all 32 cores. The IO throughput is split into ‘in’ for input and ‘out’ for output.
Fig. 4.
Fig. 4.
Speedup versus number of threads on coverage 5× of the Turkey reads. On this log–log scale plot, a perfectly linear speedup would correspond to a diagonal line. The ‘no IO’ curves includes only the initialization and counting phase times. The curve marked ‘with IO’ counts the total runtime.

References

    1. Campagna D, et al. RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics. 2005;21:582–588. - PubMed
    1. Cormen T, et al. Introduction to Algorithms. Cambridge, MA: MIT Press; 1990. Chapter 12.
    1. Dalloul RA, et al. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biol. 2010;8:e1000475. - PMC - PubMed
    1. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun. ACM. 2008;51:107–113.
    1. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. - PMC - PubMed

Publication types