Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015 Aug 1;31(15):2482-8.
doi: 10.1093/bioinformatics/btv179. Epub 2015 Mar 26.

Halvade: scalable sequence analysis with MapReduce

Affiliations

Halvade: scalable sequence analysis with MapReduce

Dries Decap et al. Bioinformatics. .

Abstract

Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.

Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Overview of the Halvade framework. The entries of pairs of input FASTQ files (containing paired-end reads) are interleaved and stored as smaller chunks. Map tasks are executed in parallel, each task taking a single chunk as input and aligning the reads to a reference genome using an existing tool. The map tasks emit pairs where the key contains positional information of an aligned read and the value corresponds to a SAM record. The aligned reads are grouped and sorted per chromosomal region. Chromosomal regions are processed in parallel in the reduce phase, this includes data preparation and variant detection again using tools of choice. Each reduce task outputs the variants of the region it processed. These variants can optionally be merged into a single VCF file. Note that the names of the tools shown correspond to those of the GATK Best Practices DNA-seq implementation in Halvade
Fig. 2.
Fig. 2.
The parallel speedup (multithreading) of five GATK modules used in the Best Practices pipeline on a 16-core node with 94 GB of RAM. The limited speedup prevents the efficient use of this node with more than a handful of CPU cores. Option -nt denotes data threads while option -nct denotes CPU threads (cfr. GATK manual)
Fig. 3.
Fig. 3.
The speedup (primary y-axis) and parallel efficiency (secondary y-axis) of Halvade as a function of number of parallel tasks (cluster size) on both an Intel Big Data cluster and Amazon EMR

References

    1. Dean J., Ghemawat S. (2008) MapReduce: simplified data processing on large clusters. Commun. ACM, 51, 107–113.
    1. Depristo M.A., et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet., 43, 491––498.. - PMC - PubMed
    1. Deyhim P. (2013). Best Practices for Amazon EMR. Technical report, Amazon Web Services Inc.
    1. Fonseca N.A., et al. (2012) Tools for mapping high-throughput sequencing data. Bioinformatics, 28, 3169–3177. - PubMed
    1. Langmead B., et al. (2009a) Searching for SNPs with cloud computing. Genome Biol., 10, R134. - PMC - PubMed

Publication types