Halvade: scalable sequence analysis with MapReduce
- PMID: 25819078
- PMCID: PMC4514927
- DOI: 10.1093/bioinformatics/btv179
Abstract
Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
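To make the terms "speedup" and "parallel efficiency" concrete, the sketch below uses the standard definitions: speedup is the baseline runtime divided by the parallel runtime, and efficiency is speedup divided by the number of workers. The function names and the runtimes in the usage lines are hypothetical placeholders chosen for illustration. They are not measurements reported for Halvade.

```python
def speedup(t_baseline_h: float, t_parallel_h: float) -> float:
    """Speedup S = T_baseline / T_parallel (both runtimes in hours)."""
    if t_baseline_h <= 0 or t_parallel_h <= 0:
        raise ValueError("runtimes must be positive")
    return t_baseline_h / t_parallel_h


def parallel_efficiency(t_baseline_h: float, t_parallel_h: float, workers: int) -> float:
    """Efficiency E = S / p, where p is the number of workers (nodes or cores)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_baseline_h, t_parallel_h) / workers


# Hypothetical runtimes for illustration only (NOT taken from the paper):
# a job that takes 30 h on one node and 2.5 h on 15 nodes.
example_speedup = speedup(30.0, 2.5)
example_efficiency = parallel_efficiency(30.0, 2.5, workers=15)
print(f"speedup: {example_speedup:.1f}x, efficiency: {example_efficiency:.0%}")
```

An efficiency close to 1.0 means that adding workers reduces the runtime almost proportionally. Values well below 1.0 indicate overhead from communication, load imbalance, or serial stages.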
© The Author 2015. Published by Oxford University Press.