Bioinformatics. 2019 Nov 1;35(21):4493-4495.
doi: 10.1093/bioinformatics/btz284.

clustermq enables efficient parallelization of genomic analyses

Michael Schubert. Bioinformatics.

Abstract

Motivation: High performance computing (HPC) clusters play a pivotal role in large-scale bioinformatics analysis and modeling. For the statistical computing language R, packages exist to enable a user to submit their analyses as jobs on HPC schedulers. However, these packages do not scale well to high numbers of tasks, and their processing overhead quickly becomes a prohibitive bottleneck.

Results: Here we present clustermq, an R package that can process analyses up to three orders of magnitude faster than previously published alternatives. We show this for investigating genomic associations of drug sensitivity in cancer cell lines, but it can be applied to any kind of parallelizable workflow.

Availability and implementation: The package is available on CRAN and at https://github.com/mschubert/clustermq. Code for performance testing is available at https://github.com/mschubert/clustermq-performance.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Performance evaluation of HPC packages for (a) processing overhead and (b) application to GDSC data. Along the range of tested number of function calls, clustermq requires substantially less time for processing in both scenarios. Indicated measurements are averages of two runs with range shown as vertical bars. (b) The dashed grey line indicates the actual number of calls required for all GDSC associations.
