Container-based bioinformatics with Pachyderm
- PMID: 30101309
- PMCID: PMC6394392
- DOI: 10.1093/bioinformatics/bty699
Abstract
Motivation: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline.
Results: Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, with Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability.
Availability and implementation: Pachyderm is available from https://github.com/pachyderm/pachyderm. The Pachyderm Helm Chart is available from https://github.com/kubernetes/charts/tree/master/stable/pachyderm. Pachyderm is available out-of-the-box from the PhenoMeNal VRE (https://github.com/phnmnl/KubeNow-plugin) and general Kubernetes environments instantiated via KubeNow. The code of the workflow used for the analysis is available on GitHub (https://github.com/pharmbio/LC-MS-Pachyderm).
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press.