Container-based bioinformatics with Pachyderm

Jon Ander Novella et al. Bioinformatics. 2019 Mar 1;35(5):839-846. doi: 10.1093/bioinformatics/bty699.
Abstract

Motivation: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline.

Results: Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, with Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability.

Availability and implementation: Pachyderm is available from https://github.com/pachyderm/pachyderm. The Pachyderm Helm Chart is available from https://github.com/kubernetes/charts/tree/master/stable/pachyderm. Pachyderm is available out-of-the-box from the PhenoMeNal VRE (https://github.com/phnmnl/KubeNow-plugin) and general Kubernetes environments instantiated via KubeNow. The code of the workflow used for the analysis is available on GitHub (https://github.com/pharmbio/LC-MS-Pachyderm).

Supplementary information: Supplementary data are available at Bioinformatics online.
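As a rough illustration of the deployment path described in the Results and Availability sections, the sketch below drives Helm from Python to install the Pachyderm chart into an existing Kubernetes cluster. It is a minimal, hypothetical example: the chart reference stable/pachyderm and the Helm 2 style --name flag match the chart location given above and the tooling of that era, but release names, values and flags should be checked against the chart's README for the versions actually in use.

    import subprocess

    def install_pachyderm(release: str = "pachyderm") -> None:
        """Install the Pachyderm Helm chart into the current kubectl context.

        Minimal sketch: assumes Helm 2 (--name flag) and the old "stable"
        chart repository referenced in the Availability section. Adjust the
        chart reference and flags for the Helm/chart version in use.
        """
        subprocess.run(
            ["helm", "install", "--name", release, "stable/pachyderm"],
            check=True,  # raise if helm exits with a non-zero status
        )

    if __name__ == "__main__":
        install_pachyderm()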


Figures

Fig. 1.
(a) The Pachyderm daemon. pachd is the Pachyderm daemon that manages Pachyderm's pipelining and data versioning features. Its main components are (i) a file system component, (ii) a block store component and (iii) a pipelining component. The file system component handles all requests related to putting data into and getting data out of Pachyderm Data Repositories (PDRs). To this end, it cooperates with the block store component to content-address new data, put new objects into the backing object store, pull objects out of the backing object store, etc. The pipelining component creates and manages all of the pipeline workers, which process data in Pachyderm pipelines. It cooperates with the file system component to make sure that the correct subsets/versions of data (versioned in PDRs) are provided to the correct pipeline workers, so that data is processed in the sequence and manner specified by users. To coordinate and track all of these actions, pachd stores and queries metadata in etcd, a distributed key/value store that is also deployed in a pod on Kubernetes, and it communicates with the Kubernetes API server and the backing object-store service. Pachyderm further optimizes uploads/downloads of data via an internal caching system. (b) A typical infrastructure and services setup with Pachyderm. A standard Kubernetes cluster contains two major kinds of entity, represented by two different polygonal shapes: cloud VMs/on-premise nodes are depicted as hexagons, whereas Kubernetes pods are displayed as rounded rectangles. Optional nodes/pods are depicted with dashed borders. The master node coordinates the rest of the nodes, runs the Kubernetes API and can use a reverse proxy such as Træfik (https://traefik.io/). All Pachyderm-related pods are scheduled in the service nodes: the Pachyderm daemon, the Pachyderm pipeline workers and etcd. Minio services, responsible for uploading/downloading data to/from the backing storage, can also be deployed in service nodes. The optional dedicated storage node provides application containers with a shared file system (e.g. GlusterFS) using block storage volumes
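The deployment in panel (b) can be inspected with the official Kubernetes Python client. The sketch below lists the pods in a namespace and reports which node each is scheduled on, which makes the pachd, etcd and pipeline-worker pods described above visible. The namespace is an assumption: Pachyderm components are often deployed into the default namespace, but this depends on the installation.

    from kubernetes import client, config

    def list_pachyderm_pods(namespace: str = "default") -> None:
        """Print name, phase and node for every pod in the given namespace.

        Assumes a working kubeconfig for the target cluster; the namespace is
        an assumption and may differ (e.g. a dedicated pachyderm namespace).
        """
        config.load_kube_config()          # use the local kubeconfig/credentials
        v1 = client.CoreV1Api()
        for pod in v1.list_namespaced_pod(namespace=namespace).items:
            # pachd, etcd and pipeline-worker pods show up here alongside any
            # other workloads scheduled in the same namespace.
            print(f"{pod.metadata.name:40s} {pod.status.phase:10s} {pod.spec.node_name}")

    if __name__ == "__main__":
        list_pachyderm_pods()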
Fig. 2.
Data provenance and incremental processing in Pachyderm. The upper part of the figure shows the Pachyderm Data Repositories (PDRs) present in the Pachyderm File System (PFS) after creating a bioinformatics workflow. These repositories contain a tree-like structure in which each node represents a separate commit. The lower part of the figure shows the different pipeline stages of the workflow in the Pachyderm Pipeline System (PPS), together with their corresponding inputs and outputs. When a new data commit (green) is added to the input data Repo A, a new pipeline stage is triggered for Tool B in the PPS, leading in turn to a new commit (blue) in Tool B’s output repository. The provenance of the blue commit on Repo B is therefore (i) the green commit from Repo A and (ii) Tool B’s pipeline specification. Note that the commit structure looks similar for the two data repositories because the data pipeline is linear. The repos created by Tool C and Tool D do not yet have corresponding commits because the data processing has not reached that level in the pipeline. As new commits are added to the PFS, the PPS triggers the corresponding pipeline stages with the new datums (minimal computing units) from the commit. This is referred to as incremental processing, as only new computing units are processed. These new datums are then computed, creating further commits on downstream repositories and providing a mechanism to track the provenance of the computations
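For concreteness, a pipeline stage such as Tool B is declared with a small JSON specification; the sketch below builds one as a Python dictionary. The repo and image names are hypothetical, and the field names should be checked against the pipeline-specification reference of the Pachyderm release in use. The glob pattern is what splits a commit into datums, so when a new commit lands in the input repo only the new datums are processed, as described in the caption above.

    import json

    # A minimal, hypothetical pipeline specification for a stage like "Tool B".
    # Input data appears inside the worker container under /pfs/<repo>, and
    # anything written to /pfs/out becomes a commit in the output repository
    # named after the pipeline ("tool-b" here).
    tool_b_spec = {
        "pipeline": {"name": "tool-b"},
        "transform": {
            "image": "example/tool-b:latest",   # hypothetical container image
            "cmd": ["sh", "-c", "tool-b /pfs/repo-a/* --out /pfs/out/"],
        },
        "input": {
            # Each top-level entry in repo-a becomes one datum; only new datums
            # from a new commit are (re)processed -> incremental processing.
            "pfs": {"repo": "repo-a", "glob": "/*"},
        },
    }

    with open("tool-b.json", "w") as fh:
        json.dump(tool_b_spec, fh, indent=2)

    # The spec can then be submitted with the Pachyderm CLI, e.g.:
    #   pachctl create pipeline -f tool-b.json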
Fig. 3.
LC-MS workflow definition. The workflow consists of five main components including quantification, matching and filtering, annotation, identification and statistics. The raw MS1 data in open-source format (e.g. mzML) is accepted as input. In the quantification component, the raw data is first centroided, calibrated and the signals from each metabolite are clustered into mass traces. In the matching and filtering component, the retention time drift is corrected and the mass traces are matched across the samples. The non-biologically relevant signals are filtered based on presence/absence in blank samples as well as correlation to dilution series. In the annotation component, the mass traces are annotated with adduct and isotope information. This information is used in the identification component to calculate the neutral mass of the precursor ions. The identification is then performed and the resulting scores are converted to posterior error probability values. The data are then limited to the mass traces annotated with an identification hit and subjected to multivariate data analysis. Note that the pipeline stages chosen for the performance benchmarks are illustrated with dashed borders
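Mapped onto Pachyderm, a linear workflow like this becomes a chain of pipeline specifications in which each stage reads from the output repository of the previous one (the output repo carries the pipeline's name). The loop below sketches that chaining for the five components; the stage and image names are hypothetical placeholders for the actual workflow tools, whose real specifications live in the LC-MS-Pachyderm repository listed in the Availability section.

    # Hypothetical sketch of chaining the five LC-MS components as Pachyderm
    # pipelines: each stage's input repo is the previous stage's output repo.
    stages = ["quantification", "matching-filtering", "annotation",
              "identification", "statistics"]

    specs = []
    previous_repo = "raw-mzml"                   # hypothetical raw-data repo
    for stage in stages:
        specs.append({
            "pipeline": {"name": stage},
            "transform": {
                "image": f"example/{stage}:latest",  # placeholder image
                "cmd": ["sh", "-c", f"run-{stage} /pfs/{previous_repo} /pfs/out"],
            },
            "input": {"pfs": {"repo": previous_repo, "glob": "/*"}},
        })
        previous_repo = stage                    # output repo == pipeline name

    for spec in specs:
        print(spec["pipeline"]["name"], "<-", spec["input"]["pfs"]["repo"])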
Fig. 4.
Performance metrics. Each panel displays the speedup (right axis, grey line) and scaling efficiency (left axis, black line) obtained when utilizing various numbers of workers with three different tools of the metabolomics workflow
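Assuming the usual definitions, speedup S(n) = T(1)/T(n) and scaling efficiency E(n) = S(n)/n, which appear to be what these plots report, the helper below computes both from wall-clock run times; the timing values in the example are made up.

    def speedup_and_efficiency(t1: float, tn: float, n: int) -> tuple[float, float]:
        """Return (speedup, scaling efficiency) for n workers.

        Standard definitions are assumed: speedup = T(1)/T(n) and
        efficiency = speedup / n. The numbers below are illustrative only.
        """
        s = t1 / tn
        return s, s / n

    # Hypothetical single-worker and 8-worker run times in seconds.
    s, e = speedup_and_efficiency(t1=3600.0, tn=520.0, n=8)
    print(f"speedup = {s:.2f}x, efficiency = {e:.1%}")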

Similar articles

  • Interoperable and scalable data analysis with microservices: applications in metabolomics.
    Emami Khoonsari P, Moreno P, Bergmann S, Burman J, Capuccini M, Carone M, Cascante M, de Atauri P, Foguet C, Gonzalez-Beltran AN, Hankemeier T, Haug K, He S, Herman S, Johnson D, Kale N, Larsson A, Neumann S, Peters K, Pireddu L, Rocca-Serra P, Roger P, Rueedi R, Ruttkies C, Sadawi N, Salek RM, Sansone SA, Schober D, Selivanov V, Thévenot EA, van Vliet M, Zanetti G, Steinbeck C, Kultima K, Spjuth O. Emami Khoonsari P, et al. Bioinformatics. 2019 Oct 1;35(19):3752-3760. doi: 10.1093/bioinformatics/btz160. Bioinformatics. 2019. PMID: 30851093 Free PMC article.
  • PhenoMeNal: processing and analysis of metabolomics data in the cloud.
    Peters K, Bradbury J, Bergmann S, Capuccini M, Cascante M, de Atauri P, Ebbels TMD, Foguet C, Glen R, Gonzalez-Beltran A, Günther UL, Handakas E, Hankemeier T, Haug K, Herman S, Holub P, Izzo M, Jacob D, Johnson D, Jourdan F, Kale N, Karaman I, Khalili B, Emami Khonsari P, Kultima K, Lampa S, Larsson A, Ludwig C, Moreno P, Neumann S, Novella JA, O'Donovan C, Pearce JTM, Peluso A, Piras ME, Pireddu L, Reed MAC, Rocca-Serra P, Roger P, Rosato A, Rueedi R, Ruttkies C, Sadawi N, Salek RM, Sansone SA, Selivanov V, Spjuth O, Schober D, Thévenot EA, Tomasoni M, van Rijswijk M, van Vliet M, Viant MR, Weber RJM, Zanetti G, Steinbeck C. Peters K, et al. Gigascience. 2019 Feb 1;8(2):giy149. doi: 10.1093/gigascience/giy149. Gigascience. 2019. PMID: 30535405 Free PMC article.
  • Automated workflow composition in mass spectrometry-based proteomics.
    Palmblad M, Lamprecht AL, Ison J, Schwämmle V. Palmblad M, et al. Bioinformatics. 2019 Feb 15;35(4):656-664. doi: 10.1093/bioinformatics/bty646. Bioinformatics. 2019. PMID: 30060113 Free PMC article.
  • Navigating freely-available software tools for metabolomics analysis.
    Spicer R, Salek RM, Moreno P, Cañueto D, Steinbeck C. Spicer R, et al. Metabolomics. 2017;13(9):106. doi: 10.1007/s11306-017-1242-7. Epub 2017 Aug 9. Metabolomics. 2017. PMID: 28890673 Free PMC article. Review.
  • Containers in Bioinformatics: Applications, Practical Considerations, and Best Practices in Molecular Pathology.
    Kadri S, Sboner A, Sigaras A, Roy S. Kadri S, et al. J Mol Diagn. 2022 May;24(5):442-454. doi: 10.1016/j.jmoldx.2022.01.006. Epub 2022 Feb 18. J Mol Diagn. 2022. PMID: 35189355 Review.


