Container-based bioinformatics with Pachyderm

Jon Ander Novella et al. Bioinformatics. 2019 Mar 1;35(5):839-846. doi: 10.1093/bioinformatics/bty699.
Abstract

Motivation: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline.

Results: Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, with Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability.

Availability and implementation: Pachyderm is available from https://github.com/pachyderm/pachyderm. The Pachyderm Helm Chart is available from https://github.com/kubernetes/charts/tree/master/stable/pachyderm. Pachyderm is available out-of-the-box from the PhenoMeNal VRE (https://github.com/phnmnl/KubeNow-plugin) and general Kubernetes environments instantiated via KubeNow. The code of the workflow used for the analysis is available on GitHub (https://github.com/pharmbio/LC-MS-Pachyderm).

Supplementary information: Supplementary data are available at Bioinformatics online.
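As a rough illustration of the deployment path described in the Results and Availability sections, the sketch below drives Helm from Python to install the Pachyderm chart into an existing Kubernetes cluster. It is a minimal, hypothetical example: the chart reference stable/pachyderm and the Helm 2 style --name flag match the chart location given above and the tooling of that era, but release names, values and flags should be checked against the chart's README for the versions actually in use.

    import subprocess

    def install_pachyderm(release: str = "pachyderm") -> None:
        """Install the Pachyderm Helm chart into the current kubectl context.

        Minimal sketch: assumes Helm 2 (--name flag) and the old "stable"
        chart repository referenced in the Availability section. Adjust the
        chart reference and flags for the Helm/chart version in use.
        """
        subprocess.run(
            ["helm", "install", "--name", release, "stable/pachyderm"],
            check=True,  # raise if helm exits with a non-zero status
        )

    if __name__ == "__main__":
        install_pachyderm()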


Figures

Fig. 1.
(a) The Pachyderm daemon. pachd is the Pachyderm daemon that manages Pachyderm's pipelining and data versioning features. Its main components are (i) a file system component, (ii) a block store component and (iii) a pipelining component. The file system component handles all requests related to putting data into and getting data out of Pachyderm Data Repositories (PDRs). To this end, it cooperates with the block store component to content-address new data, put new objects into the backing object store, pull objects out of the backing object store, etc. The pipelining component creates and manages all of the pipeline workers, which process data in Pachyderm pipelines. It cooperates with the file system component to make sure that the correct subsets/versions of data (versioned in PDRs) are provided to the correct pipeline workers, so that data is processed in the sequence and manner specified by users. To coordinate and track all of these actions, pachd stores and queries metadata in etcd, a distributed key/value store that is also deployed in a pod on Kubernetes, and it communicates with the Kubernetes API server and the backing object-store service. Pachyderm further optimizes uploads/downloads of data via an internal caching system. (b) A typical infrastructure and services setup with Pachyderm. A standard Kubernetes cluster contains two major kinds of entity, represented by two different polygonal shapes: cloud VMs/on-premise nodes are depicted as hexagons, whereas Kubernetes pods are displayed as rounded rectangles. Optional nodes/pods are depicted with dashed borders. The master node coordinates the rest of the nodes, runs the Kubernetes API and can use a reverse proxy such as Træfik (https://traefik.io/). All Pachyderm-related pods are scheduled in the service nodes: the Pachyderm daemon, the Pachyderm pipeline workers and etcd. Minio services, responsible for uploading/downloading data to/from the backing storage, can also be deployed in service nodes. The optional dedicated storage node provides application containers with a shared file system (e.g. GlusterFS) using block storage volumes
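The deployment in panel (b) can be inspected with the official Kubernetes Python client. The sketch below lists the pods in a namespace and reports which node each is scheduled on, which makes the pachd, etcd and pipeline-worker pods described above visible. The namespace is an assumption: Pachyderm components are often deployed into the default namespace, but this depends on the installation.

    from kubernetes import client, config

    def list_pachyderm_pods(namespace: str = "default") -> None:
        """Print name, phase and node for every pod in the given namespace.

        Assumes a working kubeconfig for the target cluster; the namespace is
        an assumption and may differ (e.g. a dedicated pachyderm namespace).
        """
        config.load_kube_config()          # use the local kubeconfig/credentials
        v1 = client.CoreV1Api()
        for pod in v1.list_namespaced_pod(namespace=namespace).items:
            # pachd, etcd and pipeline-worker pods show up here alongside any
            # other workloads scheduled in the same namespace.
            print(f"{pod.metadata.name:40s} {pod.status.phase:10s} {pod.spec.node_name}")

    if __name__ == "__main__":
        list_pachyderm_pods()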
Fig. 2.
Data provenance and incremental processing in Pachyderm. The upper part of the figure shows the Pachyderm Data Repositories (PDRs) present in the Pachyderm File System (PFS) after creating a bioinformatics workflow. These repositories contain a tree-like structure in which each node represents a separate commit. The lower part of the figure shows the different pipeline stages of the workflow in the Pachyderm Pipeline System (PPS), together with their corresponding inputs and outputs. When a new data commit (green) is added to the input data Repo A, a new pipeline stage is triggered for Tool B in the PPS, leading in turn to a new commit (blue) in Tool B’s output repository. The provenance of the blue commit on Repo B is therefore (i) the green commit from Repo A and (ii) Tool B’s pipeline specification. Note that the commit structure looks similar for the two data repositories because the data pipeline is linear. The repos created by Tool C and Tool D do not yet have corresponding commits because the data processing has not reached that level in the pipeline. As new commits are added to the PFS, the PPS triggers the corresponding pipeline stages with the new datums (minimal computing units) from the commit. This is referred to as incremental processing, as only new computing units are processed. These new datums are then computed, creating further commits on downstream repositories and providing a mechanism to track the provenance of the computations
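For concreteness, a pipeline stage such as Tool B is declared with a small JSON specification; the sketch below builds one as a Python dictionary. The repo and image names are hypothetical, and the field names should be checked against the pipeline-specification reference of the Pachyderm release in use. The glob pattern is what splits a commit into datums, so when a new commit lands in the input repo only the new datums are processed, as described in the caption above.

    import json

    # A minimal, hypothetical pipeline specification for a stage like "Tool B".
    # Input data appears inside the worker container under /pfs/<repo>, and
    # anything written to /pfs/out becomes a commit in the output repository
    # named after the pipeline ("tool-b" here).
    tool_b_spec = {
        "pipeline": {"name": "tool-b"},
        "transform": {
            "image": "example/tool-b:latest",   # hypothetical container image
            "cmd": ["sh", "-c", "tool-b /pfs/repo-a/* --out /pfs/out/"],
        },
        "input": {
            # Each top-level entry in repo-a becomes one datum; only new datums
            # from a new commit are (re)processed -> incremental processing.
            "pfs": {"repo": "repo-a", "glob": "/*"},
        },
    }

    with open("tool-b.json", "w") as fh:
        json.dump(tool_b_spec, fh, indent=2)

    # The spec can then be submitted with the Pachyderm CLI, e.g.:
    #   pachctl create pipeline -f tool-b.json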
Fig. 3.
LC-MS workflow definition. The workflow consists of five main components including quantification, matching and filtering, annotation, identification and statistics. The raw MS1 data in open-source format (e.g. mzML) is accepted as input. In the quantification component, the raw data is first centroided, calibrated and the signals from each metabolite are clustered into mass traces. In the matching and filtering component, the retention time drift is corrected and the mass traces are matched across the samples. The non-biologically relevant signals are filtered based on presence/absence in blank samples as well as correlation to dilution series. In the annotation component, the mass traces are annotated with adduct and isotope information. This information is used in the identification component to calculate the neutral mass of the precursor ions. The identification is then performed and the resulting scores are converted to posterior error probability values. The data are then limited to the mass traces annotated with an identification hit and subjected to multivariate data analysis. Note that the pipeline stages chosen for the performance benchmarks are illustrated with dashed borders
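Mapped onto Pachyderm, a linear workflow like this becomes a chain of pipeline specifications in which each stage reads from the output repository of the previous one (the output repo carries the pipeline's name). The loop below sketches that chaining for the five components; the stage and image names are hypothetical placeholders for the actual workflow tools, whose real specifications live in the LC-MS-Pachyderm repository listed in the Availability section.

    # Hypothetical sketch of chaining the five LC-MS components as Pachyderm
    # pipelines: each stage's input repo is the previous stage's output repo.
    stages = ["quantification", "matching-filtering", "annotation",
              "identification", "statistics"]

    specs = []
    previous_repo = "raw-mzml"                   # hypothetical raw-data repo
    for stage in stages:
        specs.append({
            "pipeline": {"name": stage},
            "transform": {
                "image": f"example/{stage}:latest",  # placeholder image
                "cmd": ["sh", "-c", f"run-{stage} /pfs/{previous_repo} /pfs/out"],
            },
            "input": {"pfs": {"repo": previous_repo, "glob": "/*"}},
        })
        previous_repo = stage                    # output repo == pipeline name

    for spec in specs:
        print(spec["pipeline"]["name"], "<-", spec["input"]["pfs"]["repo"])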
Fig. 4.
Performance metrics. Each panel displays the speedup (right axis, grey line) and scaling efficiency (left axis, black line) obtained when utilizing various numbers of workers with three different tools of the metabolomics workflow
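Assuming the usual definitions, speedup S(n) = T(1)/T(n) and scaling efficiency E(n) = S(n)/n, which appear to be what these plots report, the helper below computes both from wall-clock run times; the timing values in the example are made up.

    def speedup_and_efficiency(t1: float, tn: float, n: int) -> tuple[float, float]:
        """Return (speedup, scaling efficiency) for n workers.

        Standard definitions are assumed: speedup = T(1)/T(n) and
        efficiency = speedup / n. The numbers below are illustrative only.
        """
        s = t1 / tn
        return s, s / n

    # Hypothetical single-worker and 8-worker run times in seconds.
    s, e = speedup_and_efficiency(t1=3600.0, tn=520.0, n=8)
    print(f"speedup = {s:.2f}x, efficiency = {e:.1%}")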

Similar articles

  • Interoperable and scalable data analysis with microservices: applications in metabolomics.
    Emami Khoonsari P, Moreno P, Bergmann S, Burman J, Capuccini M, Carone M, Cascante M, de Atauri P, Foguet C, Gonzalez-Beltran AN, Hankemeier T, Haug K, He S, Herman S, Johnson D, Kale N, Larsson A, Neumann S, Peters K, Pireddu L, Rocca-Serra P, Roger P, Rueedi R, Ruttkies C, Sadawi N, Salek RM, Sansone SA, Schober D, Selivanov V, Thévenot EA, van Vliet M, Zanetti G, Steinbeck C, Kultima K, Spjuth O. Emami Khoonsari P, et al. Bioinformatics. 2019 Oct 1;35(19):3752-3760. doi: 10.1093/bioinformatics/btz160. Bioinformatics. 2019. PMID: 30851093 Free PMC article.
  • PhenoMeNal: processing and analysis of metabolomics data in the cloud.
    Peters K, Bradbury J, Bergmann S, Capuccini M, Cascante M, de Atauri P, Ebbels TMD, Foguet C, Glen R, Gonzalez-Beltran A, Günther UL, Handakas E, Hankemeier T, Haug K, Herman S, Holub P, Izzo M, Jacob D, Johnson D, Jourdan F, Kale N, Karaman I, Khalili B, Emami Khonsari P, Kultima K, Lampa S, Larsson A, Ludwig C, Moreno P, Neumann S, Novella JA, O'Donovan C, Pearce JTM, Peluso A, Piras ME, Pireddu L, Reed MAC, Rocca-Serra P, Roger P, Rosato A, Rueedi R, Ruttkies C, Sadawi N, Salek RM, Sansone SA, Selivanov V, Spjuth O, Schober D, Thévenot EA, Tomasoni M, van Rijswijk M, van Vliet M, Viant MR, Weber RJM, Zanetti G, Steinbeck C. Peters K, et al. Gigascience. 2019 Feb 1;8(2):giy149. doi: 10.1093/gigascience/giy149. Gigascience. 2019. PMID: 30535405 Free PMC article.
  • Automated workflow composition in mass spectrometry-based proteomics.
    Palmblad M, Lamprecht AL, Ison J, Schwämmle V. Palmblad M, et al. Bioinformatics. 2019 Feb 15;35(4):656-664. doi: 10.1093/bioinformatics/bty646. Bioinformatics. 2019. PMID: 30060113 Free PMC article.
  • Navigating freely-available software tools for metabolomics analysis.
    Spicer R, Salek RM, Moreno P, Cañueto D, Steinbeck C. Spicer R, et al. Metabolomics. 2017;13(9):106. doi: 10.1007/s11306-017-1242-7. Epub 2017 Aug 9. Metabolomics. 2017. PMID: 28890673 Free PMC article. Review.
  • Containers in Bioinformatics: Applications, Practical Considerations, and Best Practices in Molecular Pathology.
    Kadri S, Sboner A, Sigaras A, Roy S. Kadri S, et al. J Mol Diagn. 2022 May;24(5):442-454. doi: 10.1016/j.jmoldx.2022.01.006. Epub 2022 Feb 18. J Mol Diagn. 2022. PMID: 35189355 Review.


