Comparative Study
Sci Rep. 2021 Nov 4;11(1):21680.
doi: 10.1038/s41598-021-99288-8.

Design considerations for workflow management systems use in production genomics research and the clinic

Azza E Ahmed et al. Sci Rep. 2021.

Abstract

The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach to, and a systematic evaluation of, key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases, a variant-calling genomic pipeline and a scalability-testing framework, and ran both locally, on an HPC cluster, and in the cloud. This allowed us to evaluate the four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, and ease of development, along with adoption and usage in research labs and healthcare settings. The article addresses the question of which WfMS should be chosen for a given bioinformatics application, regardless of analysis type. The choice of a given WfMS is a function of both its intrinsic language features and its engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia, and by the technical support provided by the WfMS team and community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. Just as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations in tools and utilities for other purposes, such as big data technologies, interoperability, and provenance.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
A WfMS is middleware between the analyst and the computational environment. It encompasses the workflow language specifications to interconnect the analysis executables, and the execution engine to dispatch tasks and manage dependencies on the compute infrastructure.
Figure 2
Bioinformatics workflows with multiple levels of complexity warrant a modular construction. The workflow is easiest to program when its logic (in Tasks, red) is abstracted away from the command-line invocations (in Bash scripts, pink) of the bioinformatics tools (light pink). Individual workflows can in turn be used as subworkflows of a larger Master workflow (e.g., Supplementary Fig. 1). This architecture facilitates the expression of additional complexity arising from optional modules (dashed line), nested levels of parallelism (groups of arrows connecting red rectangles), and scatter-gather patterns (task 2 scattered across samples and merged into task 3).
Figure 3
Scaling of a one-step (solid line) and a two-step (dashed line) workflow in Cromwell+WDL (black) and Nextflow (yellow) on AWS ParallelCluster. The thick green line in the right panel is the theoretical optimum number of nodes occupied by the tasks, computed as the ceiling of the number of tasks divided by the number of cores per node (96). Empty circles denote failed runs.
Figure 4
DAGs corresponding to a simple workflow of two processes (besides output aggregation) used to assess the scalability of the executors in the “Scalability” section, as generated in July 2021 by the most recent version of each executor or of each language's visualization utility.
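
The theoretical optimum described in the Figure 3 caption (the ceiling of tasks divided by cores per node) can be sketched as below. This is only an illustration, assuming each task occupies exactly one core. The function name `optimal_nodes` is hypothetical and does not come from the paper; the default of 96 cores per node is the value quoted in the caption.

```python
CORES_PER_NODE = 96  # cores per node, as quoted in the Figure 3 caption


def optimal_nodes(n_tasks: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Return the smallest number of nodes that can hold n_tasks.

    Assumption (not stated in the source): every task uses exactly one core.
    """
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return (n_tasks + cores_per_node - 1) // cores_per_node


if __name__ == "__main__":
    for tasks in (1, 96, 97, 1000):
        print(tasks, "tasks ->", optimal_nodes(tasks), "node(s)")
```

Under this assumption, 96 tasks fit on a single node, while a 97th task requires a second one.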
