F1000Res. 2021 Jan 18:10:33.
doi: 10.12688/f1000research.29032.3. eCollection 2021.

Sustainable data analysis with Snakemake

Felix Mölder et al. F1000Res. 2021.

Abstract

Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.

Keywords: adaptability; data analysis; reproducibility; scalability; sustainability; transparency; workflow management.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Hierarchy of aspects to consider for sustainable data analysis.
By supporting the top layer, a workflow management system can promote the center layer, and thereby help to obtain true sustainability.
Figure 2. Citations and development of Snakemake.
(a) Cumulative number of git commits over time; releases are marked as vertical ticks. (b) Citations by year of the original and this Snakemake article (only complete years shown). (c) Citations by scientific discipline of the citing articles. Both this publication (rolling) and the original publication (original, 18) are considered. Data sources: https://badge.dimensions.ai/details/id/pub.1018944052, https://badge.dimensions.ai/details/id/pub.1137313608 (2024/09/20).
Figure 3. Example Snakemake workflow.
(a) Workflow definition; the hypothesized knowledge requirement for line readability is color-coded on the left, next to the line numbers. (b) Directed acyclic graph (DAG) of jobs, representing the execution plan automatically derived from the example workflow; job node colors reflect rule colors in the workflow definition. (c) Content of the script plot-hist.py referred to by rule plot_histogram. (d) Knowledge requirements for readability by statement category (see subsection 3.3). The example workflow downloads data, plots histograms of city populations within a given list of countries, and converts these from SVG to PDF format. Note that this is solely meant as a short yet comprehensive demonstration of the Snakemake syntax.
Figure 4. Snakemake scheduling problem.
(a) Example workflow DAG. The greenish area depicts the jobs that are ready for scheduling (because all input files are present) at a given time during the workflow execution. We assume that the red job at the root generates a temporary file, which may be deleted once all blue jobs are finished. (b) Suboptimal scheduling solution: two green jobs are scheduled, such that only one blue job can be scheduled and the temporary file generated by the red job has to remain on disk until all blue jobs are finished in a subsequent scheduling step. (c) Optimal scheduling solution: the three blue jobs are scheduled, such that the temporary file generated by the red job can be deleted afterwards.
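The trade-off in Figure 4 can be illustrated with a small brute-force sketch. This is not Snakemake's scheduler. It assumes three free cores, one core per job, and hypothetical job and file names (blue1, green1, red.tmp), and it ranks every feasible selection of ready jobs first by how many temporary files become deletable and then by how many jobs run.

```python
from itertools import combinations


def score(selection, temp_consumers):
    """Rank a selection: temporary files freed first, then number of jobs run."""
    done = set(selection)
    freed = sum(1 for consumers in temp_consumers.values() if consumers <= done)
    return (freed, len(selection))


def best_selection(ready_jobs, cores, temp_consumers):
    """Exhaustively pick the best-scoring subset of ready jobs that fits into `cores`.

    ready_jobs maps a job name to its core demand; temp_consumers maps a
    temporary file to the set of jobs that must finish before it can be deleted.
    """
    best, best_score = (), (-1, -1)
    names = sorted(ready_jobs)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if sum(ready_jobs[job] for job in subset) > cores:
                continue  # selection does not fit into the available cores
            current = score(subset, temp_consumers)
            if current > best_score:
                best, best_score = subset, current
    return best, best_score


# Hypothetical instance mirroring the figure: three blue and two green jobs are ready.
ready = {"blue1": 1, "blue2": 1, "blue3": 1, "green1": 1, "green2": 1}
temp_consumers = {"red.tmp": {"blue1", "blue2", "blue3"}}

chosen, chosen_score = best_selection(ready, cores=3, temp_consumers=temp_consumers)
suboptimal_score = score(("blue1", "green1", "green2"), temp_consumers)
```

Under these assumptions the search selects the three blue jobs, which frees red.tmp, whereas the selection from panel (b) runs the same number of jobs but frees nothing. Exhaustive enumeration is only practical for a handful of ready jobs; it is used here purely to make the objective explicit.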
Figure 5. Blockchain-hashing-based between-workflow caching scheme of Snakemake.
If a job is eligible for caching, its code, parameters, raw input files, software environment and the hashes of its dependencies are used to calculate a SHA-256 hash value, under which the output files are stored in a central cache. Subsequent runs of the same job (with the same dependencies) in other workflows can skip the execution and directly take the output files from the cache.
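The chained hashing described for Figure 5 can be sketched in a few lines. This is a minimal illustration, not Snakemake's implementation. The function names, the JSON serialization, the script clean.py, and the example parameters are all assumptions; only the ingredients of the hash (code, parameters, raw input files, software environment, dependency hashes) and the use of SHA-256 come from the caption.

```python
import hashlib
import json


def file_digest(content: bytes) -> str:
    """SHA-256 digest of a raw input file's content."""
    return hashlib.sha256(content).hexdigest()


def job_cache_key(code, params, raw_inputs, environment, dependency_keys):
    """Hash a job together with the cache keys of the jobs it depends on.

    Because dependency keys are part of the hashed record, any change
    upstream propagates into the key of every downstream job.
    """
    record = {
        "code": code,
        "params": params,
        "raw_inputs": sorted(file_digest(blob) for blob in raw_inputs),
        "environment": environment,
        "dependencies": sorted(dependency_keys),
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plot_key_for(raw_table: bytes) -> str:
    """Key of a hypothetical plotting job that depends on a hypothetical cleaning job."""
    clean_key = job_cache_key(
        "clean.py", {"min_population": 1000}, [raw_table], "python=3.11", []
    )
    return job_cache_key(
        "plot-hist.py", {"country": "Germany"}, [], "python=3.11", [clean_key]
    )


key_a = plot_key_for(b"city,population\nBonn,330000\n")
key_b = plot_key_for(b"city,population\nBonn,330000\n")  # identical upstream data
key_c = plot_key_for(b"city,population\nBonn,330001\n")  # raw input changed upstream

# A dictionary stands in for the central cache shared between workflows.
cache = {key_a: "histogram.svg"}
hit = cache.get(key_b)   # same job and dependencies: output can be reused
miss = cache.get(key_c)  # upstream change: job has to be executed again
```

In this sketch the plotting job has no raw inputs of its own, yet its key still changes when the upstream table changes, because the upstream key is hashed into it.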
Figure 6. Job graph partitioning by assigning rules to groups.
Two rules of the example workflow (Figure 3a) are grouped together, (a) spanning one connected component, (b) spanning two connected components, and (c) spanning five connected components. Resulting submitted group jobs are represented as grey boxes.
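The partitioning in Figure 6 can be approximated as follows: restrict the job graph to the jobs of the grouped rules, find its connected components, and bundle a fixed number of components into each submitted group job. This is a sketch under assumptions rather than Snakemake's code. The job names, the five country codes, and the parameter components_per_job are hypothetical.

```python
from collections import defaultdict


def connected_components(jobs, edges):
    """Connected components of the job graph restricted to `jobs` (edges treated as undirected)."""
    neighbors = defaultdict(set)
    for upstream, downstream in edges:
        if upstream in jobs and downstream in jobs:
            neighbors[upstream].add(downstream)
            neighbors[downstream].add(upstream)
    seen, components = set(), []
    for job in sorted(jobs):
        if job in seen:
            continue
        stack, component = [job], []
        seen.add(job)
        while stack:  # depth-first traversal of one component
            current = stack.pop()
            component.append(current)
            for other in sorted(neighbors[current]):
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


def group_jobs(components, components_per_job):
    """Bundle `components_per_job` connected components into each submitted group job."""
    return [
        sorted(job for comp in components[i:i + components_per_job] for job in comp)
        for i in range(0, len(components), components_per_job)
    ]


# Hypothetical job graph: one download job feeds a plot and a convert job per country.
countries = ["at", "ch", "de", "fr", "it"]
edges = []
for c in countries:
    edges.append(("download", f"plot_{c}"))
    edges.append((f"plot_{c}", f"convert_{c}"))

# Only the plot and convert rules are assigned to the group, so "download" is excluded.
grouped = {f"plot_{c}" for c in countries} | {f"convert_{c}" for c in countries}
components = connected_components(grouped, edges)

per_one = group_jobs(components, 1)   # five group jobs, one component each
per_two = group_jobs(components, 2)   # three group jobs (2 + 2 + 1 components)
per_five = group_jobs(components, 5)  # a single group job spanning all components
```

Leaving the download job out of the group is what splits the graph into one component per country; with it included, all jobs would form a single component.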
Figure 7. Workflow composition capabilities of Snakemake.
Single or multiple external workflows can be declared as modules, along with the selection of all or specific rules. Properties of rules can be overwritten, and the analysis can be extended with further rules.
Figure 8. Additional design patterns for Snakemake workflows.
For brevity, only rule properties that are necessary to understand each example are shown (e.g., omitting log directives and shell commands or script directives). (a) Scatter/gather process, (b) streaming, (c) non-file parameters, (d) iteration, (e) sample sheet based configuration, (f) conditional execution, (g) benchmarking, (h) parameter space exploration. See subsection 3.2 for details.
Figure 9. Runtime and memory usage of Snakemake while building the graph of jobs depending on the number of jobs in the workflow.
The Snakemake workflow generating the results along with a self-contained Snakemake report that connects results and provenance information is available at https://doi.org/10.5281/zenodo.4244143.

References

    1. Baker M: 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4. doi: 10.1038/533452a
    2. Mesirov JP: Computer science. Accessible reproducible research. Science. 2010;327(5964):415–6. doi: 10.1126/science.1179653
    3. Munafò MR, Nosek BA, Bishop DVM, et al.: A manifesto for reproducible science. Nat Hum Behav. 2017;1:0021. doi: 10.1038/s41562-016-0021
    4. Afgan E, Baker D, Batut B, et al.: The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46(W1):W537–W544. doi: 10.1093/nar/gky379
    5. Berthold MR, Cebron N, Dill F, et al.: KNIME: the Konstanz Information Miner. In: Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, 2007.