J Integr Bioinform. 2023 Aug 21;20(3):20220059.
doi: 10.1515/jib-2022-0059. eCollection 2023 Sep 1.

SnakeLines: integrated set of computational pipelines for sequencing reads

Jaroslav Budiš et al. J Integr Bioinform.

Abstract

With the rapid growth of massively parallel sequencing technologies, an increasing number of laboratories are utilising sequenced DNA fragments for genomic analyses. Interpretation of sequencing data is, however, strongly dependent on bioinformatics processing, which is often too demanding for clinicians and researchers without a computational background. Another problem is the reproducibility of computational analyses across separate computational centres with inconsistent versions of installed libraries and bioinformatics tools. We propose an easily extensible set of computational pipelines, called SnakeLines, for processing sequencing reads, including mapping, assembly, variant calling, viral identification, transcriptomics, and metagenomics analysis. Individual steps of an analysis, along with the methods and their parameters, can be readily modified in a single configuration file. The provided pipelines are embedded in virtual environments that ensure isolation of required resources from the host operating system, rapid deployment, and reproducibility of analysis across different Unix-based platforms. SnakeLines is a powerful framework for the automation of bioinformatics analyses, with emphasis on simple set-up, modification, extensibility, and reproducibility. The framework is already routinely used in various research projects and their applications, notably in the Slovak national surveillance of SARS-CoV-2.

Keywords: computational pipeline; framework; massively parallel sequencing; reproducibility; virtual environment.
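
The idea of driving a whole analysis from a single configuration file can be illustrated with a minimal Python sketch. It is not the SnakeLines implementation: the names `AVAILABLE_METHODS`, `STEP_ORDER` and `assemble_pipeline`, the configuration keys, and the catalogue of methods are all assumptions made for illustration (fastp, bwa and deepvariant appear in the benchmark configuration name in Figure 3; trimmomatic and bowtie2 are added only as example alternatives).

```python
from typing import Dict, List, Set, Tuple

# Hypothetical catalogue of methods available for each pipeline step.
# This is an illustrative assumption, not the real SnakeLines rule set.
AVAILABLE_METHODS: Dict[str, Set[str]] = {
    "trimming": {"fastp", "trimmomatic"},
    "mapping": {"bwa", "bowtie2"},
    "variant_calling": {"deepvariant"},
}

# Hypothetical fixed execution order of the steps.
STEP_ORDER: List[str] = ["trimming", "mapping", "variant_calling"]


def assemble_pipeline(config: dict) -> List[Tuple[str, str, dict]]:
    """Turn a configuration into an ordered list of (step, method, params)."""
    plan = []
    for step in STEP_ORDER:
        if step not in config:
            continue  # steps absent from the configuration are skipped
        settings = config[step]
        method = settings["method"]
        if method not in AVAILABLE_METHODS[step]:
            raise ValueError(f"unknown method {method!r} for step {step!r}")
        # Everything except the method name is passed on as a parameter.
        params = {key: value for key, value in settings.items() if key != "method"}
        plan.append((step, method, params))
    return plan


# Example configuration; the order of keys does not matter.
config = {
    "mapping": {"method": "bwa", "threads": 4},
    "trimming": {"method": "fastp"},
    "variant_calling": {"method": "deepvariant"},
}

plan = assemble_pipeline(config)
for step, method, params in plan:
    print(step, method, params)
```

In this sketch, swapping a tool or changing a parameter only requires editing the `config` dictionary, which mirrors the single-file modification described in the abstract. The actual framework delegates rule resolution and execution to the Snakemake workflow engine.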

Conflict of interest statement

Competing interests: JB, WK, MK, RH, ML, DS, AB, FD, JG, JR, TS are the employees of Geneton Ltd. that is a provider of bioinformatics services in Slovakia. All remaining authors have declared no conflicts of interest.

Figures

Figure 1:
Standard execution of a SnakeLines pipeline: The user supplies the configuration file and genomic data originating from a sequencing run together with the required set of pipeline-specific files, such as a reference genome. Based on the configuration, SnakeLines identifies required Snakemake rules, bioinformatics tools, and their parameters. The exact steps of the computational pipeline are dynamically assembled and automatically executed in virtual environments using the Snakemake workflow engine. The output of the pipeline is a set of generated genomic files and a set of associated quality reports.
Figure 2:
Basic variant calling pipeline constructed from the user-supplied configuration: (A) SnakeLines runs an analysis on specified FASTQ files (columns Sample 1, Sample 2). Each block of configuration (B)–(G) represents a set of SnakeLines rules that are automatically assembled into computational pipelines using the Snakemake workflow engine. The steps are gradually executed according to the generated workflow. (H) Essential output files are copied to the specified directory at the end of the analysis and users are notified by email messages.
Figure 3:
Benchmarking variant calling pipelines: SnakeLines facilitated the comparison between numerous configurations (only a handful shown for clarity) of variant calling pipelines for single nucleotide variants (right column) and small insertions and deletions (left column). The recall (top row) and precision (bottom row) have been considerably improved (arrows) between the initial (baseline) and the best performing configuration (fastp_bwa_deepvariant_decoy).
Figure 4:
Selected reports generated by the variant calling pipeline for the SARS-CoV-2 national surveillance in Slovakia, implemented in SnakeLines: (A) essential quality control metrics of raw sequencing reads, generated by FastQC [34]; (B) mapping coverage along the reference genome, generated by Qualimap BamQC [35]; (C) the consensus sequence of the analysed SARS-CoV-2 virus, shortened for clarity; (D) phylogenetic assignment of the sequenced virus, generated by Pangolin [71]; (E) detected genomic variants in VCF format, shortened for clarity.
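
The recall and precision used in the Figure 3 benchmark follow the standard definitions: recall = TP / (TP + FN) and precision = TP / (TP + FP), where TP, FP and FN are the true positive, false positive and false negative calls. The following Python sketch shows the arithmetic under the simplifying assumption that a call counts as correct only if it matches a truth-set variant exactly. The function name `precision_recall` and the example variants are made up for illustration and are not taken from the paper.

```python
from typing import Set, Tuple

# A variant is represented as (chromosome, position, reference, alternative).
Variant = Tuple[str, int, str, str]


def precision_recall(called: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Compute precision and recall by exact matching of variant records."""
    true_positives = len(called & truth)   # called and present in the truth set
    false_positives = len(called - truth)  # called but absent from the truth set
    false_negatives = len(truth - called)  # in the truth set but not called
    total_called = true_positives + false_positives
    total_truth = true_positives + false_negatives
    precision = true_positives / total_called if total_called else 0.0
    recall = true_positives / total_truth if total_truth else 0.0
    return precision, recall


# Made-up example data: 4 true variants, 5 calls, 3 of which are correct.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr2", 900, "TC", "T"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr1", 777, "G", "C"),
    ("chr2", 500, "A", "T"),
}

precision, recall = precision_recall(called, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Exact matching is a simplification: the same insertion or deletion can be written in more than one equivalent way, so a real comparison would normalise variant representation before counting.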

References

    1. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, du Sert NP, et al. A manifesto for reproducible science. Nat Human Behav. 2017;1:0021. doi: 10.1038/s41562-016-0021.
    2. Leipzig J. A review of bioinformatic pipeline frameworks. Briefings Bioinf. 2017;18:530–6. doi: 10.1093/bib/bbw020.
    3. Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Cech M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46:W537–44. doi: 10.1093/nar/gky379.
    4. Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, et al. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res. 2013;41:W557–61. doi: 10.1093/nar/gkt328.
    5. Cingolani P, Sladek R, Blanchette M. BigDataScript: a scripting language for data pipelines. Bioinformatics. 2015;31:10–6. doi: 10.1093/bioinformatics/btu595.