PLoS Comput Biol. 2017 May 25;13(5):e1005425.
doi: 10.1371/journal.pcbi.1005425. eCollection 2017 May.

Jupyter and Galaxy: Easing entry barriers into complex data analyses for biomedical researchers

Björn A Grüning et al. PLoS Comput Biol.

Abstract

What does it take to convert a heap of sequencing data into a publishable result? First, common tools are employed to reduce primary data (sequencing reads) to a form suitable for further analyses (i.e., the list of variable sites). The subsequent exploratory stage is much more ad hoc and requires the development of custom scripts and pipelines, making it problematic for biomedical researchers. Here, we describe a hybrid platform combining common analysis pathways with the ability to explore data interactively. It aims to fully encompass and simplify the "raw data-to-publication" pathway and make it reproducible.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of steps involved in performing analyses outlined in Examples 1 and 2.
A. Example 1. The right (green) side of the Galaxy interface is the history pane. The analysis begins with uploading three Illumina datasets (datasets 1–3) and a reference genome sequence (dataset 4). The datasets are mapped to the reference genome with bwa-mem (datasets 5–7) and read groups are assigned (datasets 8–10). This allows the resulting BAM datasets to be merged into a single BAM file (dataset 11). At this point, the Jupyter IE is launched. The lower part of the notebook is visible in the center pane, showing the read coverage distribution for the three isolates (three different colors).
B. A similar screenshot for Example 2. Here, Illumina reads for two RNA-seq replicates from wildtype and snf2 knock-out are mapped against the Drosophila melanogaster genome (dm3) using the HiSat split mapper. Next, HTSeq-count takes the BAM datasets generated by HiSat and, using the gene annotation for the dm3 genome downloaded from the UCSC Table Browser (history dataset 9), computes per-gene read counts. These counts are then imported into Jupyter (center pane) to perform normalization and variance shrinkage calculations using Bioconductor's DESeq2 package.
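The normalization step mentioned for Example 2 can be illustrated with a minimal NumPy sketch of median-of-ratios size factors, the approach DESeq2 is generally described as using. This is not the package's code or the authors' notebook: the function name `size_factors` and the toy count matrix are hypothetical, and the variance shrinkage step is not covered.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Use only genes with a nonzero count in every sample, so logs are finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # The per-gene log geometric mean across samples acts as a pseudo-reference.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Each sample's factor is its median ratio to that reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Hypothetical toy counts: 4 genes x 2 samples; sample 2 is sequenced twice as deep.
toy = np.array([[10, 20], [100, 200], [5, 10], [0, 3]])
factors = size_factors(toy)
normalized = toy / factors  # divide each sample's column by its size factor
```

After dividing by the size factors, genes that differ only by sequencing depth end up with matching values across the two samples.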
Fig 2. Reanalysis of data from [14] using Galaxy and Jupyter.
A. Workflow used in the analysis. As input, the workflow takes a collection of paired Illumina datasets and outputs an unfiltered list of variable sites.
B. Galaxy history showing all steps of these analyses. It contains only 12 steps because we use dataset collections to combine multiple similar datasets into a small number of history entries. This significantly simplifies processing. For example, collection 313 contains all 312 paired-end Illumina datasets generated for this study. This allows us to deal with just one history item instead of 312. The next item in the history is a collection of BAM datasets generated by mapping each read pair from collection 313 against the human genome (hg38) with bwa-mem. These BAM datasets are de-duplicated (collection 627), filtered (by retaining only reads that map to mitochondrial DNA, have a mapping quality of 20 or higher, and are mapped in a proper pair; collection 941), realigned to mitigate misalignment around indels or structural variant calls (collection 1098), and used to call variants with Naive Variant Caller [21]. Finally, we use Variant Annotator to process the VCF datasets generated by Naive Variant Caller and to create a list of variants (collection 1412), and the concatenation tool to reduce collection 1412 into a single table (dataset 1413). This dataset is used for further processing with Jupyter.
C. The relationship of minor allele frequencies for heteroplasmic sites between tissues (panels A and B) and individuals (panels C and D).
D. Estimates for bottleneck size with (red) and without (blue) accounting for mitotic segregation.
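The three read filters described in panel B can be sketched in plain Python as follows. This is an illustration rather than the Galaxy tool actually used: the function name `keep_read`, the contig name `chrM` (the UCSC-style mitochondrial name, assumed here for hg38), and the toy read tuples are assumptions. The proper-pair test relies on bit 0x2 of the SAM flag field.

```python
PROPER_PAIR = 0x2  # SAM flag bit: each segment properly aligned

def keep_read(chrom, mapq, flag, mito_name="chrM", min_mapq=20):
    """Return True if a read passes all three filters from the caption."""
    on_mito = chrom == mito_name           # maps to mitochondrial DNA
    good_mapq = mapq >= min_mapq           # mapping quality of 20 or higher
    in_pair = bool(flag & PROPER_PAIR)     # mapped in a proper pair
    return on_mito and good_mapq and in_pair

# Hypothetical reads as (reference name, MAPQ, SAM flag) tuples.
reads = [
    ("chrM", 60, 99),  # passes all filters
    ("chrM", 10, 99),  # fails: low mapping quality
    ("chr1", 60, 99),  # fails: not mitochondrial
    ("chrM", 60, 65),  # fails: paired, but not a proper pair
]
kept = [r for r in reads if keep_read(*r)]
```

Only the first toy read survives, since each of the others fails exactly one criterion.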

References

    1. Fleury V, Gouyet JF, Leonetti M. Branching in Nature. Dynamics and Morphogenesis of Branching Structures, from Cell to River Networks. Springer Science & Business Media; 2013. Available from: http://books.google.com/books?id=WKXyCAAAQBAJ&pg=PR6&dq=intitle:branchin....
    1. van der Walt S, Colbert SC, Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation. Comput Sci Eng. 2011;13(2):22–30.
    1. Jones E, Oliphant T, Peterson P. SciPy: Open source scientific tools for Python, 2001-2008b;. Available from: https://www.scipy.org/
    1. Hunter JD. Matplotlib: A 2D Graphics Environment. Comput Sci Eng. 2007;9(3):90–95.
    1. Sloggett C, Goonasekera N, Afgan E. BioBlend: automating pipeline analyses within Galaxy and CloudMan. Bioinformatics. 2013;29(13):1685–1686. doi: 10.1093/bioinformatics/btt199
