Nat Biotechnol. 2017 Apr 11;35(4):314-316.
doi: 10.1038/nbt.3772.

Toil enables reproducible, open source, big biomedical data analyses

John Vivian et al. Nat Biotechnol. 2017.
No abstract available

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

The authors declare competing financial interests: details are available in the online version of the paper.

Figures

Figure 1
RNA-seq pipeline and expression concordance. (a) Dependency graph of the RNA-seq pipeline we developed (named CGL). CutAdapt was used to remove extraneous adapters, STAR was used for alignment and read coverage, and RSEM and Kallisto were used to produce quantification data. (b) Scatter plot showing the Pearson correlation between the results of the TCGA best-practices pipeline and the CGL pipeline. 10,000 randomly selected sample and/or gene pairs were subset from the entire TCGA cohort and their normalized counts were plotted against each other; this process was repeated five times with no change in Pearson correlation. Counts are in units of log2(norm_counts + 1).
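The concordance check described in panel (b) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`pearson`, `log2_norm`, `subsampled_pearson`), the dictionary layout keyed by (sample, gene), and the synthetic counts are all hypothetical. Only the procedure (random subset of pairs, log2(norm_counts + 1) transform, Pearson correlation, repeated draws) comes from the caption.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log2_norm(count):
    """Transform a normalized count to log2(norm_counts + 1)."""
    return math.log2(count + 1.0)


def subsampled_pearson(counts_a, counts_b, n_pairs, rng):
    """Correlate two pipelines on a random subset of (sample, gene) pairs.

    counts_a and counts_b map (sample, gene) keys to normalized counts;
    this layout is an assumption made for the sketch.
    """
    shared = sorted(counts_a.keys() & counts_b.keys())
    chosen = rng.sample(shared, min(n_pairs, len(shared)))
    xs = [log2_norm(counts_a[key]) for key in chosen]
    ys = [log2_norm(counts_b[key]) for key in chosen]
    return pearson(xs, ys)


# Synthetic stand-in data (hypothetical): pipeline B is pipeline A plus noise.
rng = random.Random(0)
pipeline_a = {}
pipeline_b = {}
for s in range(20):
    for g in range(100):
        key = (f"sample{s}", f"gene{g}")
        value = rng.expovariate(1 / 200.0)
        pipeline_a[key] = value
        pipeline_b[key] = max(0.0, value * (1.0 + rng.gauss(0.0, 0.05)))

# Repeat the random draw five times, as the caption describes.
for draw in range(5):
    r = subsampled_pearson(pipeline_a, pipeline_b, 500, rng)
    print(f"draw {draw + 1}: Pearson r = {r:.4f}")
```

Repeating the draw with fresh random subsets, as in the loop above, is a simple way to check that the reported correlation does not depend on which pairs happened to be selected.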
Figure 2
Costs and core usage. (a) Scaling tests were run to ascertain the price per sample at varying cluster sizes for the different analysis methods. TCGA (red) shows the cost of running the TCGA best-practices pipeline re-implemented as a Toil workflow (for comparison). CGL-One-Sample/Node (cyan) shows the cost of running the revised Toil pipeline, one sample per node. CGL (blue) denotes the pipeline running samples across many nodes. CGL-Spot (green) is the same as CGL, but denotes the pipeline run on the Amazon spot market. The slight rise in cost per sample at 32,000 cores was due to two factors: aggressive instance provisioning, which directly affected the spot price (dotted line), and the saving of BAM and bedGraph files for each sample. (b) Number of cores tracked during the recompute. The two red circles indicate where all worker nodes were terminated and restarted shortly thereafter.
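As a rough illustration of what "price per sample" means in panel (a), the sketch below divides total compute spend by the number of samples processed. The function name `cost_per_sample`, its parameters, and every number in the usage lines are hypothetical and are not taken from the paper; the caption does not state how the authors computed the figure, so this is one plausible accounting only.

```python
def cost_per_sample(n_nodes, price_per_node_hour, wall_hours, n_samples):
    """Total compute spend divided by the number of samples processed.

    Assumes every node is billed at the same hourly price for the whole
    run, which is a simplification (spot prices vary over time).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return n_nodes * price_per_node_hour * wall_hours / n_samples


# Hypothetical figures: the same run priced at two different hourly rates,
# to show how a higher spot price raises the cost per sample.
baseline = cost_per_sample(n_nodes=100, price_per_node_hour=0.50,
                           wall_hours=10.0, n_samples=400)
pricier = cost_per_sample(n_nodes=100, price_per_node_hour=0.60,
                          wall_hours=10.0, n_samples=400)
print(f"baseline: ${baseline:.2f} per sample")
print(f"higher spot price: ${pricier:.2f} per sample")
```

Under this accounting, anything that raises the hourly price or lengthens the run for the same number of samples (for example, extra time spent writing per-sample output files) increases the cost per sample.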
