F1000Res. 2017 Dec 21:6:2162.
doi: 10.12688/f1000research.13049.2. eCollection 2017.

RSEQREP: RNA-Seq Reports, an open-source cloud-enabled framework for reproducible RNA-Seq data processing, analysis, and result reporting

Travis L Jensen et al. F1000Res.

Abstract

RNA-Seq is increasingly used to measure human RNA expression on a genome-wide scale. Expression profiles can be interrogated to identify and functionally characterize treatment-responsive genes. Ultimately, such controlled studies promise to reveal insights into molecular mechanisms of treatment effects, identify biomarkers, and realize personalized medicine. RNA-Seq Reports (RSEQREP) is a new open-source cloud-enabled framework that allows users to execute start-to-end gene-level RNA-Seq analysis on a preconfigured RSEQREP Amazon Machine Image (AMI) hosted by AWS or on their own Ubuntu Linux machine via a Docker container or installation script. The framework works with unstranded, stranded, and paired-end sequence FASTQ files stored locally, on Amazon Simple Storage Service (S3), or at the Sequence Read Archive (SRA). RSEQREP automatically executes a series of customizable steps including reference alignment, CRAM compression, reference alignment QC, data normalization, multivariate data visualization, identification of differentially expressed genes, heatmaps, co-expressed gene clusters, enriched pathways, and a series of custom visualizations. The framework outputs a file collection that includes a PDF report dynamically generated using R, knitr, and LaTeX, as well as publication-ready table and figure files. A user-friendly configuration file handles sample metadata entry, processing, analysis, and reporting options. The configuration supports time series RNA-Seq experimental designs with at least one pre- and one post-treatment sample for each subject, as well as multiple treatment groups and specimen types. All RSEQREP analysis components are built using open-source R code and R/Bioconductor packages, allowing for further customization.
As a use case, we provide RSEQREP results for a trivalent influenza vaccine (TIV) RNA-Seq study that collected 1 pre-TIV and 10 post-TIV vaccination samples (days 1-10) for 5 subjects and two specimen types (peripheral blood mononuclear cells and B-cells).

Keywords: RNA-Seq; RSEQREP; cloud computing; differential gene expression; pathway enrichment; reproducible research; transcriptomics; trivalent influenza vaccine.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. RNA-Seq Reports (RSEQREP) implementation overview.
RSEQREP provides a reproducible start-to-end analysis solution for RNA-Seq data by automating (1) reference dataset initialization/download, (2) RNA-Seq data processing, (3) RNA-Seq analysis, and (4) reporting, including a summary PDF report and publication-ready table and figure files. Steps can be run in a modular fashion, and key computational metrics are tracked in a SQLite database. The software runs on a pre-configured RSEQREP AMI or on a local Ubuntu Linux machine. Users can customize individual steps and enter their experimental design information via an Excel configuration file.
Figure 2. Global gene expression pattern analysis to identify outliers and batch effects (influenza vaccine case study).
RSEQREP supports multivariate visualizations, including principal component analysis (PCA), to visualize key trends in the data. The analysis uses standardized log2 counts per million (mapped reads) as input, restricted to genes that met the low expression cutoff. As shown for the influenza case study, the PCA indicated that PBMC (highlighted in red) and B-cell (highlighted in blue) samples differ substantially in their transcriptional profiles. In addition, two samples were identified as outliers relative to the other samples (marked with blue circles). Ellipses represent the 95% confidence interval for the bivariate mean based on the first two principal components by specimen type.
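The input preparation described in this caption can be sketched as follows. This is a minimal illustration using only the Python standard library, not RSEQREP's own R code. The function names, the low expression rule (`min_cpm` in at least `min_samples` samples), the pseudocount, and the toy counts are all assumptions made for the example; the caption does not specify them.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def pca_input(counts, min_cpm=1.0, min_samples=2, pseudo=1.0):
    """counts: dict mapping gene -> list of raw read counts, one per sample.

    Returns standardized log2 CPM per gene for genes passing the
    (hypothetical) low expression filter. Assumes every sample has a
    non-zero library size.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    prepared = {}
    for gene, row in counts.items():
        cpm = [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        # Hypothetical filter: keep genes with CPM >= min_cpm in enough samples.
        if sum(v >= min_cpm for v in cpm) < min_samples:
            continue
        prepared[gene] = standardize([math.log2(v + pseudo) for v in cpm])
    return prepared

# Toy counts for illustration only; not data from the case study.
toy_counts = {"geneA": [10, 20, 30], "geneB": [0, 0, 0], "geneC": [90, 80, 70]}
prepared = pca_input(toy_counts)
```

The resulting gene-by-sample matrix is the kind of input a PCA routine would then decompose; the decomposition itself is omitted here.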
Figure 3. UpSet plots to summarize key differentially expressed (DE) gene time trends (influenza vaccine case study).
These panels summarize the DE gene overlap between post-treatment days for up- or down-regulated DE genes (shown on the left in black), for up-regulated DE genes (shown in the middle in red), and for down-regulated DE genes (shown on the right in blue), within each specimen type (B-cells are shown in the top row, PBMCs in the bottom row). In each panel, the bottom left horizontal bar graph labeled SDEG Set Size shows the total number of DE genes per post-treatment time point. The circles in each panel's matrix correspond to the sections of a Venn diagram (unique and overlapping DE genes). Connected circles indicate a specific intersection of DE genes between post-treatment days. The top bar graph in each panel summarizes the number of DE genes for each unique or overlapping combination. In the top left panel, for example, the first vertical bar shows the DE genes that are unique to day 6 (169 DE genes). The second shows the DE genes that are shared only between days 6 and 7 (124 DE genes). The third shows the DE genes that are shared only between days 6, 7, and 8 (72 DE genes), and so forth. As shown for the influenza case study, most of the DE genes for B-cells were detected at, and overlapped between, days 5, 6, 7, or 8, while most of the DE genes for PBMCs were uniquely identified at day 1.
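The bar heights in an UpSet plot are counts of exclusive intersections: each gene is assigned to exactly one combination of days, namely the full set of days on which it is DE. A minimal sketch of that counting, assuming DE genes are available as one set per day (the function name and gene identifiers are hypothetical):

```python
from collections import Counter

def exclusive_intersections(de_sets):
    """de_sets: dict mapping a label (e.g. 'day6') -> set of DE gene IDs.

    Returns a Counter keyed by frozenset of labels, giving the number of
    genes that are DE in exactly that combination of labels and no others.
    """
    membership = {}
    for label, genes in de_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(label)
    return Counter(frozenset(labels) for labels in membership.values())

# Toy gene sets for illustration only; not results from the case study.
toy_sets = {
    "day6": {"g1", "g2", "g3"},
    "day7": {"g2", "g3"},
    "day8": {"g3"},
}
combo_sizes = exclusive_intersections(toy_sets)
set_sizes = {label: len(genes) for label, genes in toy_sets.items()}
```

Here `combo_sizes` corresponds to the top bar graph of a panel and `set_sizes` to the horizontal set size bars.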
Figure 4. Heatmaps for visualizing pathway enrichment over time (influenza vaccine case study).
Reactome pathways that were enriched in at least two conditions are shown. Cells are color-coded by enrichment score: -log10(FDR-adjusted p-value). Cell values represent the number of DE genes that overlap with a given pathway. Numbers in brackets indicate enriched pathways, i.e., pathways that met the specified FDR-adjusted p-value cutoff. Pathways were clustered based on enrichment score. As shown for the influenza case study, pathways related to the cell cycle as well as protein metabolism were enriched in B-cells at day 6. Both B-cells and PBMCs showed an enrichment of interferon signaling-related pathways at day 1.
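The enrichment score defined above can be computed as in the sketch below. The caption does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as are the function names and the `floor` guard against taking the logarithm of zero.

```python
import math

def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR adjustment (assumed; the source does not name the method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def enrichment_scores(pvalues, floor=1e-300):
    """Enrichment score per pathway: -log10(FDR-adjusted p-value)."""
    return [-math.log10(max(q, floor)) for q in bh_adjust(pvalues)]

# Toy p-values for illustration only; not results from the case study.
toy_pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(toy_pvalues)
scores = enrichment_scores(toy_pvalues)
```

A pathway would then count as enriched when its adjusted p-value falls below the user-specified cutoff.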
Figure 5. Co-expressed gene cluster time trends (influenza vaccine case study).
RSEQREP supports unsupervised multiscale bootstrap resampling to identify co-expressed gene clusters based on their log2 fold change pattern over time. A subset of trends is shown for the influenza case study. Several co-expressed immunoglobulin genes reached peak log2 fold changes compared to pre-treatment between days 6 and 8, while a cluster of interferon-induced antiviral (IFIT) genes showed an earlier peak in log2 fold change at day 1 in addition to a peak at day 8 in PBMCs.
Figure 6. Wall clock time benchmarks for RNA-Seq pre-processing steps by AWS EC2 instance type.
Metrics are based on 110 influenza case study RNA-Seq samples. The following instance types were used: c3.xlarge (4 vCPUs, 7.5 GiB Mem), c3.2xlarge (8 vCPUs, 15 GiB Mem), c3.4xlarge (16 vCPUs, 30 GiB Mem), c3.8xlarge (32 vCPUs, 60 GiB Mem). Median wall clock time is summarized as tracked in the RSEQREP SQLite database. The biggest relative reduction in wall clock time across processes was observed when switching from the 4 vCPU to the 8 vCPU instance type (c3.xlarge vs. c3.2xlarge). Instances with more cores (16 and 32 vCPUs) further reduced wall clock time for reference alignment (HISAT2) and gene expression quantification (Subread), but the additional reduction was smaller.
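A summary of this kind could be derived from per-sample timings as sketched below. The table name `process_metrics`, its columns, and the timing values are hypothetical; the caption does not describe the actual RSEQREP SQLite schema, and the numbers are placeholders rather than benchmark results.

```python
import sqlite3
from statistics import median

def median_wall_clock(conn):
    """Median wall clock seconds per (instance_type, step).

    Assumes a hypothetical table process_metrics(instance_type, step, wall_seconds).
    """
    rows = conn.execute(
        "SELECT instance_type, step, wall_seconds FROM process_metrics"
    ).fetchall()
    grouped = {}
    for instance_type, step, seconds in rows:
        grouped.setdefault((instance_type, step), []).append(seconds)
    return {key: median(values) for key, values in grouped.items()}

def relative_reduction(slow, fast):
    """Fraction of wall clock time saved when moving from slow to fast."""
    return (slow - fast) / slow

# In-memory database with placeholder timings (not measurements from the paper).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE process_metrics (instance_type TEXT, step TEXT, wall_seconds REAL)"
)
conn.executemany(
    "INSERT INTO process_metrics VALUES (?, ?, ?)",
    [
        ("c3.xlarge", "alignment", 100.0),
        ("c3.xlarge", "alignment", 120.0),
        ("c3.xlarge", "alignment", 110.0),
        ("c3.2xlarge", "alignment", 50.0),
        ("c3.2xlarge", "alignment", 60.0),
        ("c3.2xlarge", "alignment", 55.0),
    ],
)
medians = median_wall_clock(conn)
reduction = relative_reduction(
    medians[("c3.xlarge", "alignment")], medians[("c3.2xlarge", "alignment")]
)
```

Comparing the relative reduction between successive instance sizes is what identifies the step with the largest gain.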
