Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Nov 29;22(1):323.
doi: 10.1186/s13059-021-02533-6.

recount3: summaries and queries for large-scale RNA-seq expression and splicing

Affiliations

recount3: summaries and queries for large-scale RNA-seq expression and splicing

Christopher Wilks et al. Genome Biol. .

Abstract

We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio .

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests

Figures

Fig. 1
Fig. 1
Overview of resource and grid design. a We use the Monorail system (detailed in panel b) to analyze and summarize data from archived sequencing datasets, yielding the recount3 and Snaptron resources. recount3 consists of five types of data summaries — Quality Control (QC), gene-level quantification, exon-level quantification, junction counts, and per-base coverages — packaged into tables with associated metadata and organized at the study level. The Snaptron resource consists of the same summaries, but organized at the level of a “collection,” where a single collection might include many studies, and indexed in a way that permits fast queries. Bottom: a few common ways that users can query and interact with the resources. QC, gene and exon-level summaries are compiled using gene annotations, but junction and bigWig coverage summaries are compiled without use of an annotation. b Illustration of Monorail’s grid design, split into Orchestration, Analysis and Aggregation layers. Appendix: Figs. 6, 7, 8, 9 and Appendix: Monorail details contain details about Monorail and its analysis and aggregation components
Fig. 2
Fig. 2
Annotation status and cumulative arrival of exon-exon splice junctions. a The number of exon-exon junctions present in at least x human runs accessions. Colors represent different degrees to which the junctions are annotated: completely annotated (purple) both donor and acceptor annotated but not in combination with each other (exon skip, blue), only donor or only acceptor annotated (one annotated, green), and neither annotated (red). Curves are cumulative across categories; thus, the number of unannotated junctions is given by the difference between the purple and red lines. b The same as (a) but for mouse run accessions. c The cumulative number of exon-exon junctions (in millions) across time for four different thresholds for the number of human run accessions where a junction is observed. d The same as (c) but for mouse run accessions
Fig. 3
Fig. 3
Cell-type enrichment of unannotated exon-exon junctions. Heatmap showing the cell-type enrichment results for novel exon-exon junctions similar to the one performed in the ASCOT study [24]. The X-axis shows the bins of the -10 log p-values for cell type specificity using a Mann-Whitney U test comparing coverage within a cell type to coverage in all cell types. Each of the cell types evaluated are shown on the Y-axis with the percent of exon-exon junctions annotated denoted by a color gradient from dark blue (0%) to light yellow (50%) to dark red (100%). The row and column means are shown with the same color scale. The column mean, Col.Mean, shows that the average percent of annotated exon-exon junctions decreases as the cell type specificity increases on the X-axis
Fig. 4
Fig. 4
Non-coding and unannotated transcription. a Expression profiles across GTEx v8 tissues. Density and boxplots showing the tissue specificity of human coding and non-coding mRNAs from the FANTOM-CAT annotation [27] using data from GTEx v8. The specificity is based on the entropy computed from the median expression of each gene across the tissue types, as was done previously. b Evaluation of un-annotated genomic intervals powered by the base-pair coverage data in recount3. Previous work [28] identified un-annotated genomic intervals that are expressed, conserved and overall not artifacts. Using the base-pair coverage data we can validate these genomic intervals compared to 99 sets of randomly selected intervals with the same length distributions matched by chromosome. This panel shows the density for the distribution, across the 99 random sets of intervals, of the mean sum of expression across base-pairs for intervals that are present in at least 500 samples. The red vertical line denotes the observed value from the intervals by [28]
Fig. 5
Fig. 5
Gene expression variation. Principal component analysis of protein coding genes. a All human, bulk samples (165,880 samples) with color indicating the library size. b As (a) but only including samples with a labeled tissue (41,218 samples) colored by tissue. c As (a) but only including blood samples (4731 samples) with color differentiating blood and lymphoblastoid cell lines (LCL) samples. Appendix: Fig 12 shows the variation across bulk and single-cell RNA-seq. The percentage of variance explained by the principal component (PC) is given in the axis labels
Fig. 6
Fig. 6
The Monorail relational database model. Rectangles denote tables and arcs denote the key relationships between tables. This image was created using the sqlalchemy_schemadisplay package. See also Fig. 1 and Monorail details
Fig. 7
Fig. 7
Monorail workflow parallelism. Illustration of how Monorail ensures parallelism both across nodes of the cluster (node manager launching many simultaneous runners) and across cores on a single cluster node (runner launching many simultaneous processes). Figure 8 shows a single Monorail analysis workflow. See also Fig. 1 and Monorail details
Fig. 8
Fig. 8
Monorail analysis workflow. Illustration of key steps and key inputs and outputs of the Monorail analysis workflow. Not shown: reference, gene annotation and index files are also stored on the local filesystem and loaded by tools like STAR and featureCounts as needed. The workflow is driven by Snakemake and runs within a Singularity container on a cluster node. Figure 7 illustrates how many such workflows run in parallel on a cluster. See also Fig. 1 and Monorail details
Fig. 9
Fig. 9
Monorail aggregation workflow. The workflow runs on a single computer, taking all run-level analysis outputs as its input. Its outputs are the study- and collection-level objects and indexes hosted by recount3 and Snaptron. The disjoint exonic interval summaries furnish raw counts that can be summed into both exon- and gene-level counts. The “add motifs” step for junctions adds the specific donor and acceptor dinucleotide patterns present in the reference in genome for that junction. For the “add annotation status” step for junctions, an inclusive set of gene annotations is used, as detailed in Reference files. Not pictured: metadata is also aggregated and hosted by recount3 and Snaptron. Also not pictured: bigWig files are organized into per-study subdirectories on the host filesystem. See also Fig. 1 and Monorail details
Fig. 10
Fig. 10
Dashboard at peak throughput. Example of how the Monorail dashboard captured an instance of peak throughput when using 24 out of our maximum allocated 25 simultaneous Skylake nodes on the Stampede2 supercomputer at the Texas Advanced Computer Center. The red box indicates the hour when we achieved peak throughput of around 3.63 TB per hour. The red arrow points to the data point in the “In Throughput” dashboard element showing that measurement. The green box and arrow indicate that, throughout this time period, the fraction of the wall clock time spent on the download step of the workflow stayed steady at about 15–19%, indicating that downloading from the Sequence Read Archive was not a particular bottleneck during this approximately 12-h period. See also Monorail cost and throughput
Fig. 11
Fig. 11
Dashboard at input bottleneck. Example of dashboard during a period where downloading from the SRA became a bottleneck. The blue rectangles highlight a period where Input Throughput was low (left panel, middle plot) and the fraction of the wall clock time spent on the download step of the workflow climbed to over 90%, staying there for a few hours before the bottleneck resolved. See also Monorail cost and throughput
Fig. 12
Fig. 12
Differentiating bulk and single-cell RNA-seq data. a The distribution of zeroes for different types of labeled human data. bd PCA of human data colored by b library size; c curated labels; d predicted and curated labels. e Percent of zeroes from labeled mouse data. f, g Predicted labels from mouse data f bulk, g single-cell. The percentage of variance explained by the PC is given in the axis labels. Legends for (a-d): Bulk, manual — manually curated bulk samples; Sc, manual — manually curated single-cell samples; Sc, text mining — single-cell samples labeled by text analysis ; Small-RNA, manual — manually curated samples which were identified as small-RNA-seq; Other, manual — manually curated samples which were identified as other assay such as ribosome profiling; Unlabeled — all other samples before we differentiated bulk and single-cell samples using the percentage of zero as the predictor; Predicted Bulk — predicted bulk samples using the predictor; Predicted Sc — predicted single-cell samples using the predictor. Before doing PCA, we removed some samples which we believe are not RNA sequencing samples, which accounts for the sample number difference between (a) and (bd)
Fig. 13
Fig. 13
Principal component analysis of protein coding genes of bulk samples faceted by tissue types. This figure is tissue type faceted version of Fig. 5. Each panel highlights one tissue type and display all other cells as gray points in the back. The percentage of variance explained by the PC is given in the axis labels
Fig. 14
Fig. 14
Variation of hemoglobin genes. a The sum of 3 hemoglobin genes (HBA1, HBA2, HBB) is correlated with PC1. PCC: Pearson Correlation Coefficient. Blood: manually curated blood samples; GTEx: GTEx blood samples; LCL: lymphoblastoid cell lines; WB: manually curated whole blood samples; Other: all other samples, which are colored as gray. b Similar to (a), but now we cut the log2(CPM sum) into four groups. The box shows the median, upper quartile and lower quartile of PC1 values for each group
Fig. 15
Fig. 15
Principal component analysis of human bulk lncRNA data. This figure is similar to Fig. 5, but for lncRNA data. a All human, bulk samples (207,417 samples) with color indicating the library size. b As (a) but only including samples with a labeled tissue (45,077 samples) colored by tissue. c As (a) but only including blood samples (5966 samples) with color differentiating blood and lymphoblastoid cell lines (LCL) samples. The percentage of variance explained by the principal component (PC) is given in the axis labels. Note that the sample size is different to the protein coding gene data because we did separate filtering
Fig. 16
Fig. 16
Principal component analysis of protein coding genes of single-cell samples. Similar to Figure ??, but we are now showing the top 2 PCs of protein coding genes of all human single cell samples (122,776 samples). a All human single cell samples are colored by the library size. b As (a), but only including samples with a cell type label inferred by SingleR package and colored by the cell type (121,837 samples). The percentage of variance explained by the PC is given in the axis labels

References

    1. Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet. 2018;19(5):325. - PubMed
    1. Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun. 2018;9(1):1366. - PMC - PubMed
    1. Ziemann M, Kaspi A, El-Osta A. Digital expression explorer 2: a repository of uniformly processed RNA sequencing data. GigaScience. 2019; 8(4). 10.1093/gigascience/giz022. - PMC - PubMed
    1. Tatlow PJ, Piccolo SR. A cloud-based workflow to quantify transcript-expression levels in public cancer compendia. Sci Rep. 2016;6:39259. - PMC - PubMed
    1. Vivian J, Rao AA, Nothaft FA, Ketchum C, Armstrong J, Novak A, Pfeil J, Narkizian J, Deran AD, Musselman-Brown A, Schmidt H, Amstutz P, Craft B, Goldman M, Rosenbloom K, Cline M, O’Connor B, Hanna M, Birger C, Kent WJ, Patterson DA, Joseph AD, Zhu J, Zaranek S, Getz G, Haussler D, Paten B. Toil enables reproducible, open source, big biomedical data analyses. Nat Biotechnol. 2017;35(4):314–316. - PMC - PubMed

Publication types

LinkOut - more resources