Sci Data. 2025 Jul 2;12(1):1134.
doi: 10.1038/s41597-025-05376-z.

Consistently processed RNA sequencing data from 50 sources enriched for pediatric data

Holly C Beale et al. Sci Data.

Abstract

Larger cohorts improve the power of tumor gene expression analysis, but the signal is muddied if datasets are processed using different methods or have inaccurate metadata. Here we present five compendia containing consistently processed gene expression data derived from 16,446 diverse RNA sequencing datasets. To create the compendia, we obtained access to RNA sequence data from repositories containing public data as well as clinical partners with access to non-published data. We then assessed the quality, quantified gene expression, harmonized clinical metadata, and released the expression values and metadata without access restrictions. These datasets have been used for diverse projects ranging from identifying similarities between tumor types to assessing how well cell lines recapitulate tumors. They have also been used for n-of-1 analysis to identify genes with unusual expression patterns in a single sample and to infer molecular diagnosis. The comparison to new data is enabled by our dockerized, freely available pipeline. The compendia have been cited in at least 20 publications.
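As an illustration of the n-of-1 idea mentioned above, the following minimal sketch flags genes whose expression in a single sample is unusually high relative to a reference cohort. It is not the authors' method: the Tukey upper-fence rule (Q3 + 1.5 × IQR), the function names `tukey_upper_fence` and `flag_high_outliers`, the gene labels, and the toy expression values are all assumptions made for this example.

```python
from statistics import quantiles


def tukey_upper_fence(values, k=1.5):
    """Return Q3 + k * IQR for a list of expression values (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def flag_high_outliers(cohort, sample, k=1.5):
    """Return genes whose value in `sample` exceeds the cohort's upper fence.

    cohort: dict mapping gene name -> list of expression values in the cohort
    sample: dict mapping gene name -> expression value in the single sample
    Genes missing from the cohort are skipped.
    """
    return sorted(
        gene
        for gene, value in sample.items()
        if gene in cohort and value > tukey_upper_fence(cohort[gene], k)
    )


# Hypothetical toy data; real compendia hold thousands of datasets per gene.
cohort = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 2.5, 3.0, 3.5, 4.0],
}
sample = {"GENE_A": 4.5, "GENE_B": 9.0}

outliers = flag_high_outliers(cohort, sample)
print(outliers)  # ['GENE_B']
```

In this toy case the upper fence is 7.0 for GENE_A and 5.0 for GENE_B, so only GENE_B is flagged. The value of a large, consistently processed cohort is that such thresholds reflect biology rather than differences in processing.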

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Distribution of ages in compendia. The three most common diseases in pediatric, adolescent and young adult (pedaya, age <= 30) datasets in each compendium are specified by color. Datasets without associated ages (n = 974) are excluded from this plot.
Fig. 2
Steps for assembling a gene expression data compendium derived from RNA sequencing.
Fig. 3
TumorMap visualization of datasets in the PolyA Tumor compendium version 11. (a) All 12,747 datasets; each point represents one dataset. Position is based on similarity of gene expression. Colors indicate the diagnosis of the donor. The circled group are mostly synovial sarcomas. (b) Synovial sarcoma and related datasets (red are synovial sarcoma; gray are other diseases). (c) Study sources of the synovial sarcoma datasets: SRP126664 (brown), phs000178 (light blue), phs000673.v2.p1 (green), data from two unrelated collaborators that was unpublished at the time of the compendium release (blue).
