Sci Data. 2018 Apr 17:5:180061.
doi: 10.1038/sdata.2018.61.

Unifying cancer and normal RNA sequencing data from different sources

Qingguo Wang et al. Sci Data. 2018.

Abstract

Driven by recent advances in next generation sequencing (NGS) technologies and an urgent need to decode complex human diseases, many large-scale studies, such as the Genotype-Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA), have produced an unprecedented volume of whole-transcriptome sequencing (RNA-seq) data. While these data offer new opportunities to identify the mechanisms underlying disease, comparing data from different sources remains challenging because of differences in sample and data processing. Here, we developed a pipeline that processes and unifies RNA-seq data from different studies through uniform realignment, gene expression quantification, and batch effect removal. We find that uniform alignment and quantification are not sufficient when combining RNA-seq data from different sources, and that the removal of other batch effects is essential to facilitate data comparison. We have processed data from GTEx and TCGA and successfully corrected for study-specific biases, enabling comparative analysis between TCGA and GTEx. The normalized datasets are available for download on figshare.
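To make the idea of study-specific bias concrete, the following is a minimal sketch of one very simple form of batch adjustment: shifting each gene's per-batch mean to that gene's overall mean. It is an illustrative stand-in only, not the correction method used in the pipeline, which the abstract does not specify. The function name `center_batches`, the toy gene names, and the batch labels are all hypothetical, and the values are assumed to be on a log scale.

```python
from collections import defaultdict


def center_batches(expr, batches):
    """Shift each batch's per-gene mean to the gene's overall mean.

    expr    : dict mapping gene name -> list of log-scale expression values,
              one value per sample (hypothetical toy layout).
    batches : list of batch labels, one per sample, in the same order.

    This is plain per-batch mean-centering, a simplified stand-in for
    batch effect removal, not the authors' method.
    """
    corrected = {}
    for gene, values in expr.items():
        overall_mean = sum(values) / len(values)

        # Group this gene's values by the batch each sample came from.
        by_batch = defaultdict(list)
        for value, batch in zip(values, batches):
            by_batch[batch].append(value)
        batch_mean = {b: sum(v) / len(v) for b, v in by_batch.items()}

        # Remove the batch-specific offset, then restore the overall level.
        corrected[gene] = [
            value - batch_mean[batch] + overall_mean
            for value, batch in zip(values, batches)
        ]
    return corrected


# Toy data: GENE_A is shifted upward by 4 units in study_2.
toy_expr = {
    "GENE_A": [1.0, 2.0, 5.0, 6.0],
    "GENE_B": [3.0, 3.0, 3.0, 3.0],
}
toy_batches = ["study_1", "study_1", "study_2", "study_2"]

corrected = center_batches(toy_expr, toy_batches)
print(corrected["GENE_A"])
```

Mean-centering of this kind also removes any real biological difference that happens to coincide with the batch label, so it is only appropriate when the batches are expected to share the same underlying expression level.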

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1. Uniform processing of RNA-seq data from GTEx and TCGA.
Figure 2. Effect of uniform processing and batch effect removal on gene expression levels in GTEx and TCGA.
Two-dimensional plots are shown of principal components calculated by performing PCA of the gene expression values of bladder, prostate, and thyroid samples from GTEx and TCGA. (a) PCA of the level 3 data, i.e., the expression data from GTEx and TCGA. GTEx expression data was quantile normalized (see Supplementary Fig. S1B). (b) PCA of the expression data after uniform processing through our pipeline, before batch bias correction. (c) PCA of the expression data after uniform processing through our pipeline, after batch bias correction.
Figure 3. Hierarchical clustering of GTEx and TCGA bladder, prostate, and thyroid data shows the effect of uniform processing and batch effect correction.
(a) level 3 expression data from GTEx and TCGA; (b) gene expression calculated using our pipeline prior to batch bias correction; (c) our expression data after batch bias correction.
Figure 4. Normalized expression across tissue and cancer types for three known cancer genes: ERBB2, IGF2 and TP53.
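The Figure 2 caption notes that the GTEx expression data was quantile normalized. As a rough illustration of what that operation does, here is a minimal standard-library sketch: each sample's values are replaced by the mean, across samples, of the values at the same rank. The function name `quantile_normalize` and the toy numbers are assumptions for illustration, ties are broken by position rather than averaged, and this is not necessarily the implementation used for the figure.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of samples.

    columns : list of samples, where each sample is a list of expression
              values for the same genes in the same order.

    Returns a new list of samples that all share the same distribution
    of values. Ties are broken by position (a simplification).
    """
    n_genes = len(columns[0])
    n_samples = len(columns)

    # Reference distribution: mean of the r-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[r] for col in sorted_cols) / n_samples
        for r in range(n_genes)
    ]

    normalized = []
    for col in columns:
        # Gene indices ordered from lowest to highest expression in this sample.
        order = sorted(range(n_genes), key=lambda i: col[i])
        new_col = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            new_col[gene_index] = reference[rank]
        normalized.append(new_col)
    return normalized


# Two toy samples measured over the same three genes.
samples = [
    [5.0, 2.0, 3.0],
    [4.0, 1.0, 6.0],
]
normalized = quantile_normalize(samples)
print(normalized)
```

After normalization every sample contains the same set of values, so differences between samples reflect only the rank order of genes, not sample-wide shifts in scale.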

References

Data Citations

    1. Wang Q., Gao J., Nikolaus S. 2017. Figshare. https://doi.org/10.6084/m9.figshare.5330539
    2. Wang Q., Gao J., Nikolaus S. 2017. Figshare. https://doi.org/10.6084/m9.figshare.5330575
    3. Wang Q., Gao J., Nikolaus S. 2017. Figshare. https://doi.org/10.6084/m9.figshare.5330593
