BMC Genomics. 2017 Jul 3;18(1):508.
doi: 10.1186/s12864-017-3906-0.

A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data

Yuanyuan Li et al. BMC Genomics. 2017.

Abstract

Background: The Cancer Genome Atlas (TCGA) has generated comprehensive molecular profiles. We aim to identify a set of genes whose expression patterns can distinguish diverse tumor types. Those features may serve as biomarkers for tumor diagnosis and drug development.

Methods: Using RNA-seq expression data, we undertook a pan-cancer classification of 9,096 TCGA tumor samples representing 31 tumor types. We randomly assigned 75% of the samples to a training set and 25% to a test set, allocating samples from each tumor type proportionally.
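The proportional 75%/25% allocation described in the Methods can be sketched as a stratified split. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the fixed seed, the rounding rule, and the toy label counts are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split sample indices into train/test within each tumor type.

    Hypothetical helper: the source only states that samples were
    allocated proportionally by tumor type, not how this was coded.
    """
    rng = random.Random(seed)

    # Group sample indices by their tumor-type label.
    by_type = defaultdict(list)
    for index, tumor_type in enumerate(labels):
        by_type[tumor_type].append(index)

    train, test = [], []
    for tumor_type in sorted(by_type):
        indices = by_type[tumor_type]
        rng.shuffle(indices)
        # Rounding rule is an assumption; the source does not specify one.
        n_train = round(len(indices) * train_fraction)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Toy labels with made-up counts (not the TCGA sample sizes).
labels = ["COAD"] * 40 + ["READ"] * 20 + ["LIHC"] * 8
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each tumor type keeps rare types represented in both partitions, which a single global shuffle does not guarantee.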

Results: We could correctly classify more than 90% of the test set samples. Accuracies were high for all but three of the 31 tumor types; the most notable exception was READ (rectum adenocarcinoma), which was largely indistinguishable from COAD (colon adenocarcinoma). We also carried out pan-cancer classification, separately for males and females, on 23 sex non-specific tumor types (those unrelated to reproductive organs). Results from these gender-specific analyses largely recapitulated the results obtained when gender was ignored. Remarkably, more than 80% of the 100 most discriminative genes selected separately for each gender overlapped. Genes that were differentially expressed between genders included BNC1, FAT2, FOXA1, and HOXA11. FOXA1 has been shown to play a role in sexual dimorphism in liver cancer. The differentially discriminative genes we identified might be important for the gender differences in tumor incidence and survival.
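The overlap between the top-ranked genes from the male and female analyses is a simple set computation. The sketch below assumes each analysis yields a list of genes ordered from most to least discriminative; the function name `top_gene_overlap` and the placeholder gene names are hypothetical and carry no biological meaning.

```python
def top_gene_overlap(ranked_a, ranked_b, top_n=100):
    """Return (shared genes, fraction shared) for the top_n of two rankings.

    Hypothetical helper for illustration; both rankings are assumed to be
    ordered from most to least discriminative.
    """
    if len(ranked_a) < top_n or len(ranked_b) < top_n:
        raise ValueError("both rankings must contain at least top_n genes")
    top_a = set(ranked_a[:top_n])
    top_b = set(ranked_b[:top_n])
    shared = top_a & top_b
    return sorted(shared), len(shared) / top_n


# Placeholder rankings (invented names, not results from the paper).
male_ranking = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
female_ranking = ["GENE_C", "GENE_B", "GENE_F", "GENE_D", "GENE_G"]

shared, fraction = top_gene_overlap(male_ranking, female_ranking, top_n=4)
print(shared, fraction)
```

Genes in one top list but not the other (here `GENE_A` and `GENE_F`) are the candidates for sex-specific discriminative genes.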

Conclusions: We were able to identify many sets of 20 genes that could correctly classify more than 90% of the samples from 31 different tumor types using TCGA RNA-seq data. This accuracy is remarkable given the number of tumor types and the total number of samples involved. We achieved similar results when we analyzed 23 non-sex-specific tumor types separately for males and females. We regard the frequency with which a gene appeared in those sets as a measure of its importance for tumor classification. One third of the 50 most frequently appearing genes were pseudogenes; the degree of enrichment may be indicative of their importance in tumor classification. Lastly, we identified a few genes that might play a role in sexual dimorphism in certain cancers.
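Ranking genes by how often they appear across many selected gene sets can be sketched as a frequency count. This is an illustrative sketch only: the function name `gene_selection_frequency`, the alphabetical tie-breaking rule, and the toy gene sets (three sets of three placeholder genes rather than many sets of 20) are assumptions.

```python
from collections import Counter


def gene_selection_frequency(gene_sets):
    """Count how many gene sets each gene appears in, most frequent first.

    Hypothetical helper; ties are broken alphabetically so the ranking is
    deterministic (the source does not describe tie handling).
    """
    counts = Counter()
    for gene_set in gene_sets:
        # set() ensures a gene counts once per set even if listed twice.
        counts.update(set(gene_set))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy gene sets with placeholder names (not genes from the paper).
gene_sets = [
    ["G1", "G2", "G3"],
    ["G1", "G3", "G4"],
    ["G1", "G5", "G2"],
]
ranked = gene_selection_frequency(gene_sets)
print(ranked)
```

The head of the ranked list corresponds to the "most frequently appearing genes" referred to above.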

Keywords: Classification; GA/KNN; Pan-cancer; RNA-seq; Sex dimorphism; TCGA.

Conflict of interest statement

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Proportion of test-set samples predicted to be each of the 31 tumor types. Y-axis lists the 31 actual tumor types; x-axis lists the 32 possible classification categories (31 tumor types plus “unclassified” [UC]). Each bar represents one of the 32 proportions that samples from the actual tumor type were predicted to be. The 32 plotted proportions represent means from the corresponding proportions for all samples of the actual tumor type
Fig. 2
Stem plot of gene selection frequency based on 2000 near optimal gene selection classifiers from 1000 GA/KNN runs for each of two training/testing partitions
Fig. 3
Heatmap representation of the expression patterns of the top 50 genes across all 9096 samples. Each row (gene) was centered by the median expression value across all samples. A hierarchical clustering analysis was carried out for both samples and genes using the Euclidean distance as the similarity metric
Fig. 4
Boxplots of FOXA1 expression data in the 23 sex non-specific tumors from males (blue) and females (pink)
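The preprocessing described in the Fig. 3 caption (centering each gene by its median across samples, then comparing rows with Euclidean distance) can be sketched as follows. The helper names and the toy matrix are assumptions, and the clustering step itself is left out: the caption does not state a linkage method, so any choice there (for example, passing the distances to a library routine such as SciPy's `scipy.cluster.hierarchy.linkage`) would be a guess.

```python
import math
from statistics import median


def median_center_rows(matrix):
    """Subtract each row's median from every value in that row.

    Rows are genes and columns are samples, following the caption.
    """
    centered = []
    for row in matrix:
        row_median = median(row)
        centered.append([value - row_median for value in row])
    return centered


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(rows):
    """Full symmetric matrix of Euclidean distances between rows."""
    return [[euclidean(r, s) for s in rows] for r in rows]


# Toy expression matrix: 2 genes x 3 samples (invented values).
expression = [
    [1.0, 2.0, 6.0],
    [10.0, 10.0, 13.0],
]
centered = median_center_rows(expression)
gene_distances = pairwise_distances(centered)
print(centered)
print(gene_distances[0][1])
```

Clustering samples instead of genes uses the same distance function on the transposed matrix, for example `pairwise_distances([list(col) for col in zip(*centered)])`.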
