Genome Biol. 2022 May 17;23(1):117.
doi: 10.1186/s13059-022-02681-3.

Identifying common transcriptome signatures of cancer by interpreting deep learning models

Anupama Jha et al. Genome Biol.

Abstract

Background: Cancer is a set of diseases characterized by unchecked cell proliferation and invasion of surrounding tissues. The many genes that have been genetically associated with cancer or shown to directly contribute to oncogenesis vary widely between tumor types, but common gene signatures that relate to core cancer pathways have also been identified. It is not clear, however, whether there exist additional sets of genes or transcriptomic features that are less well known in cancer biology but that are also commonly deregulated across several cancer types.

Results: Here, we agnostically identify transcriptomic features that are commonly shared between cancer types using 13,461 RNA-seq samples from 19 normal tissue types and 18 solid tumor types to train three feed-forward neural networks, based either on protein-coding gene expression, lncRNA expression, or splice junction use, to distinguish between normal and tumor samples. All three models recognize transcriptome signatures that are consistent across tumors. Analysis of attribution values extracted from our models reveals that genes that are commonly altered in cancer by expression or splicing variations are under strong evolutionary and selective constraints. Importantly, we find that genes composing our cancer transcriptome signatures are not frequently affected by mutations or genomic alterations and that their functions differ widely from the genes genetically associated with cancer.

Conclusions: Our results highlight that deregulation of RNA-processing genes and aberrant splicing are pervasive features on which core cancer pathways might converge across a large array of solid tumor types.

Keywords: Cancer genomics; Deep learning; Transcriptomics.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
A Upset plot summarizing pairwise differential gene expression analyses performed on tumors and their corresponding normal tissue. No gene is significantly deregulated in more than 9 out of 11 cancer types tested. B, C Dataset assembled to train and test binary models to distinguish between normal and tumor samples shown by tissue type (B) or dataset (C). D Graphical representation of the computational framework used to train, test, and interpret the models. E Performance of models trained with protein-coding gene expression, lncRNA gene expression, or splicing variations evaluated by area under the precision-recall curve (AUPRC) and accuracy (sum of true positives and true negatives over the total population). F Accuracy of models trained with protein-coding gene expression, lncRNA gene expression, or splice junction usage across the 13 datasets used to assemble the training set. G Performance of models trained with protein-coding gene expression, lncRNA gene expression, or splice junction usage on an independent dataset consisting of normal and cancer lung samples. H Performance of the deep learning model, SVM, and random forest using protein-coding gene expression on unseen tissue types (blood cancers) with no batch correction. The training set consists of solid tumors only
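The two metrics named in panel E can be illustrated with a minimal Python sketch. Accuracy follows the definition given in the caption (true positives plus true negatives over the total population). For AUPRC, the sketch uses step-wise average precision, which is one common estimator; the caption does not say which estimator the authors used, so that choice is an assumption. The function names and the toy labels and scores are hypothetical and are not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of correct calls: (TP + TN) / total population."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Samples are ranked by decreasing score; precision is accumulated at
    each rank where a positive (tumor = 1) sample is recovered.
    Assumption: scores contain no ties (ties are broken arbitrarily here).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


if __name__ == "__main__":
    # Hypothetical toy example: 1 = tumor, 0 = normal.
    labels = [1, 0, 1, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
    calls = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an assumption
    print("accuracy:", accuracy(labels, calls))
    print("AUPRC (average precision):", average_precision(labels, scores))
```

Accuracy depends on the score cutoff used to call a sample tumor or normal, whereas AUPRC summarizes the ranking across all cutoffs, which is why the two are reported side by side.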
Fig. 2
A Selection of high-attribution features from models trained with protein-coding gene expression, lncRNA gene expression or splice junction usage. Dotted lines show cutoffs used; purple points around coordinates (0,0) show features selected in neutral sets. B Median attribution values of 100 protein-coding genes, lncRNAs or splice junctions with the highest positive and negative attributions across tumor tissues. C Median attribution values of genes associated with cancer from the COSMIC database. D Overlap between COSMIC oncogenes and TSGs, and genes with high positive (top panel) or negative (bottom panel) attribution values, or between COSMIC tier 1 genes (high confidence for causal role in cancer) and tier 2 genes (some evidence of causal involvement in cancer), and all high-attribution genes (central panel). E Overlap between genes associated with cancer and genes harboring junctions with high attribution values. In both D and E, enrichment or depletion factors were calculated from the ratio of observed vs. expected overlapping genes between sets, and p-values were calculated using the hypergeometric test
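The overlap statistics described for panels D and E (an enrichment or depletion factor from observed vs. expected overlap, with a hypergeometric p-value) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code; the function name, the choice of gene universe, and the toy counts are all hypothetical.

```python
from math import comb


def overlap_enrichment(universe, size_a, size_b, observed):
    """Enrichment factor and one-sided hypergeometric p-values for a set overlap.

    universe : total number of genes considered (assumed background size)
    size_a   : size of the first gene set (e.g. a cancer gene list)
    size_b   : size of the second gene set (e.g. high-attribution genes)
    observed : number of genes shared by both sets
    """
    if not (0 < size_a <= universe and 0 < size_b <= universe):
        raise ValueError("set sizes must be between 1 and the universe size")
    if not (0 <= observed <= min(size_a, size_b)):
        raise ValueError("observed overlap is out of range")

    # Expected overlap if the two sets were drawn independently.
    expected = size_a * size_b / universe
    factor = observed / expected  # > 1 suggests enrichment, < 1 depletion

    denom = comb(universe, size_b)

    def pmf(k):
        # Probability of exactly k shared genes under the hypergeometric null.
        return comb(size_a, k) * comb(universe - size_a, size_b - k) / denom

    p_enriched = sum(pmf(k) for k in range(observed, min(size_a, size_b) + 1))
    p_depleted = sum(pmf(k) for k in range(0, observed + 1))
    return factor, p_enriched, p_depleted


if __name__ == "__main__":
    # Hypothetical toy counts, not values from the paper.
    factor, p_hi, p_lo = overlap_enrichment(universe=10, size_a=4, size_b=5, observed=3)
    print(factor, p_hi, p_lo)
```

The upper-tail p-value tests for enrichment and the lower-tail p-value for depletion; which tail is reported would depend on whether the observed overlap is above or below the expected value.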
Fig. 3
Fraction of genes with driver mutations (A), frequency of passenger mutations (B), structural variants (C), amplification events (D), or deletion events (E) in the TCGA cohort. The analysis was performed on the following sets of genes: COSMIC oncogenes (yellow), COSMIC tumor-suppressor genes (TSGs, green), genes with high positive (“protein-coding positive,” red), high negative (“protein-coding negative,” blue), or neutral attribution values by expression (“protein-coding neutral,” gray) or genes with splice junctions with high (“splicing high,” light red) or neutral (“splicing neutral,” light gray) attribution values in our models. p-values were calculated using a one-way ANOVA with Tukey post hoc tests
Fig. 4
A Evolutionary conservation of protein-coding genes (left panel), genes with variable splice junctions (middle panel) or lncRNA genes (right panel) across human, chimpanzee, mouse, cattle, xenopus, zebrafish, and chicken. B Gene length derived from the longest annotated transcript in Ensembl (hg38) for protein-coding genes (left panel), genes with variable splice junctions (middle panel), or lncRNA genes (right panel). C Selective pressure against loss-of-function mutations in the human population as assessed by gnomAD LOEUF score [26], showing score for protein-coding genes (left panel) or genes with splice junctions (right panel) with high attribution values. A low LOEUF score implies high selective pressure against loss of function. D Pyknon density in lncRNA genes with high attribution values. All p-values shown are calculated using an unpaired t-test
Fig. 5
Protein domains that are affected by at least two splice junctions with high attributions
Fig. 6
A Overlap between genes with high positive (red set) or negative (blue set) attribution values and genes with splice junctions that have high attribution values (light red set) with the list of genes overlapping between sets. B Enrichment map showing GO terms related to biological processes that are enriched among protein-coding genes with high negative (protein-coding negative, blue) or positive (protein-coding positive, red) attributions by expression, genes with high-attribution splice junctions (splice junctions, light red), or enriched both in protein-coding positive and splice junction genes (protein-coding positive and splice junctions, dark red). Each node is a GO term and the color of the nodes corresponds to gene sets in which they are enriched. The thickness of edges corresponds to the number of genes in common between GO terms. C Heatmap of terms obtained from an Ingenuity Pathway Analysis (QIAGEN) for molecular and cellular functions. D Gene set enrichment analysis on high-attribution protein-coding genes by expression showing an enrichment in KRAS signaling among negative attribution genes. High-attribution genes were ranked on the x-axis from high positive (left, red) to high negative attribution (right, blue)
Fig. 7
Network analysis of common cancer pathways (PI3K, cell cycle, Myc, Rtk-Ras, Notch, Hippo, TP53, TGF-beta) together with genes in GO terms related to RNA regulation or RNA processing that are enriched in our protein-coding or splicing models. Each node is a network as predicted with Ingenuity Pathway Analysis; the size of the nodes represents the number of molecules in each network and the thickness of the edges represents the number of molecules in common between two networks. Networks formed by high-attribution genes in our protein-coding and splicing models are highlighted with a thicker node border. Only networks comprising at least three molecules and connected by at least two shared molecules are shown. Numbers identify networks; Additional file 2: Table S13 lists the molecules found in each network (node)

References

    1. Haigis KM, Cichowski K, Elledge SJ. Tissue-specificity in cancer: the rule, not the exception. Science. 2019;363(6432):1150–1. doi: 10.1126/science.aaw3472.
    2. Sanchez-Vega F, Mina M, Armenia J, Chatila WK, Luna A, La KC, Dimitriadoy S, Liu DL, Kantheti HS, Saghafinia S, et al. Oncogenic signaling pathways in the cancer genome atlas. Cell. 2018;173(2):321–37. doi: 10.1016/j.cell.2018.03.035.
    3. Paull EO, Aytes A, Jones SJ, Subramaniam PS, Giorgi FM, Douglass EF, Tagore S, Chu B, Vasciaveo A, Zheng S, Verhaak R, Abate-Shen C, Alvarez MJ, Califano A. A modular master regulator landscape controls cancer transcriptional identity. Cell. 2021;184(2):334–351.e20. doi: 10.1016/j.cell.2020.11.045.
    4. Dave SS, Wright G, Tan B, Rosenwald A, Gascoyne RD, Chan WC, Fisher RI, Braziel RM, Rimsza LM, Grogan TM, et al. Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. N Engl J Med. 2004;351(21):2159–69. doi: 10.1056/NEJMoa041869.
    5. Roessler S, Jia H-L, Budhu A, Forgues M, Ye Q-H, Lee J-S, Thorgeirsson SS, Sun Z, Tang Z-Y, Qin L-X, et al. A unique metastasis gene signature enables prediction of tumor relapse in early-stage hepatocellular carcinoma patients. Cancer Res. 2010;70(24):10202–12. doi: 10.1158/0008-5472.CAN-10-2607.
