EBioMedicine. 2020 Nov:61:103030.

doi: 10.1016/j.ebiom.2020.103030. Epub 2020 Oct 9.

CUP-AI-Dx: A tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence

Yue Zhao¹, Ziwei Pan², Sandeep Namburi¹, Andrew Pattison³, Atara Posner³, Shiva Balachander³, Carolyn A Paisie¹, Honey V Reddi⁴, Jens Rueter⁵, Anthony J Gill⁶, Stephen Fox⁷, Kanwal P S Raghav⁸, William F Flynn¹, Richard W Tothill⁹, Sheng Li¹⁰, R Krishna Murthy Karuturi¹¹, Joshy George¹²

Affiliations

¹ The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, USA.
² The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, USA; Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, USA.
³ Department of Clinical Pathology and Centre for Cancer Research, University of Melbourne, Parkville, Melbourne, Australia.
⁴ The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, USA; The Jackson Laboratory Cancer Center, Bar Harbor, ME, USA.
⁵ The Jackson Laboratory Cancer Center, Bar Harbor, ME, USA.
⁶ Cancer Diagnosis and Pathology Group, Kolling Institute of Medical Research, Royal North Shore Hospital, St Leonards, New South Wales 2065 Australia; NSW Health Pathology, Department of Anatomical Pathology, Royal North Shore Hospital, Sydney, New South Wales 2065 Australia; Department of Anatomical Pathology, Douglass Hanly Moir Pathology, Macquarie Park, New South Wales 2113 Australia; University of Sydney, Sydney, New South Wales 2006 Australia.
⁷ Peter MacCallum Cancer Centre, Department of Pathology, University of Melbourne, Victoria, Australia.
⁸ Department of Gastrointestinal Medical Oncology, Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
⁹ Department of Clinical Pathology and Centre for Cancer Research, University of Melbourne, Parkville, Melbourne, Australia; Peter MacCallum Cancer Centre, Parkville, Melbourne, Australia. Electronic address: rtothill@unimelb.edu.au.
¹⁰ The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, USA; The Jackson Laboratory Cancer Center, Bar Harbor, ME, USA; Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, USA; Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA. Electronic address: sheng.li@jax.org.
¹¹ The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, USA; The Jackson Laboratory Cancer Center, Bar Harbor, ME, USA; Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA. Electronic address: krishna.karuturi@jax.org.
¹² The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, USA; The Jackson Laboratory Cancer Center, Bar Harbor, ME, USA. Electronic address: joshy.george@jax.org.

PMID: 33039710
PMCID: PMC7553237
DOI: 10.1016/j.ebiom.2020.103030

Yue Zhao et al. EBioMedicine. 2020 Nov.

Abstract

Background: Cancer of unknown primary (CUP), representing approximately 3-5% of all malignancies, is defined as metastatic cancer where a primary site of origin cannot be found despite a standard diagnostic workup. Because knowledge of a patient's primary cancer remains fundamental to their treatment, CUP patients are significantly disadvantaged and most have a poor survival outcome. Developing robust and accessible diagnostic methods for resolving cancer tissue of origin, therefore, has significant value for CUP patients.

Methods: We developed an RNA-based classifier called CUP-AI-Dx that uses a 1D Inception convolutional neural network (1D-Inception) model to infer a tumour's primary tissue of origin. CUP-AI-Dx was trained using the transcriptional profiles of 18,217 primary tumours representing 32 cancer types from The Cancer Genome Atlas project (TCGA) and International Cancer Genome Consortium (ICGC). Gene expression data were ordered by gene chromosomal coordinates as input to the 1D-CNN model, and the model applies multiple convolutional kernels with different configurations simultaneously to improve generality. The model was optimized through extensive hyperparameter tuning, including different max-pooling layers and dropout settings. For 11 tumour types, we also developed a random forest model that can classify the tumour's molecular subtype according to prior TCGA studies. The optimised CUP-AI-Dx tissue of origin classifier was tested on 394 metastatic samples from 11 tumour types from TCGA and 92 formalin-fixed paraffin-embedded (FFPE) samples representing 18 cancer types from two clinical laboratories. The CUP-AI-Dx molecular subtype classifier was also tested on independent ovarian and breast cancer microarray datasets.

Findings: CUP-AI-Dx identifies the primary site with an overall top-1-accuracy of 98.54% in cross-validation and 96.70% on a test dataset. When applied to two independent clinical-grade RNA-seq datasets generated from two different institutes from the US and Australia, our model predicted the primary site with a top-1-accuracy of 86.96% and 72.46% respectively.
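The Methods state that expression values are ordered by gene chromosomal coordinates before being fed to the 1D model. The snippet below is a minimal sketch of that ordering step only, not the authors' code. The gene names, coordinates, expression values, and the helper names `chrom_rank` and `order_by_coordinate` are all hypothetical placeholders chosen for illustration.

```python
def chrom_rank(chrom):
    """Sort key for chromosome names: autosomes numerically, then X, Y, MT."""
    special = {"X": 23, "Y": 24, "MT": 25}
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return special[name] if name in special else int(name)


def order_by_coordinate(expression, annotation):
    """Return (genes, values) sorted by chromosome, then start position."""
    genes = sorted(
        expression,
        key=lambda g: (chrom_rank(annotation[g][0]), annotation[g][1]),
    )
    return genes, [expression[g] for g in genes]


# Placeholder annotation: gene -> (chromosome, start). Not real coordinates.
annotation = {
    "G_a": ("chr2", 5000),
    "G_b": ("chr1", 900),
    "G_c": ("chrX", 10),
    "G_d": ("chr1", 100),
}
# Placeholder expression values for one sample.
expression = {"G_a": 2.5, "G_b": 0.1, "G_c": 7.0, "G_d": 3.3}

ordered_genes, vector = order_by_coordinate(expression, annotation)
print(ordered_genes)  # ['G_d', 'G_b', 'G_a', 'G_c']
print(vector)         # [3.3, 0.1, 2.5, 7.0]
```

Ordering the vector this way means neighbouring positions in the 1D input correspond to neighbouring genes on the genome, which is what gives a convolution over the vector a consistent notion of locality.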

Interpretation: CUP-AI-Dx predicts tumour primary site and molecular subtype with high accuracy and can therefore be used to assist the diagnostic work-up of cancers of unknown primary or uncertain origin using a common and accessible genomics platform.

Funding: NIH R35 GM133562, NCI P30 CA034196, Victorian Cancer Agency Australia.

Keywords: Cancer; Cancer-of-unknown-primary; Cell-of-origin; Classification; Convolutional neural network; Deep learning; Inception model; Machine learning; TCGA.

Figures

**Fig. 1**
Prediction workflow for primary tumour types and subtypes. (a) Schematic showing the learning procedure used to train the 1D-Inception model from labeled TCGA and ICGC transcriptomes spanning 32 cancer types for primary tumour type prediction. Models were trained with 70% training data and validated with 30% test data on normalized and standard scaled expression profiles. 817 features were selected (see Materials and methods). Primary tumour type classification performance was evaluated via cross-validation on the learning set of TCGA and ICGC primary tumour samples and external validation utilizing primary tumour types from transcriptomes of metastatic samples and clinical samples. (b) Illustration of the 1D Inception architecture optimized by Talos scanning on the TCGA and ICGC datasets. Each rectangle represents a layer in the neural network. For convolutional layers, kernel size is shown, and layers with the same kernel size are painted the same color. Max pooling layers are green rectangles with the pooling window size inside. Dark grey rectangles are dropout layers with keep probability shown. The concatenation layer has 1696 hidden nodes, a size determined by the output size of the convolutional layers. The bottom portion shows the output layer below two fully connected layers with 128 nodes each. (c) Schematic showing the learning procedure used to train random forest (RF) models with 11 molecular subtypes for cancer subtype prediction. Models were trained and evaluated using 10-fold cross-validation on normalized and standard scaled expression profiles. N features were selected from each class (see Methods) and pooled for each fold to construct 11 molecular subtype predictors for random forest (RF). Cancer subtype classification performance was evaluated via cross-validation on the learning set and external validation utilizing breast and ovarian cancer datasets.
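Panel (b) describes parallel convolutional branches with different kernel sizes whose pooled outputs are concatenated, so that the concatenation size follows from the branch output sizes. The toy sketch below illustrates that idea in plain Python with one fixed filter per branch. It is not the published architecture: the input signal, the kernels, the pooling window, and the function names are assumptions made for illustration, and a real model would learn many filters per branch.

```python
def conv1d_valid(signal, kernel):
    """1D cross-correlation without padding (output length n - k + 1)."""
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(n - k + 1)
    ]


def max_pool(values, window):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


def inception_block(signal, kernels, pool_window):
    """Run each kernel as its own branch, pool, then concatenate."""
    branches = [max_pool(conv1d_valid(signal, k), pool_window) for k in kernels]
    return [value for branch in branches for value in branch]


# Illustrative input and fixed (not learned) kernels of size 1, 2 and 3.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernels = [[1.0], [0.5, 0.5], [1.0, 0.0, -1.0]]

features = inception_block(signal, kernels, pool_window=2)
print(len(features))  # 10 = 4 + 3 + 3, set by the branch output sizes
```

Each branch shortens the input by a different amount, so the length of the concatenated vector is a consequence of the kernel and pooling choices rather than a free parameter.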

**Fig. 2**
Primary tumour type prediction performance of CNN models on the TCGA dataset. (a) Validation data cross-entropy loss of CNN models. One can observe that the training processes of all three models successfully converged. (b) Overall prediction accuracy of CNN models in cross-validation and external metastasis validation. (c) Per-class accuracy performance of CNN models.

**Fig. 3**
Cross- and external validation of primary tumour type predictor. The 1D-Inception model was constructed for primary tumour type prediction. 32 primary tumour types are grouped by the pan-organ system. (a) Inception model confusion matrix for cross-validation of 32 primary tumour types on the TCGA and ICGC datasets. Accuracy for each prediction class is shown to the right of the table. (b) 394 expression profiles of TCGA metastatic tumours from the primary site of origin spanning 11 organs were classified by the primary tumour type predictor. (c) 23 expression profiles of clinical datasets spanning 6 cancer types were classified by the primary tumour type predictor. (d) 69 expression profiles of the Melbourne dataset spanning 18 cancer types were classified by the primary tumour type predictor. The text in contingency table cell *c_j,i* of (b) and (c) shows the number of class i tumour samples classified as class j. The heatmap of the confusion matrix is coloured in grayscale. Colour shading along the main diagonal shows pan-organ groups.
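The caption's cell convention, where *c_j,i* counts class i samples classified as class j, can be made concrete with a small sketch. The labels below are invented toy data that reuse three TCGA cancer codes as class names, and the per-class accuracy here is taken to mean the fraction of true class i samples predicted as class i, which is an assumption about the metric rather than the paper's stated definition.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Build c[j][i]: number of true class i samples predicted as class j."""
    index = {name: position for position, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[prediction]][index[truth]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class (column) that lies on the diagonal."""
    result = []
    for i in range(len(matrix)):
        column_total = sum(row[i] for row in matrix)
        result.append(matrix[i][i] / column_total if column_total else None)
    return result


def top1_accuracy(matrix):
    """Correct predictions (diagonal) over all samples."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Toy labels for illustration only; not results from the paper.
classes = ["GBM", "LGG", "STAD"]
true_labels = ["GBM", "GBM", "LGG", "LGG", "LGG", "STAD"]
predicted = ["GBM", "LGG", "LGG", "LGG", "GBM", "STAD"]

matrix = confusion_matrix(true_labels, predicted, classes)
print(matrix)                      # [[1, 1, 0], [1, 2, 0], [0, 0, 1]]
print(per_class_accuracy(matrix))
print(top1_accuracy(matrix))
```

With this orientation, reading down column i shows where samples of true class i ended up, so off-diagonal entries in a column are that class's misclassifications.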

**Fig. 4**
Unsupervised embedding of expression profiles reveals relationships among primary sites. Expression profiles from all samples in the TCGA dataset were embedded into two dimensions using uniform manifold approximation and projection (UMAP) and colored by primary tumour type. For each cancer, labels are placed near the centroid of the expression profile in the UMAP latent space. Anatomical and histological relationships are emergent and add context to the most common misclassifications in Figure S2a. The following groups of cancers are highlighted with green, blue, and purple ellipses, respectively: i) COADREAD, STAD; ii) BLCA, CESC, ESCA, HNSC, LUSC; iii) GBM, LGG.

**Fig. 5**
Cross- and external validation of molecular subtype predictors. A predictor of molecular subtypes was constructed for each of 11 primary tumour types, spanning 38 molecular subtypes on the TCGA dataset. (a) Per-class accuracy, (b) specificity, and (c) sensitivity of molecular subtype classifications evaluated through cross-validation (Fig. 1c). To further validate these subtype predictors, ovarian (d) and breast (e) subtype predictors were used to predict the respective molecular subtypes in two external datasets (GSE9899 and EGAS00000000083, respectively).
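Panels (b) and (c) report specificity and sensitivity per molecular subtype. For a multi-class predictor these are commonly computed one-vs-rest, and the sketch below assumes that convention; the caption does not state how the authors computed them. The subtype labels `S1` to `S3` and the predictions are hypothetical toy data.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, positive):
    """Return (sensitivity, specificity) treating `positive` as the target class."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == positive and prediction == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif prediction == positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical subtype labels; not data from the paper.
true_labels = ["S1", "S1", "S2", "S2", "S3", "S3", "S3", "S1"]
predicted = ["S1", "S2", "S2", "S2", "S3", "S1", "S3", "S1"]

sensitivity, specificity = one_vs_rest_metrics(true_labels, predicted, "S1")
print(sensitivity)  # 2 of 3 true S1 samples recovered
print(specificity)  # 0.8, since 4 of 5 non-S1 samples were not called S1
```

Reporting both values per subtype separates a predictor that misses a subtype (low sensitivity) from one that over-calls it (low specificity), which a single accuracy figure can hide.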

Cited by

Automatic origin prediction of liver metastases via hierarchical artificial-intelligence system trained on multiphasic CT data: a retrospective, multicentre study.
Xin H, Zhang Y, Lai Q, Liao N, Zhang J, Liu Y, Chen Z, He P, He J, Liu J, Zhou Y, Yang W, Zhou Y. EClinicalMedicine. 2024 Feb 1;69:102464. doi: 10.1016/j.eclinm.2024.102464. eCollection 2024 Mar. PMID: 38333364.
Omics Data and Data Representations for Deep Learning-Based Predictive Modeling.
Tsimenidis S, Vrochidou E, Papakostas GA. Int J Mol Sci. 2022 Oct 14;23(20):12272. doi: 10.3390/ijms232012272. PMID: 36293133. Review.
Generalising uncertainty improves accuracy and safety of deep learning analytics applied to oncology.
MacDonald S, Foley H, Yap M, Johnston RL, Steven K, Koufariotis LT, Sharma S, Wood S, Addala V, Pearson JV, Roosta F, Waddell N, Kondrashova O, Trzaskowski M. Sci Rep. 2023 May 6;13(1):7395. doi: 10.1038/s41598-023-31126-5. PMID: 37149669.
Validation of a Transcriptome-Based Assay for Classifying Cancers of Unknown Primary Origin.
Michuda J, Breschi A, Kapilivsky J, Manghnani K, McCarter C, Hockenberry AJ, Mineo B, Igartua C, Dudley JT, Stumpe MC, Beaubier N, Shirazi M, Jones R, Morency E, Blackwell K, Guinney J, Beauchamp KA, Taxter T. Mol Diagn Ther. 2023 Jul;27(4):499-511. doi: 10.1007/s40291-023-00650-5. Epub 2023 Apr 26. PMID: 37099070.
OncoTrace-TOO: Interpretable Machine Learning Framework for Cancer Tissue-of-Origin Identification Using Transcriptomic Signatures.
Hao Y, Huang H, Huang D, Ruan J, Liu X, Zhang J. Cancer Rep (Hoboken). 2025 Aug;8(8):e70311. doi: 10.1002/cnr2.70311. PMID: 40784724.
