Cancer Discov. 2024 Jun 3;14(6):1064-1081.
doi: 10.1158/2159-8290.CD-23-0996.

Deep-Learning Model for Tumor-Type Prediction Using Targeted Clinical Genomic Sequencing Data

Madison Darmofal et al. Cancer Discov.

Abstract

Tumor type guides clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly diagnostic of tumor type, and tumor-type classifiers trained on genomic features have been explored, but the most accurate methods are not clinically feasible because they rely on features derived from whole-genome sequencing (WGS) or predict across only a limited set of cancer types. We use genomic features from a data set of 39,787 solid tumors sequenced using a clinically targeted cancer gene panel to develop Genome-Derived-Diagnosis Ensemble (GDD-ENS): a hyperparameter ensemble for classifying tumor type using deep neural networks. GDD-ENS achieves 93% accuracy for high-confidence predictions across 38 cancer types, rivaling the performance of WGS-based methods. GDD-ENS can also guide diagnoses of rare types and cancers of unknown primary and incorporate patient-specific clinical information for improved predictions. Overall, integrating GDD-ENS into prospective clinical sequencing workflows could provide clinically relevant tumor-type predictions to guide treatment decisions in real time.

Significance: We describe a highly accurate tumor-type prediction model, designed specifically for clinical implementation. Our model relies only on widely used cancer gene panel sequencing data, predicts across 38 distinct cancer types, and supports integration of patient-specific nongenomic information for enhanced decision support in challenging diagnostic situations. See related commentary by Garg, p. 906. This article is featured in Selected Articles from This Issue, p. 897.
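
The abstract describes GDD-ENS as an ensemble of deep neural networks that reports predictions with confidence estimates. A minimal sketch of that general idea follows, assuming the ensemble members' class probabilities are combined by a simple mean; the abstract does not state the combination rule. The function name `ensemble_top_k`, the example probabilities, and the `HIGH_CONF_THRESHOLD` cutoff are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def ensemble_top_k(
    member_probs: Sequence[Sequence[float]],
    labels: Sequence[str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Average per-model class probabilities and return the k most likely labels.

    Assumption: members are combined by an unweighted mean.
    """
    n_models = len(member_probs)
    mean_probs = [
        sum(probs[i] for probs in member_probs) / n_models
        for i in range(len(labels))
    ]
    ranked = sorted(zip(labels, mean_probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical softmax outputs from three ensemble members over four tumor types.
labels = ["NSCLC", "Breast", "Prostate", "Bladder"]
member_probs = [
    [0.80, 0.10, 0.05, 0.05],
    [0.70, 0.20, 0.05, 0.05],
    [0.90, 0.05, 0.03, 0.02],
]

top3 = ensemble_top_k(member_probs, labels, k=3)

# Assumed cutoff for calling a prediction "high confidence" (illustrative only).
HIGH_CONF_THRESHOLD = 0.75
is_high_conf = top3[0][1] >= HIGH_CONF_THRESHOLD
```

Here `top3` holds the three highest-ranked types with their averaged probabilities, and the top entry's probability serves as the confidence estimate that is compared against the cutoff.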

Figures

Figure 1. Overview of GDD-ENS model. A, Cohort diagram, detailing samples used to form a training and testing cohort. Hematologic refers to blood-based cancers (e.g., leukemias) sequenced using MSK-IMPACT before the development of a separate nonsolid tumor assay. B, Training set distribution of cancer types included after expanding GDD-RF to GDD-ENS, colored by model inclusion. Any type with fewer than 350 examples is upsampled via replacement during training. NSCLC, non–small cell lung cancer; GIST, gastrointestinal stromal tumor; SQC, squamous cell carcinoma; SCLC, small cell lung cancer; PNET, pancreatic neuroendocrine tumor; Lu-NET, lung neuroendocrine tumor; GI-NET, gastrointestinal neuroendocrine tumor; Carc., carcinoma; MPNST, malignant peripheral nerve sheath tumor. C, Distribution of informative feature types. “Other” refers to clinical features or numerical features representing overall mutational burden across categories. Allele-specific hotspot features annotate the specific hotspot mutation (e.g., KRAS G12C), whereas gene-specific hotspot features only specify the gene altered (e.g., KRAS hotspot). CNAs, copy-number alterations. D, GDD-ENS workflow from patient to output. All nonclinical features are derived from the MSK-IMPACT sequencing assay and then fed into GDD-ENS. GDD-ENS reports the top three tumor-type predictions with confidence estimates, along with the 10 most important features for the top prediction on a patient-specific basis. Workflow overview created with BioRender (https://www.biorender.com/).
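
The upsampling rule in panel B (types with fewer than 350 training examples are upsampled via replacement) can be sketched as follows. This is an illustration only: it assumes each rare type is topped up to exactly 350 examples, which the caption does not state, and the `upsample` helper and the toy sample records are hypothetical.

```python
import random
from collections import Counter, defaultdict

MIN_PER_TYPE = 350  # threshold stated in the caption


def upsample(samples, min_per_type=MIN_PER_TYPE, seed=0):
    """Return training samples with rare types topped up by sampling with replacement.

    samples: list of (features, tumor_type) pairs.
    Assumption: rare types are raised to exactly `min_per_type` examples.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for features, tumor_type in samples:
        by_type[tumor_type].append((features, tumor_type))

    balanced = []
    for group in by_type.values():
        balanced.extend(group)  # keep every original example
        shortfall = min_per_type - len(group)
        if shortfall > 0:
            # Draw the missing examples with replacement from the same type.
            balanced.extend(rng.choices(group, k=shortfall))
    return balanced


# Toy data: one common type and one rare type (placeholder features).
samples = [({"id": i}, "NSCLC") for i in range(400)]
samples += [({"id": i}, "MPNST") for i in range(10)]

balanced = upsample(samples)
counts = Counter(tumor_type for _, tumor_type in balanced)
```

In this toy run the common type is left unchanged while the rare type is padded with repeated draws from its own 10 examples.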
Figure 2. GDD-ENS performance across cancer types. A, Row-normalized confusion matrix of high-confidence predictions across cancer types. Each off-diagonal value is the proportion of predictions in a row (predicted cancer type) whose true type is given by the column. The true prostate cancer, predicted ependymoma off-diagonal value indicates a single prostate cancer sample that was predicted as ependymoma, likely due to low tumor burden, as it just passed the inclusion thresholds for tumor content. Due to rounding, some rows may not sum to 100. Top bar graphs indicate test set count (top) and type-specific recall across confidence levels. The left bar graphs represent high-confidence positive predictive value (high conf. PPV, right) and type-specific precision. Both axes are sorted by overall type precision. B, Calibration plot for GDD-Single models and the GDD-ENS ensemble. The X-axis represents maximum confidence after binning outputs into five equally sized ranges, and the Y-axis represents the overall accuracy of all predictions within that range. The blue line represents the GDD-ENS model; the dark gray line and shaded regions represent mean GDD-Single accuracy and 95% confidence interval, respectively. Expected calibration error (ECE) calculated as per Methods. C, Shapley value score distributions for correct GDD-ENS predictions across individual types (left, middle) and all predictions (right). Importance score proportion represents the proportion of total Shapley value scores after summing the absolute Shapley value per feature across the specified subset, normalized as per Methods.
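
The calibration analysis in panel B bins predictions by confidence and compares each bin's accuracy with its confidence. The sketch below uses the commonly used definition of expected calibration error (the sample-weighted mean gap between accuracy and average confidence per bin) with five equal-width bins. The Methods section is not reproduced here, so whether the authors use exactly this formula is an assumption; the function name and example values are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Standard ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: top-prediction probabilities in [0, 1].
    correct: booleans, True where the top prediction matched the true type.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: two at 0.9 confidence (both right),
# two at 0.5 confidence (one right, one wrong).
confidences = [0.9, 0.9, 0.5, 0.5]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
```

A perfectly calibrated model would have an ECE of zero; in this toy case the only gap comes from the 0.9 bin, where accuracy (1.0) exceeds confidence.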
Figure 3. GDD-ENS performance on excluded cancer samples. A, Empirical cumulative distribution function comparing prediction probabilities for the in-distribution test set and excluded samples (left). 72% of in-distribution test set samples are high confidence, compared with only 36% of excluded samples. The relative fraction of excluded samples among the entire discovery cohort indicates that the vast majority of solid tumors are in-distribution (right); conf., confidence. B, Row-normalized confusion matrix of the organ system of the true type for excluded samples vs. the organ system of the high-confidence GDD-ENS prediction. Colors represent the broad organ systems annotated; many predictions are conserved within the correct organ system. Rows and columns are ordered by the total number of excluded type samples within each organ system (right), and the number of conserved predictions for each cancer type (top). GDD-ENS predictions correspond to types from the same organ system for 290/440 excluded samples with specific organ system annotations (66%).
Figure 4. Adaptable prior distribution enables the incorporation of nongenomic information for enhanced predictions. A, Proportion of all in-distribution discovery cohort samples that are either primary or metastatic (top) or have broad histologic annotations (middle, top) per cancer type. Underlying distributions of metastatic site (middle) and histology (bottom) for all in-distribution discovery cohort samples across 19 metastatic sites and 2 histologic subtypes. Heatmaps are row normalized but only show 10 types (full heatmap in Supplementary Fig. S11). Met., metastatic; Prim., primary; Adeno., adenocarcinoma; SQC, squamous cell carcinoma. B, Overview of adaptable prior methodology. C, Flow of results for the combination prior using both metastatic site and histology for all test set examples. Arrow base represents the preadjustment category; arrowhead represents the post-adjustment category. Circle arrows indicate the number of samples that did not change categories after adjustment, e.g., the 4,618 samples that were correct and high confidence before and after applying the prior. D, Walkthrough of a patient with head and neck squamous cell cancer initially predicted as bladder cancer by GDD-ENS with 0.85 confidence. After applying priors specific to the site of metastasis and annotated histology, the sample was correctly predicted with high confidence (0.96). Overview and flow diagram created with BioRender (https://biorender.com/).
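
The caption does not spell out how the adaptable prior is applied. One generic way to fold nongenomic context into a classifier's output is a Bayesian prior-shift adjustment: divide each class probability by the class frequency assumed during training, multiply by the frequency expected given the clinical context, and renormalize. The sketch below illustrates that generic technique only; it is not presented as the authors' method, and the `apply_prior` helper and all numbers are hypothetical.

```python
from typing import Dict


def apply_prior(
    model_probs: Dict[str, float],
    train_prior: Dict[str, float],
    context_prior: Dict[str, float],
) -> Dict[str, float]:
    """Re-weight class probabilities from a training prior to a context-specific prior.

    Assumption: posterior is proportional to
    model_prob * context_prior / train_prior, then renormalized to sum to 1.
    """
    unnormalized = {
        tumor_type: prob * context_prior[tumor_type] / train_prior[tumor_type]
        for tumor_type, prob in model_probs.items()
    }
    total = sum(unnormalized.values())
    return {tumor_type: value / total for tumor_type, value in unnormalized.items()}


# Hypothetical two-class example (values are illustrative, not from the paper).
model_probs = {"Bladder": 0.6, "Head and Neck": 0.4}
train_prior = {"Bladder": 0.5, "Head and Neck": 0.5}
# Hypothetical prior given, e.g., metastatic site and squamous histology.
context_prior = {"Bladder": 0.2, "Head and Neck": 0.8}

adjusted = apply_prior(model_probs, train_prior, context_prior)
top_type = max(adjusted, key=adjusted.get)
```

In this toy case the context prior is strong enough to flip the top prediction, which mirrors the kind of category change tracked in panel C.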
Figure 5. GDD-ENS predictions on CUP patients can identify targetable alterations. A, Distribution of GDD-ENS predictions for all 1,441 CUP patients. B, Bar plot indicating the total number of patients with targetable alterations after GDD-ENS predictions (bottom axis). Only 550 CUP patients had potentially actionable alterations at level 3B or higher before GDD-ENS predictions; the top axis indicates the overall proportion of these patients with identified alterations. C, Most frequently identified targetable alterations among high-confidence GDD-ENS predictions only. Targetable alterations representing a specific allele change or structural variant are annotated as such; otherwise, counts represent a combination of broad oncogenic mutations across multiple alteration types and allele changes.
