Cancer Discov. 2024 Jun 3;14(6):1064-1081.

doi: 10.1158/2159-8290.CD-23-0996.

Deep-Learning Model for Tumor-Type Prediction Using Targeted Clinical Genomic Sequencing Data

Madison Darmofal¹,², Shalabh Suman³, Gurnit Atwal⁴,⁵,⁶, Michael Toomey¹,², Jie-Fu Chen³, Jason C Chang³, Efsevia Vakiani³, Anna M Varghese⁷, Anoop Balakrishnan Rema³, Aijazuddin Syed³, Nikolaus Schultz⁸,⁹,¹⁰, Michael F Berger^#³,⁸,⁹, Quaid Morris^#¹

Affiliations

¹ Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York.
² Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, New York.
³ Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York.
⁴ Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada.
⁵ Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada.
⁶ Vector Institute, Toronto, Ontario, Canada.
⁷ Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York.
⁸ Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, New York.
⁹ Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, New York.
¹⁰ Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York.

^# Contributed equally.

PMID: 38416134
PMCID: PMC11145170
DOI: 10.1158/2159-8290.CD-23-0996

Madison Darmofal et al. Cancer Discov. 2024.

Abstract

Tumor type guides clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly diagnostic of tumor type, and tumor-type classifiers trained on genomic features have been explored, but the most accurate methods are not clinically feasible, relying on features derived from whole-genome sequencing (WGS), or predicting across limited cancer types. We use genomic features from a data set of 39,787 solid tumors sequenced using a clinically targeted cancer gene panel to develop Genome-Derived-Diagnosis Ensemble (GDD-ENS): a hyperparameter ensemble for classifying tumor type using deep neural networks. GDD-ENS achieves 93% accuracy for high-confidence predictions across 38 cancer types, rivaling the performance of WGS-based methods. GDD-ENS can also guide diagnoses of rare types and cancers of unknown primary and incorporate patient-specific clinical information for improved predictions. Overall, integrating GDD-ENS into prospective clinical sequencing workflows could provide clinically relevant tumor-type predictions to guide treatment decisions in real time.

Significance: We describe a highly accurate tumor-type prediction model, designed specifically for clinical implementation. Our model relies only on widely used cancer gene panel sequencing data, predicts across 38 distinct cancer types, and supports integration of patient-specific nongenomic information for enhanced decision support in challenging diagnostic situations. See related commentary by Garg, p. 906. This article is featured in Selected Articles from This Issue, p. 897.
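
As an illustration of how an ensemble of neural-network classifiers can yield a ranked list of tumor types with confidence values, the following minimal Python sketch averages the softmax outputs of several ensemble members and returns the most probable classes. Averaging is an assumption made here for illustration; the abstract does not state how GDD-ENS combines its members. The class names, logits, and the function names `softmax` and `ensemble_top_k` are all hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_top_k(member_logits, class_names, k=3):
    """Average member probabilities for one sample and return the top-k classes.

    member_logits has shape (n_members, n_classes).
    Averaging is an assumed combination rule, not taken from the paper.
    """
    probs = softmax(np.asarray(member_logits, dtype=float)).mean(axis=0)
    order = np.argsort(probs)[::-1][:k]  # indices by descending probability
    return [(class_names[i], float(probs[i])) for i in order]

# Toy input: three hypothetical ensemble members, four hypothetical classes.
logits = np.array([
    [2.0, 0.5, 0.1, -1.0],
    [1.5, 1.0, 0.0, -0.5],
    [2.5, 0.2, 0.3, -1.5],
])
names = ["Type A", "Type B", "Type C", "Type D"]
top3 = ensemble_top_k(logits, names, k=3)
for name, confidence in top3:
    print(f"{name}: {confidence:.3f}")
```

The averaged probability of the first-ranked class can then serve as a confidence estimate that is compared against a chosen high-confidence cutoff.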

Figures

Figure 1. Overview of GDD-ENS model. A, Cohort diagram, detailing samples used to form a training and testing cohort. Hematologic refers to blood-based cancers (e.g., leukemias) sequenced using MSK-IMPACT before the development of a separate nonsolid tumor assay. B, Training set distribution of cancer types included after expanding GDD-RF to GDD-ENS, colored by model inclusion. Any type with fewer than 350 examples is upsampled via replacement during training. NSCLC, non–small cell lung cancer; GIST, gastrointestinal stromal tumor; SQC, squamous cell carcinoma; SCLC, small cell lung cancer; PNET, pancreatic neuroendocrine tumor; Lu-NET, lung neuroendocrine tumor; GI-NET, gastrointestinal neuroendocrine tumor; Carc., carcinoma; MPNST, malignant peripheral nerve sheath tumor. C, Distribution of informative feature types. “Other” refers to clinical features or numerical features representing overall mutational burden across categories. Allele-specific hotspot features annotate the specific hotspot mutation (e.g., *KRAS* G12C), whereas gene-specific hotspot features only specify the gene altered (*KRAS* hotspot). CNAs, copy-number alterations. D, GDD-ENS workflow from patient to output. All nonclinical features are derived from the MSK-IMPACT sequencing assay and then fed into GDD-ENS. GDD-ENS reports the top three tumor-type predictions with confidence estimates, along with the 10 most important features for the top prediction on a patient-specific basis. Workflow overview created with BioRender (https://www.biorender.com/).
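
The upsampling step mentioned in panel B can be sketched as follows. This is a minimal illustration that assumes rare types are topped up to exactly 350 training examples by drawing additional rows with replacement; the caption gives the 350 threshold but not the target size, so that target, the function name `upsample_rare_classes`, and the toy labels are assumptions.

```python
import numpy as np

def upsample_rare_classes(labels, min_count=350, seed=0):
    """Return row indices in which every class appears at least min_count times.

    Classes below min_count keep all original rows and gain extra rows
    drawn with replacement (assumed target: exactly min_count rows).
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    selected = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if idx.size < min_count:
            extra = rng.choice(idx, size=min_count - idx.size, replace=True)
            idx = np.concatenate([idx, extra])
        selected.append(idx)
    return np.concatenate(selected)

# Toy labels: one common and one rare hypothetical tumor type.
labels = np.array(["common"] * 1000 + ["rare"] * 40)
train_idx = upsample_rare_classes(labels)
n_rare = int((labels[train_idx] == "rare").sum())
n_common = int((labels[train_idx] == "common").sum())
print(n_rare, n_common)
```

Upsampling should be applied only to the training split so that duplicated rows never appear in the held-out test set.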

Figure 2. GDD-ENS performance across cancer types. A, Row-normalized confusion matrix of high-confidence predictions across cancer types. Off-diagonal values correspond to the proportion of the row (predicted cancer type) that belongs to the column (true type). The off-diagonal value for true prostate cancer predicted as ependymoma reflects a single prostate cancer sample that was predicted as ependymoma, likely due to low tumor burden, as it just passed the inclusion thresholds for tumor content. Due to rounding, some rows may not sum to 100. Top bar graphs indicate test set count (top) and type-specific recall across confidence levels. The left bar graphs represent high-confidence positive predictive value (high conf. PPV, right) and type-specific precision. Sorted by overall type precision on both axes. B, Calibration plot for GDD-Single models and the GDD-ENS ensemble. The X-axis represents maximum confidence after binning outputs into five equally sized ranges, and the Y-axis represents the overall accuracy of all predictions within that range. The blue line represents the GDD-ENS model; dark gray line and shaded regions represent mean GDD-Single accuracy and 95% confidence interval, respectively. Expected calibration error (ECE) calculated as per Methods. C, Shapley value score distributions for correct GDD-ENS predictions across individual types (left, middle) and all predictions (right). Importance score proportion represents the proportion of total Shapley value scores after summing the absolute Shapley value per feature across the specified subset, normalized as per Methods.
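
The calibration analysis in panel B can be illustrated with the commonly used definition of expected calibration error: the sample-weighted average, over confidence bins, of the absolute gap between accuracy and mean confidence. The sketch below uses five equal-width bins to match the caption. The exact ECE formula used by the authors is given in their Methods, which are not reproduced here, so this standard definition is an assumption, as are the function name and toy inputs.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=5):
    """Standard ECE with equal-width confidence bins on [0, 1].

    confidence: maximum predicted probability per sample.
    correct:    1 if the top prediction was right, else 0.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins: a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(confidence, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Toy case: four predictions at 0.9 confidence, three of them correct.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Toy case: two predictions at 0.5 confidence, one correct (perfectly calibrated).
ece_calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
print(ece_overconfident, ece_calibrated)
```

A lower value indicates that reported confidence tracks observed accuracy more closely.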

Figure 3. GDD-ENS performance on excluded cancer samples. A, Empirical cumulative distribution function comparing the probability of in-distribution test set and excluded samples (left). 72% of the in-distribution test set are high confidence, compared with only 36% of excluded samples. The relative fraction of excluded samples among the entire discovery cohort indicates the vast majority of solid tumors are in-distribution (right); conf., confidence. B, Row-normalized confusion matrix of the organ system of the true type for excluded samples vs. the high-confidence GDD-ENS prediction organ system. Colors represent the broad organ systems annotated; many predictions are conserved within correct organ systems. Rows and columns are ordered by the total number of excluded type samples within each organ system (right), and the number of conserved predictions for each cancer type (top). GDD-ENS predictions correspond to types from the same organ system for 290/440 excluded samples with specific organ system annotations (66%).
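
A row-normalized confusion matrix and the same-organ-system agreement rate described in panel B can be computed along the following lines. This is a generic sketch with hypothetical organ-system labels and a hypothetical function name; it does not use the study data. The only figure taken from the caption is the reported agreement of 290/440, which rounds to 66%.

```python
import numpy as np

def row_normalized_confusion(row_labels, col_labels, categories):
    """Count (row, column) label pairs and normalize each row to sum to 1."""
    index = {c: i for i, c in enumerate(categories)}
    counts = np.zeros((len(categories), len(categories)))
    for r, c in zip(row_labels, col_labels):
        counts[index[r], index[c]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no samples stay all-zero instead of dividing by zero.
    normalized = np.divide(counts, row_sums,
                           out=np.zeros_like(counts), where=row_sums > 0)
    return counts, normalized

# Toy data: hypothetical organ systems for five samples.
rows = ["GI", "GI", "GI", "Thoracic", "Thoracic"]
cols = ["GI", "GI", "Thoracic", "Thoracic", "Thoracic"]
counts, normalized = row_normalized_confusion(rows, cols, ["GI", "Thoracic"])

# Agreement is the share of samples on the diagonal.
agreement = np.trace(counts) / counts.sum()
print(normalized)
print(agreement)

# Agreement rate reported in the caption: 290 of 440 samples.
reported = 290 / 440
print(round(reported, 2))
```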

Figure 4. Adaptable prior distribution enables the incorporation of nongenomic information for enhanced predictions. A, Proportion of all in-distribution discovery cohort samples that are either primary or metastatic (top) or have broad histologic annotations (middle, top) per cancer type. Underlying distributions of metastatic site (middle) and histology (bottom) for all in-distribution discovery cohort samples across 19 metastatic sites and 2 histologic subtypes. Heatmaps are row normalized but only show 10 types (full heatmap in Supplementary Fig. S11). Met., metastatic; Prim., primary; Adeno., adenocarcinoma; SQC, squamous cell carcinoma. B, Overview of adaptable prior methodology. C, Flow of results for combination prior using both metastatic site and histology for all test set examples. Arrow base represents preadjustment category; arrowhead represents post-adjustment. Circular arrows indicate the number of samples that did not change categories after adjustment, e.g., the 4,618 samples that were correct and high confidence before and after applying the prior. D, Walkthrough of a patient with head and neck squamous cell cancer predicted as bladder cancer by GDD-ENS with 0.85 confidence. After applying priors specific to the site of metastasis and annotated histology, the sample was correctly predicted with high confidence (0.96). Overview and flow diagram created with BioRender (https://biorender.com/).
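
One way to picture how a prior over tumor types could shift a prediction is a Bayes-style reweighting: multiply the model's class probabilities by the ratio of a context-specific prior to the prior implied by training, then renormalize. The sketch below assumes this form purely for illustration. The authors' actual methodology is summarized in panel B and their Methods and may differ; the function name, the three classes, and all numbers are hypothetical and are not the values from panel D.

```python
import numpy as np

def apply_prior(model_probs, new_prior, training_prior=None):
    """Reweight class probabilities by new_prior / training_prior and renormalize.

    If training_prior is omitted, a uniform training prior is assumed.
    """
    model_probs = np.asarray(model_probs, dtype=float)
    new_prior = np.asarray(new_prior, dtype=float)
    if training_prior is None:
        training_prior = np.full(model_probs.shape, 1.0 / model_probs.size)
    else:
        training_prior = np.asarray(training_prior, dtype=float)
    weighted = model_probs * new_prior / training_prior
    return weighted / weighted.sum()

types = ["Bladder", "Head and neck", "Other"]
model_probs = [0.85, 0.10, 0.05]  # hypothetical model output
site_histology_prior = [0.02, 0.90, 0.08]  # hypothetical prior given clinical context

adjusted = apply_prior(model_probs, site_histology_prior)
best = types[int(np.argmax(adjusted))]
print(best, adjusted.round(3))
```

In this toy case the strong clinical prior moves the top prediction from the first class to the second, mirroring the kind of correction the walkthrough in panel D describes.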

Figure 5. GDD-ENS predictions on CUP patients can identify targetable alterations. A, Distribution of GDD-ENS predictions for all 1,441 CUP patients. B, Bar plot indicating the total number of patients with targetable alterations after GDD-ENS predictions (bottom axis). Only 550 CUP patients had potentially actionable alterations at level 3B or higher before GDD-ENS predictions; the top axis indicates the overall proportion of these patients with identified alterations. C, Most frequently identified targetable alteration for only high-confidence GDD-ENS predictions. Targetable alterations representing a specific allele change or structural variant are annotated as such; otherwise, counts represent a combination of broad oncogenic mutations across multiple alteration types and allele changes.

Update of

Deep Learning Model for Tumor Type Prediction using Targeted Clinical Genomic Sequencing Data.
Darmofal M, Suman S, Atwal G, Chen JF, Chang JC, Toomey M, Vakiani E, Varghese AM, Rema AB, Syed A, Schultz N, Berger M, Morris Q. medRxiv [Preprint]. 2023 Sep 10:2023.09.08.23295131. doi: 10.1101/2023.09.08.23295131. Update in: Cancer Discov. 2024 Jun 3;14(6):1064-1081. doi: 10.1158/2159-8290.CD-23-0996. PMID: 37732244. Free PMC article. Preprint.
