[Preprint]. 2025 Mar 16:2025.02.26.640189.
doi: 10.1101/2025.02.26.640189.

Path2Omics: Enhanced transcriptomic and methylation prediction accuracy from tumor histopathology

Danh-Tai Hoang et al. bioRxiv.

Abstract

Precision oncology is becoming increasingly integral to clinical practice, demonstrating notable improvements in treatment outcomes. While molecular data provide comprehensive insights, obtaining such data remains costly and time-consuming. To address this challenge, we developed Path2Omics, a deep learning model that predicts gene expression and methylation from histopathology for 23 cancer types. Path2Omics was trained on 20,497 slides (9,456 formalin-fixed and paraffin-embedded (FFPE) and 11,041 fresh frozen (FF)) from 8,007 patients across 23 cohorts of The Cancer Genome Atlas (TCGA). When tested on FFPE slides, the most readily available format in clinical pathology practice, the integrated model outperformed its individual FF and FFPE components, robustly predicting nearly 5,000 genes on average, approximately five times more than our recently published DeepPT model. Externally evaluated on seven independent cohorts, Path2Omics robustly predicted the expression of approximately 4,400 genes, yielding a 30% increase over the FFPE model alone. Finally, we demonstrate that the inferred gene expression is nearly as effective as the actual values in predicting patient survival and treatment response. These results lay the basis for using Path2Omics to advance precision oncology from histopathology slides in a speedy and cost-effective manner.

Conflict of interest statement

Competing interests: E.R. is a co-founder of Medaware, Metabomed and Pangea Biomed (divested from the latter). E.R. serves as a non-paid scientific consultant to Pangea Biomed under a collaboration agreement between Pangea Biomed and the NCI. E.R. also serves as a scientific advisory board member of GSK oncology. The other authors declare no competing interests.

Figures

Extended Data Fig. 1. The distribution of correlations between the predicted and actual gene expression values across the cohort samples.
The violin plots demonstrate the distribution of Pearson correlations between predicted and measured expression values across the cohort samples for each gene achieved by the Path2Omics-FFPE model on FFPE slides (gray), and the Path2Omics-FF model on FF slides (pink). In the violin plots, the central mark represents the median. The number of patients in each cohort is shown in parentheses.
Extended Data Fig. 2. The overlap among the well-predicted genes achieved by each model on external cohorts.
(a) The Venn diagrams demonstrate the overlap between the well-predicted genes achieved by the FFPE model and the FF model. (b) Similar Venn diagrams are shown for the genes that were well-predicted by both the FFPE model and the FF model (FFPE-FF Comm), and the genes that were well-predicted by the Integrated model. (c) Similar Venn diagrams are shown for the overlap between the genes that were well-predicted by either the FFPE model or the FF model (FFPE-FF Union) and the genes that were well-predicted by the Integrated model.
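The set comparisons behind these Venn diagrams can be sketched in a few lines of Python. This is only an illustration: the function name `overlap_summary` and the dictionary keys are hypothetical, and the gene identifiers in the usage note are placeholders rather than results from the study.

```python
def overlap_summary(ffpe_genes, ff_genes, integrated_genes):
    """Summarize overlaps among the well-predicted gene sets of three models."""
    ffpe, ff, integrated = set(ffpe_genes), set(ff_genes), set(integrated_genes)
    common = ffpe & ff   # "FFPE-FF Comm": well predicted by both models
    union = ffpe | ff    # "FFPE-FF Union": well predicted by either model
    return {
        # Panel (a): FFPE model versus FF model
        "ffpe_only": len(ffpe - ff),
        "ff_only": len(ff - ffpe),
        "ffpe_ff_common": len(common),
        # Panel (b): FFPE-FF Comm versus Integrated model
        "common_in_integrated": len(common & integrated),
        # Panel (c): FFPE-FF Union versus Integrated model
        "union_in_integrated": len(union & integrated),
        "integrated_only": len(integrated - union),
    }
```

For example, `overlap_summary({"G1", "G2", "G3"}, {"G2", "G3", "G4"}, {"G2", "G3", "G4", "G5"})` reports two genes shared by the FFPE and FF models and one gene recovered only by the integrated model.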
Extended Data Fig. 3. Gene set enrichment analysis identifying pathways associated with the well-predicted genes, achieved by the Path2Omics-integrated model on FFPE slides.
P-values were calculated using a one-sided permutation test for gene set enrichment analysis. Light blue bars denote significance (p-values < 0.05), gray bars denote non-significance.
Extended Data Fig. 4. Gene set enrichment analysis identifying pathways associated with the well-predicted genes, achieved by the Path2Omics-integrated model on FF slides.
P-values were calculated using a one-sided permutation test for gene set enrichment analysis. Light blue bars denote significance (p-values < 0.05), gray bars denote non-significance.
Extended Data Fig. 5. The distribution of correlations between the predicted and actual DNA methylation beta values across the cohort samples.
The violin plots demonstrate the distribution of Pearson correlations between predicted and measured DNA methylation beta values across the cohort samples for each CpG site achieved by the Path2Omics-FFPE model on FFPE slides (gray) and the Path2Omics-FF model on FF slides (pink). In the violin plots, the central mark represents the median. The number of patients in each cohort is shown in parentheses.
Extended Data Fig. 6. Path2Omics performance in predicting DNA methylation.
(a) The number of well-predicted CpG sites, defined as having a Pearson correlation between predicted and actual methylation beta values across the cohort samples above 0.4, achieved by the Path2Omics-FFPE model on FFPE slides (gray), and the Path2Omics-FF model on FF slides (pink). The first bars show the average across the 23 cohorts. The number of patients in each cohort is shown in parentheses. (b) The number of well-predicted sites achieved by the FFPE model (gray), the FF model (pink) and the Integrated model (light blue) when tested on FFPE slides across 23 TCGA cancer cohorts. The last three columns show the average across the 23 cohorts. (c) Similar plot to (b), but when tested on FF slides.
Extended Data Fig. 7. Overlap between patients in high-risk groups assigned by models using predicted (“Predicted GE”, gray) and actual (“Actual GE”, pink) gene expression.
Extended Data Fig. 8. Model performance in predicting patient survival based on the inferred and measured methylation.
Kaplan-Meier curves were generated from the model using predicted methylation and patient demographics (sex, age) (“Predicted MT”, left panels) and compared with those from the model using actual methylation and patient demographics (“Actual MT”, right panels) across 12 cancer cohorts. In each cohort, patients were stratified into high-risk (red) and low-risk (blue) groups based on the median risk score generated by each model. P-values were calculated using a two-sided log-rank test.
Extended Data Fig. 9. Overlap between patients in high-risk groups identified by models using predicted (“Predicted MT”, gray) and actual (“Actual MT”, pink) methylation.
Fig. 1. Overview of the computational workflow.
(a) Path2Omics consists of two components: an FFPE model and an FF model. Both models share the same architecture and comprise three main units: image pre-processing, feature extraction, and regression. In the image pre-processing step, whole slide images were divided into tiles, followed by color normalization. In the feature extraction unit, the CTransPath digital pathology foundational model was used to encode each tile image into a 768-dimensional feature vector. Finally, in the regression unit, a multi-layer perceptron (MLP) was employed to process the extracted features and predict gene expression or methylation values. Each component was trained and cross-validated using the TCGA dataset, independently for each cancer type. A total of 20,497 slides (9,456 FFPE and 11,041 FF) and their matched gene expression and methylation profiles from 8,007 patients across 23 TCGA cancer types were used for training. (b) For external validation, we analyzed 1,323 slides (1,163 FFPE and 160 FF) from seven datasets obtained from three independent resources: in-house (NCI), CPTAC, and TransNeo. (c) To assess the clinical application of the inferred gene expression, we built models to predict patient survival and treatment response based on the predicted gene expressions.
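The regression unit described in panel (a) can be illustrated with a minimal pure-Python sketch. Only the 768-dimensional tile features and the use of an MLP come from the caption. The mean pooling of tile features into one slide vector, the single hidden layer, the layer sizes, the random weights, and all function names are assumptions made for illustration and should not be read as the authors' implementation.

```python
import random

FEATURE_DIM = 768  # size of each CTransPath tile feature vector, per the caption


def mean_pool(tile_features):
    """Average tile vectors into one slide-level vector (assumed aggregation)."""
    n_tiles = len(tile_features)
    return [sum(column) / n_tiles for column in zip(*tile_features)]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def mlp_predict(slide_vector, w1, b1, w2, b2):
    """One hidden layer with ReLU, then a linear output per gene or CpG site."""
    return linear(relu(linear(slide_vector, w1, b1)), w2, b2)


def random_layer(n_out, n_in, rng):
    """Untrained placeholder weights; a real model would learn these."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out
```

Given a list of tile vectors, `mlp_predict(mean_pool(tiles), w1, b1, w2, b2)` returns one predicted value per output unit, which here would stand for one gene or one CpG site.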
Fig. 2. Path2Omics performance in predicting gene expression.
(a) The number of well-predicted genes, defined as those with a Pearson correlation above 0.4 between predicted and actual expression values across the cohort samples, achieved by the Path2Omics-FFPE model on FFPE slides (gray), and the Path2Omics-FF model on FF slides (pink). The first bars represent the average across the 23 cohorts. The number of patients in each cohort is shown in parentheses. (b) The number of well-predicted genes achieved by the FFPE model (gray), FF model (pink) and Integrated model (light blue) when tested on FFPE slides across 23 TCGA cancer cohorts. The first bars represent the average across the 23 cohorts. (c) Similar plot to (b), but when tested on FF slides. (d) Benchmarking against the state-of-the-art approach, DeepPT. The number of well-predicted genes achieved by the DeepPT (orange), Path2Omics-FFPE model (gray), Path2Omics-FF model (pink) and Path2Omics-Integrated model (light blue) on FFPE slides from the 15 cancer types analyzed by both DeepPT and Path2Omics. (e) Similar plot to (d), but when tested on 7 external datasets, including FFPE and FF slides.
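The "well-predicted" criterion in panel (a) can be written out as a short sketch: for each gene, compute the Pearson correlation between predicted and actual values across the cohort samples and keep the genes above 0.4. The function names and the dictionary-of-lists data layout are hypothetical, and returning 0.0 for a constant vector is an assumption of this sketch rather than something stated in the source.

```python
from math import sqrt

THRESHOLD = 0.4  # Pearson cutoff stated in the caption


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return cov / sqrt(var_x * var_y)


def well_predicted_genes(predicted, actual, threshold=THRESHOLD):
    """Return genes whose across-sample correlation exceeds the threshold.

    `predicted` and `actual` map gene name -> list of per-sample values,
    with samples in the same order in both dictionaries.
    """
    return [
        gene
        for gene in predicted
        if pearson(predicted[gene], actual[gene]) > threshold
    ]
```

The same counting applies unchanged to CpG sites if the dictionaries hold methylation beta values instead of expression values.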
Fig. 3. Cancer pathway enrichment analysis and the correlation between model performance in predicting gene expression and methylation.
The number of cancer types in which each cancer hallmark was enriched with well-predicted genes, as achieved by running the Path2Omics-integrated models on either FFPE slides (a) or FF slides (b). The association between the number of well-predicted genes achieved by the Path2Omics-integrated model in predicting gene expression (GE models, x-axis) and in predicting methylation (MT models, y-axis) across 23 TCGA cancer types, when tested on FFPE slides (c) and FF slides (d). Each data point represents one cancer type.
Fig. 4. Model performance in predicting patient survival based on inferred and measured gene expression.
(a) Kaplan-Meier curves generated by the model using predicted gene expression and patient demographics (sex, age) (“Predicted GE”, left panels) compared with those from the model using actual gene expression and patient demographics (“Actual GE”, right panels) across 12 cancer cohorts. In each cohort, patients were stratified into high-risk (red) and low-risk (blue) groups based on the median risk score generated by each model. P-values were calculated using a two-sided log-rank test. (b) Concordance index achieved by the “Predicted GE” model (light blue) compared with the “Actual GE” model (gray) and a control model using only patient sex and age (orange). The first bars represent the average C-index across the 12 cohorts. The number of patients in each cohort is indicated in parentheses.
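The two quantities used in this figure, the median-based risk split and the concordance index, can be sketched as follows. This is a plain implementation of Harrell's C-index for right-censored data, offered as an illustration only. The function names are hypothetical, and assigning scores exactly equal to the median to the low-risk group is an assumption, since the caption does not say how ties are handled.

```python
from statistics import median


def stratify_by_median(risk_scores):
    """Label each patient high-risk or low-risk relative to the median score."""
    cutoff = median(risk_scores)
    # Assumption: scores equal to the median fall in the low-risk group.
    return ["high" if score > cutoff else "low" for score in risk_scores]


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] is truthy) and a shorter follow-up time than patient j.
    It is concordant when patient i also has the higher risk score;
    tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the "Predicted GE", "Actual GE" and control models are compared in panel (b).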
Fig. 5. Model performance in predicting patient response to cancer therapy.
AUC (a), accuracy (b), and odds ratio (c) for the chemotherapy cohort (left column) and trastuzumab cohort (right column) are shown for three models: the “Direct” model that predicts response directly from pathology images (orange) without the intermediate step of predicting gene expression, the “Predicted GE” model that predicts response from predicted gene expression (dark blue), and the “Actual GE” model that predicts response from actual gene expression (gray).
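For readers unfamiliar with the three metrics, the sketch below computes them from binary response labels (1 = responder) and model scores. It is an illustration under stated assumptions: the function names are hypothetical, and the score threshold used to call a patient a predicted responder for accuracy and odds ratio is an explicit parameter here because the caption does not specify how it was chosen.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn), calling score >= threshold a predicted responder."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(labels, scores, threshold):
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return (tp + tn) / (tp + fp + tn + fn)


def odds_ratio(labels, scores, threshold):
    """Odds of response among predicted responders versus predicted non-responders."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    if fp == 0 or fn == 0:
        return float("inf")  # undefined ratio reported as infinity in this sketch
    return (tp * tn) / (fp * fn)
```

An odds ratio above 1 means that patients the model flags as responders are more likely to respond than those it does not flag.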
