Nat Commun. 2025 Dec 12;16(1):11406. doi: 10.1038/s41467-025-66220-x

A multimodal knowledge-enhanced whole-slide pathology foundation model

Yingxue Xu et al.
Abstract

Computational pathology has advanced through foundation models, yet faces challenges in multimodal integration and in capturing whole-slide context. Current approaches typically use either vision-only or image-caption data, overlooking the distinct insights offered by pathology reports and gene expression profiles. Additionally, most models focus on patch-level analysis, failing to capture comprehensive whole-slide patterns. Here we present mSTAR (Multimodal Self-TAught PRetraining), a pathology foundation model that incorporates three modalities within a unified framework: pathology slides, expert-created reports, and gene expression data. Our dataset includes 26,169 slide-level modality pairs across 32 cancer types, comprising over 116 million patch images. This approach injects multimodal whole-slide context into patch representations, expanding modeling from a single modality to multiple modalities and from patch-level to slide-level analysis. Across an oncological benchmark spanning 97 tasks, mSTAR outperforms previous state-of-the-art models, particularly in molecular prediction and multimodal tasks, revealing that multimodal integration yields greater improvements than simply expanding vision-only datasets.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of the study.
a The clinical workflow for diagnosis, treatment, and prognosis in oncology, which primarily involves three common data modalities: WSIs, pathology reports, and gene expression profiles. b Overview of the mSTAR paradigm. mSTAR consists of two stages: (1) slide-level contrastive learning and (2) patch-level self-taught training. c–e Statistics of the data used in this study: (c) Venn diagram of cases across modalities, (d) the number of cases in the pretraining data across cancer types, and (e) the distribution of word counts for pathology reports. f The evaluation schemes in this study: held-out, independent, external, and zero-shot. The illustration is presented in Sec. ? g The distribution of datasets across task types for each evaluation scheme; detailed information about every dataset is presented in Supplementary Table 1. h The average performance spanning 97 tasks of 15 types across 7 categories of applications: Pathological Diagnosis, Molecular Prediction, Report Generation, Survival Prediction, Multimodal Fusion, Zero-shot Slide Classification, and Zero-shot Slide Retrieval. Zero-shot tasks, which require a well-aligned vision-language space, are evaluated for vision-language models only, i.e., PLIP, CONCH, and mSTAR. Source data are provided as a Source Data file and are presented in Supplementary Table 2 as well. This figure was created in BioRender. Zhou, Z. (https://BioRender.com/r035ixv).
Fig. 2
Fig. 2. The overview of mSTAR pipeline.
mSTAR is a whole-slide pretraining paradigm comprising two pretraining stages. a Stage 1 injects multimodal knowledge into a slide aggregator via slide-level contrastive learning among WSIs, pathology reports, and gene expression data. b Stage 2 seamlessly propagates the multimodal knowledge learned at the slide level into the patch extractor via self-taught training, which uses the slide aggregator pretrained in Stage 1 as the "Teacher" and trains the patch extractor as the "Student". This figure was created in BioRender. Zhou, Z. (https://BioRender.com/evctgc8).
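To make the two stages concrete, the following is a minimal PyTorch sketch of (1) a symmetric InfoNCE-style contrastive objective across the three slide-level modality embeddings and (2) a cosine-distillation objective from the frozen Stage-1 aggregator ("Teacher") to the patch extractor ("Student"). The loss forms, temperature, and function names are illustrative assumptions, not the authors' exact implementation.

    # Illustrative sketch only: the loss forms, temperature, and names
    # below are assumptions, not mSTAR's exact implementation.
    import torch
    import torch.nn.functional as F

    def pairwise_infonce(a: torch.Tensor, b: torch.Tensor,
                         tau: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE between two batches of embeddings, shape (B, D)."""
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau                        # (B, B) similarities
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def stage1_loss(slide_emb, report_emb, gene_emb):
        """Stage 1: align WSI, report, and gene-expression embeddings pairwise."""
        return (pairwise_infonce(slide_emb, report_emb)
                + pairwise_infonce(slide_emb, gene_emb)
                + pairwise_infonce(report_emb, gene_emb)) / 3.0

    def stage2_distill_loss(student_slide_emb, teacher_slide_emb):
        """Stage 2: the frozen Stage-1 aggregator acts as "Teacher"; the patch
        extractor ("Student") is trained so that aggregating its patch features
        reproduces the teacher's slide embedding (cosine distillation)."""
        s = F.normalize(student_slide_emb, dim=-1)
        t = F.normalize(teacher_slide_emb.detach(), dim=-1)  # no grad to teacher
        return (1.0 - (s * t).sum(dim=-1)).mean()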
Fig. 3
Fig. 3. Performance of pathological diagnosis on 21 datasets.
a The overall performance on pathological diagnosis. b The performance on 8 independent datasets. c The performance on 10 external datasets. The red lines and the values reported at the top of panels (a–c) refer to the average performance across datasets. Each point represents a dataset, with the size of the point indicating the standard deviation. d The performance on 3 held-out datasets. The minima and maxima of the boxes represent the minimum and maximum performance among the corresponding datasets, respectively. e Task distribution of pathological diagnosis across sites for each evaluation scheme. f The overall performance on pathological subtyping across 10 datasets. g The performance on 6 external pathological subtyping datasets. Error bars represent standard errors across datasets for all bar plots in (f, g). h, i Visualized validation of attention scores from mSTAR on the (h) CAMELYON and (i) PANDA datasets. The P-value for every group of experiments is computed via a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. * indicates P < 0.05, ** P < 0.01, and *** P < 0.001. Detailed performances for every dataset are presented in Supplementary Fig. 2 and Supplementary Table 7. Source data are provided as a Source Data file.
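The significance test named in this and the following captions is straightforward to reproduce. The sketch below applies SciPy's one-sided Wilcoxon signed-rank test to paired per-dataset scores of mSTAR versus the second-best foundation model; the score values are made up for illustration.

    # Paired, one-sided Wilcoxon signed-rank test as described in the caption.
    from scipy.stats import wilcoxon

    mstar     = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87]   # hypothetical per-dataset scores
    runner_up = [0.89, 0.86, 0.90, 0.84, 0.88, 0.86]

    # alternative="greater": H1 is that mSTAR's paired scores are higher
    stat, p = wilcoxon(mstar, runner_up, alternative="greater")
    print(f"W = {stat:.1f}, one-sided P = {p:.4f}")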
Fig. 4
Fig. 4. Performance of molecular prediction on 40 datasets across 10 cancer types.
a Overall performance of gene mutation prediction on 23 datasets. b Performance of mutation prediction on 18 held-out datasets. c Overall performance of immunohistochemistry (IHC) biomarker prediction on 10 datasets. d Performance of IHC biomarker prediction on 4 independent datasets. e Overall performance of molecular subtyping on 7 datasets. f Performance of molecular subtyping on 4 held-out datasets. In panels (b, d, f), the minima and maxima represent the minimum and maximum performance among the corresponding datasets, while the center and bounds of each box represent the mean performance and the 25th and 75th percentiles, respectively. The red lines and the values reported at the top of panels (a–f) refer to the average performance across datasets. Each point represents a dataset, with the size of the point indicating the standard deviation. g Positive and negative ratios for every gene mutation dataset, with high-frequency mutations highlighted in green and genes related to FDA-approved therapies highlighted in red. h–j Internal (In) vs. external (Ext) evaluation. h Performance of mutation prediction on 5 internal and 5 external datasets. i Performance of IHC biomarker prediction on 3 internal and 3 external datasets. j Performance of molecular subtyping on 3 internal and 3 external datasets. Error bars represent standard errors across datasets for all bar plots in (h–j). The P-value for every group of experiments is computed via a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. * indicates P < 0.05, ** P < 0.01, and *** P < 0.001. Detailed performances for every dataset spanning 10 cancer types are presented in Supplementary Fig. 3 and Supplementary Tables 8–11. Source data are provided as a Source Data file.
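The caption does not name the per-dataset metric, so the sketch below assumes an AUROC-style score for the binary mutation labels (consistent with the positive/negative ratios in panel g), macro-averaged across datasets as the red lines suggest. All data here are fabricated for illustration.

    # Hypothetical per-gene (labels, predicted probabilities); AUROC is an
    # assumed metric, computed per dataset and then macro-averaged.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    datasets = {
        "TP53": (np.array([1, 0, 1, 1, 0, 0]), np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.3])),
        "EGFR": (np.array([0, 0, 1, 0, 1, 1]), np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.7])),
    }
    aucs = {gene: roc_auc_score(y, p) for gene, (y, p) in datasets.items()}
    print(aucs, "macro-average:", np.mean(list(aucs.values())))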
Fig. 5
Fig. 5. Vision-language evaluation.
a The zero-shot evaluation scheme. For zero-shot classification, we used class prompts as the text input; for zero-shot retrieval, the text input is a pathology report. b Performance of zero-shot slide classification on 6 independent datasets. "Overall" refers to the average performance across these 6 datasets. Error bars represent 95% CIs with 1000 bootstrap replicates for all bar plots. The P-value is computed via a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. c Performance of zero-shot retrieval on an external dataset for image-to-text and text-to-image tasks. The results on the held-out TCGA dataset are presented for reference only, as a comparison for zero-shot capability. d Performance of report generation on one held-out TCGA dataset and two external datasets. The P-value for every group of experiments is computed via a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. Detailed performances for every dataset are presented in Supplementary Tables 15–17. Source data are provided as a Source Data file.
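The zero-shot classification protocol in panel a follows the usual vision-language recipe: embed one prompt per class with the text encoder, embed the slide, and predict the class whose prompt embedding is most similar. The sketch below assumes precomputed, hypothetical embeddings; the encoders and prompt templates are not specified here.

    # Minimal sketch of prompt-based zero-shot slide classification,
    # assuming slide and class-prompt embeddings are already computed.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_classify(slide_embs: torch.Tensor,
                           prompt_embs: torch.Tensor,
                           class_names: list[str]) -> list[str]:
        """slide_embs: (N, D) slide embeddings; prompt_embs: (C, D), one class
        prompt per class. Returns the highest-cosine-similarity class per slide."""
        sims = (F.normalize(slide_embs, dim=-1)
                @ F.normalize(prompt_embs, dim=-1).t())    # (N, C)
        return [class_names[i] for i in sims.argmax(dim=-1).tolist()]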
Fig. 6
Fig. 6. Performance of Survival Prediction on 16 datasets.
a Comparison of the C-Index between mSTAR and compared methods on 9 held-out datasets. b Comparison of the C-Index between mSTAR and compared methods on 4 external datasets. The red lines and the values reported at the top of panels (a, b) refer to the average performance across datasets. Each point represents a dataset, with the size of the point indicating the standard deviation. c Task distribution of survival endpoints for each evaluation scheme. d The performance (C-Index and 95% CI) on independent cohorts. "out" refers to partitions held out from the pretraining data; "idpt" refers to independent datasets whose data source differs from the pretraining data; "ext" refers to external datasets whose data originate from a source distinct from the training data used for fine-tuning and are used solely for testing, without any training involved. Error bars represent 95% CIs with 1000 bootstrap replicates for all bar plots. The P-value for every group of experiments is computed via a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. * indicates P < 0.05, ** P < 0.01, and *** P < 0.001. Detailed performances for every dataset are presented in Supplementary Table 18. Source data are provided as a Source Data file.
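A hedged sketch of the reported survival metric: the concordance index (C-Index) with a 95% CI from 1000 bootstrap replicates, as stated in the caption. This uses lifelines' concordance_index; note that lifelines treats higher predicted scores as longer survival, so risk scores are negated. Variable names and data handling are illustrative.

    # C-Index with a 95% bootstrap CI (1000 replicates), per the caption.
    import numpy as np
    from lifelines.utils import concordance_index

    def c_index_with_ci(times, events, risks, n_boot=1000, seed=0):
        """times: survival/censoring times; events: 1 if event observed;
        risks: predicted risk scores (higher = worse prognosis)."""
        times, events, risks = map(np.asarray, (times, events, risks))
        point = concordance_index(times, -risks, events)  # negate: see lead-in
        rng = np.random.default_rng(seed)
        boots = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, len(times), len(times))  # resample with replacement
            boots[i] = concordance_index(times[idx], -risks[idx], events[idx])
        lo, hi = np.percentile(boots, [2.5, 97.5])
        return point, (lo, hi)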
Fig. 7
Fig. 7. Multimodal fusion performance of overall survival prediction on pathological slides and gene expression data.
The patch extractors of all foundation models are evaluated with different multimodal fusion models (MCAT, Porpoise, MOTCat, and CMTA), trained from scratch across 9 TCGA held-out datasets. a Ranking performance of each FM on 9 datasets for every multimodal fusion model; "Overall" refers to the average results across these fusion methods. b The average C-Index on the 9 datasets. c Performance (C-Index and 95% CI) on each dataset. The minima and maxima represent the lower and upper bounds of the 95% CI, respectively; the center and bounds of each box represent the mean value and the 25th and 75th percentiles, respectively. The P-value is computed via a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. Legend colors are shared across all subfigures. * indicates P < 0.05, ** P < 0.01, and *** P < 0.001. Detailed performances for every dataset are presented in Supplementary Tables 19–23. Source data are provided as a Source Data file.
Fig. 8
Fig. 8. Ablation studies.
a Averaged performance on pathological diagnosis (3 datasets), molecular prediction (12 datasets), and survival prediction (9 datasets), where "Before" refers to before pretraining, and "P", "T", and "G" indicate pathology slides, pathology reports, and gene data, respectively. Error bars represent standard errors across datasets for all bar plots. b Visualization of the feature-space evolution from before pretraining (initial) to Stage 1 (pretrained aggregator) and Stage 2 (mSTAR); the areas in the red bounding box are multiple tumor regions (1–7) from case patient_042_node_3 of the CAMELYON17 dataset. Note that different tumor areas correspond to different spatial positions. c Averaged performance (9 TCGA OS datasets) when ablating the pretraining objectives (inter-modal loss and inter-cancer loss) for survival prediction (Supplementary Table 4). d Averaged performance (24 datasets) and resource comparisons between scaling slides only (Virchow) vs. scaling modalities (mSTAR) for pretraining, with UNI as a baseline. Detailed performances for every dataset are presented in Supplementary Fig. 8, and detailed comparisons are shown in Supplementary Tables 5–6. Source data are provided as a Source Data file.

