A vision-language foundation model for precision oncology
- PMID: 39779851
- PMCID: PMC12295649
- DOI: 10.1038/s41586-024-08378-w
Abstract
Clinical decision-making is driven by multimodal data, including clinical notes and pathological characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise in advancing clinical care [1,2]. However, the scarcity of well-annotated multimodal datasets in clinical settings has hindered the development of useful models. In this study, we developed the Multimodal transformer with Unified maSKed modeling (MUSK), a vision-language foundation model designed to leverage large-scale, unlabelled, unpaired image and text data. MUSK was pretrained on 50 million pathology images from 11,577 patients and one billion pathology-related text tokens using unified masked modelling. It was further pretrained on one million pathology image-text pairs to efficiently align the vision and language features. With minimal or no further training, MUSK was tested in a wide range of applications and demonstrated superior performance across 23 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, image classification and molecular biomarker prediction. Furthermore, MUSK showed strong performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction and immunotherapy response prediction in lung and gastro-oesophageal cancers. MUSK effectively combined complementary information from pathology images and clinical reports and could potentially improve diagnosis and precision in cancer therapy.
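The core corruption step behind masked modelling (masking a random subset of tokens, whether image-patch tokens or text tokens, and training the model to reconstruct them) can be sketched as follows. This is a generic illustration, not MUSK's implementation: the function name, mask ratio and mask id are placeholders, and the actual pretraining uses a transformer to predict the masked tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_id, mask_ratio=0.15):
    """Randomly replace a fraction of token ids with a [MASK] id.

    Returns the corrupted sequence and the masked positions; a masked-modelling
    objective trains a model to predict the original ids at those positions.
    """
    tokens = np.asarray(tokens)
    n_mask = max(1, int(round(mask_ratio * tokens.size)))
    positions = rng.choice(tokens.size, size=n_mask, replace=False)
    corrupted = tokens.copy()
    corrupted[positions] = mask_id
    return corrupted, np.sort(positions)

# Toy token ids; in practice these come from an image or text tokenizer.
tokens = [5, 17, 3, 42, 8, 99, 21, 7]
corrupted, positions = mask_tokens(tokens, mask_id=-1, mask_ratio=0.25)
# The model's reconstruction loss is computed only at `positions`,
# comparing its predictions against the original `tokens` there.
```

Because the same objective applies to both modalities, unpaired image and text corpora can be used during this stage; the separate image-text alignment stage described above is what ties the two feature spaces together.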
© 2025. The Author(s), under exclusive licence to Springer Nature Limited.
Conflict of interest statement
Competing interests: A provisional patent related to this work has been filed by Stanford University (US patent application 63/724,237).
References
Main References
- Acosta JN, Falcone GJ, Rajpurkar P & Topol EJ. Multimodal biomedical AI. Nature Medicine 28, 1773–1784 (2022).
Method References
- Shazeer N et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. Preprint at arXiv:1701.06538 (2017).
- Bao H et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35, 32897–32912 (2022).
- Esser P et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning (2024).
- Sun Y et al. PathAsst: a generative foundation AI assistant towards artificial general intelligence of pathology. In AAAI Conference on Artificial Intelligence (2023).
- Li J, Li D, Xiong C & Hoi SCH. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (2022).