Nature. 2025 Feb;638(8051):769-778. doi: 10.1038/s41586-024-08378-w. Epub 2025 Jan 8.

A vision-language foundation model for precision oncology

Jinxi Xiang et al. Nature. 2025 Feb.

Abstract

Clinical decision-making is driven by multimodal data, including clinical notes and pathological characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise in advancing clinical care [1,2]. However, the scarcity of well-annotated multimodal datasets in clinical settings has hindered the development of useful models. In this study, we developed the Multimodal transformer with Unified maSKed modeling (MUSK), a vision-language foundation model designed to leverage large-scale, unlabelled, unpaired image and text data. MUSK was pretrained on 50 million pathology images from 11,577 patients and one billion pathology-related text tokens using unified masked modelling. It was further pretrained on one million pathology image-text pairs to efficiently align the vision and language features. With minimal or no further training, MUSK was tested in a wide range of applications and demonstrated superior performance across 23 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, image classification and molecular biomarker prediction. Furthermore, MUSK showed strong performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction and immunotherapy response prediction in lung and gastro-oesophageal cancers. MUSK effectively combined complementary information from pathology images and clinical reports and could potentially improve diagnosis and precision in cancer therapy.
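
To make the two-stage pretraining strategy summarised above more concrete, the short Python (PyTorch) sketch below illustrates the general idea only: a masked-token objective applied separately to unpaired image tokens and text tokens, followed by a contrastive objective that aligns pooled image and text features on paired data. All module names, sizes, the toy tokenisation and the specific loss choices are illustrative assumptions for this sketch and are not taken from the authors' MUSK implementation.

# Hypothetical sketch of the two-stage pretraining idea described in the abstract:
# (1) unified masked modelling on unpaired images and text, then
# (2) contrastive alignment on paired image-text data.
# All names, sizes and losses are assumptions, not the authors' MUSK code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB, N_PATCH_TOKENS = 256, 30522, 8192  # toy sizes (assumptions)

class ToyEncoder(nn.Module):
    """Small transformer trunk with a per-modality token embedding (assumption)."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, vocab_size)  # predicts masked tokens

    def forward(self, tokens):
        h = self.trunk(self.embed(tokens))
        return h, self.head(h)

image_enc = ToyEncoder(N_PATCH_TOKENS)  # image patches assumed pre-tokenised (assumption)
text_enc = ToyEncoder(VOCAB)

def masked_modelling_loss(encoder, tokens, mask_ratio=0.15):
    """Stage 1: predict randomly masked tokens from unpaired data."""
    mask = torch.rand(tokens.shape) < mask_ratio
    corrupted = tokens.clone()
    corrupted[mask] = 0  # token id 0 used as [MASK] here (assumption)
    _, logits = encoder(corrupted)
    return F.cross_entropy(logits[mask], tokens[mask])

def contrastive_alignment_loss(img_tokens, txt_tokens, temperature=0.07):
    """Stage 2: align pooled image and text features on paired data (CLIP-style loss, assumption)."""
    img_h, _ = image_enc(img_tokens)
    txt_h, _ = text_enc(txt_tokens)
    img_z = F.normalize(img_h.mean(dim=1), dim=-1)
    txt_z = F.normalize(txt_h.mean(dim=1), dim=-1)
    logits = img_z @ txt_z.t() / temperature
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    imgs = torch.randint(1, N_PATCH_TOKENS, (4, 16))  # 4 fake image-token sequences
    txts = torch.randint(1, VOCAB, (4, 16))           # 4 fake text-token sequences
    print("stage 1 (unpaired):", masked_modelling_loss(image_enc, imgs).item(),
          masked_modelling_loss(text_enc, txts).item())
    print("stage 2 (paired):  ", contrastive_alignment_loss(imgs, txts).item())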

Conflict of interest statement

Competing interests: A provisional patent related to this work has been filed by Stanford University (US patent application 63/724,237).

References

Main References

    1. Sammut S-J et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 601, 623–629 (2022).
    2. Vanguri RS et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L)1 blockade in patients with non-small cell lung cancer. Nature Cancer 3, 1151–1164 (2022).
    3. Acosta JN, Falcone GJ, Rajpurkar P & Topol EJ Multimodal biomedical AI. Nature Medicine 28, 1773–1784 (2022).
    4. Boehm KM, Khosravi P, Vanguri R, Gao J & Shah SP Harnessing multimodal data integration to advance precision oncology. Nature Reviews Cancer 22, 114–126 (2022).
    5. Lipkova J et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 40, 1095–1110 (2022).

Method References

    1. Shazeer N et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
    2. Bao H et al. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35, 32897–32912 (2022).
    3. Esser P et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning (2024).
    4. Sun Y et al. PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology. In AAAI Conference on Artificial Intelligence (2023).
    5. Li J, Li D, Xiong C & Hoi SCH BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (2022).