A vision-language foundation model for precision oncology
- PMID: 39779851
- PMCID: PMC12295649
- DOI: 10.1038/s41586-024-08378-w
Abstract
Clinical decision-making is driven by multimodal data, including clinical notes and pathological characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise in advancing clinical care [1,2]. However, the scarcity of well-annotated multimodal datasets in clinical settings has hindered the development of useful models. In this study, we developed the Multimodal transformer with Unified maSKed modeling (MUSK), a vision-language foundation model designed to leverage large-scale, unlabelled, unpaired image and text data. MUSK was pretrained on 50 million pathology images from 11,577 patients and one billion pathology-related text tokens using unified masked modelling. It was further pretrained on one million pathology image-text pairs to efficiently align the vision and language features. With minimal or no further training, MUSK was tested in a wide range of applications and demonstrated superior performance across 23 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, image classification and molecular biomarker prediction. Furthermore, MUSK showed strong performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction and immunotherapy response prediction in lung and gastro-oesophageal cancers. MUSK effectively combined complementary information from pathology images and clinical reports and could potentially improve diagnosis and precision in cancer therapy.
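The core corruption step behind masked modelling (masking a random subset of tokens, whether image-patch tokens or text tokens, and training the model to reconstruct them) can be sketched as follows. This is a generic illustration, not MUSK's implementation: the function name, mask ratio and mask id are placeholders, and the actual pretraining uses a transformer to predict the masked tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_id, mask_ratio=0.15):
    """Randomly replace a fraction of token ids with a [MASK] id.

    Returns the corrupted sequence and the masked positions; a masked-modelling
    objective trains a model to predict the original ids at those positions.
    """
    tokens = np.asarray(tokens)
    n_mask = max(1, int(round(mask_ratio * tokens.size)))
    positions = rng.choice(tokens.size, size=n_mask, replace=False)
    corrupted = tokens.copy()
    corrupted[positions] = mask_id
    return corrupted, np.sort(positions)

# Toy token ids; in practice these come from an image or text tokenizer.
tokens = [5, 17, 3, 42, 8, 99, 21, 7]
corrupted, positions = mask_tokens(tokens, mask_id=-1, mask_ratio=0.25)
# The model's reconstruction loss is computed only at `positions`,
# comparing its predictions against the original `tokens` there.
```

Because the same objective applies to both modalities, unpaired image and text corpora can be used during this stage; the separate image-text alignment stage described above is what ties the two feature spaces together.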
© 2025. The Author(s), under exclusive licence to Springer Nature Limited.
Conflict of interest statement
Competing interests: A provisional patent related to this work has been filed by Stanford University (US patent application 63/724,237).
References
Main References
- Acosta JN, Falcone GJ, Rajpurkar P & Topol EJ. Multimodal biomedical AI. Nature Medicine 28, 1773–1784 (2022).
Method References
- Shazeer N et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. Preprint at arXiv:1701.06538 (2017).
- Bao H et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35, 32897–32912 (2022).
- Esser P et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning (2024).
- Sun Y et al. PathAsst: a generative foundation AI assistant towards artificial general intelligence of pathology. In AAAI Conference on Artificial Intelligence (2023).
- Li J, Li D, Xiong C & Hoi SCH. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (2022).