Review

Vision-language foundation models for medical imaging: a review of current practices and innovations

Ji Seung Ryu et al. Biomed Eng Lett. 2025 Jun 6;15(5):809-830. doi: 10.1007/s13534-025-00484-6. eCollection 2025 Sep.

Abstract

Foundation models, including large language models and vision-language models (VLMs), have revolutionized artificial intelligence by enabling efficient, scalable, and multimodal learning across diverse applications. By leveraging advancements in self-supervised and semi-supervised learning, these models integrate computer vision and natural language processing to address complex tasks, such as disease classification, segmentation, cross-modal retrieval, and automated report generation. Their ability to pretrain on vast, uncurated datasets minimizes reliance on annotated data while improving generalization and adaptability for a wide range of downstream tasks. In the medical domain, foundation models address critical challenges by combining information from various medical imaging modalities with textual data from radiology reports and clinical notes. This integration has enabled the development of tools that streamline diagnostic workflows, enhance accuracy, and support robust decision-making. This review provides a systematic examination of recent advancements in medical VLMs from 2022 to 2024, focusing on modality-specific approaches and tailored applications in medical imaging. The key contributions include the creation of a structured taxonomy to categorize existing models, an in-depth analysis of datasets essential for training and evaluation, and a review of practical applications. This review also addresses ongoing challenges and proposes future directions for enhancing the accessibility and impact of foundation models in healthcare.

Supplementary information: The online version contains supplementary material available at 10.1007/s13534-025-00484-6.

Keywords: Deep learning; Foundation model; Medical imaging; Vision-language model.


Figures

Fig. 1
Distribution of foundation models in the medical field. The diagrams provide an analysis of the training datasets used in the reviewed studies. Each subfigure illustrates the distribution of a key aspect: a imaging modalities, b target classifications, c organs of focus, and d data sources. The total number of papers included in the analysis is 61.
Fig. 2
Organization of the review paper. The proposed taxonomy organizes foundation models in the medical field into two broad categories: specific-domain transfer applications, which include X-ray, CT, fundus imaging, MRI, and other medical imaging types, and multi-domain integrated applications, which combine insights across multiple imaging modalities.
Fig. 3
Detailed illustration of model architectures. a The encoder-based cross-modal alignment method employs separate encoders for images and text, aligning their embeddings across modalities to facilitate integration. b In encoder-based multi-modal attention, both image and text inputs are processed within a unified model, using the encoder alone to execute tasks. c Encoder–decoder-based multi-modal integration combines images and text as simultaneous joint inputs to the encoder and adopts a generative approach for decoding outputs. d In another encoder–decoder-based integration approach, text serves as a conditional prompt, directing the generation process through attention-based mechanisms.
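
The encoder-based cross-modal alignment of panel a is conceptually close to CLIP-style contrastive pretraining on paired medical images and report text. The following is a minimal, hypothetical PyTorch sketch of that idea only; the encoders, projection dimension, and symmetric contrastive loss are illustrative assumptions, not the architecture of any specific model surveyed in the review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Toy CLIP-style alignment of image and report embeddings (cf. Fig. 3a)."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder          # e.g. a ViT over scans (assumed)
        self.text_encoder = text_encoder            # e.g. a BERT over reports (assumed)
        self.image_proj = nn.LazyLinear(embed_dim)  # project each modality into a
        self.text_proj = nn.LazyLinear(embed_dim)   # shared embedding space
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature, log(1/0.07)

    def forward(self, images, text_tokens):
        # Separate encoders per modality, then L2-normalized projections.
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        # Similarity of every image to every report in the batch.
        logits = self.logit_scale.exp() * img @ txt.t()
        # Symmetric contrastive loss: matched image-report pairs lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

The remaining panels differ mainly in where fusion happens: instead of aligning two separate embedding spaces, panels b–d feed both modalities into a single encoder or encoder–decoder stack.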
