Review

Vision-language foundation models for medical imaging: a review of current practices and innovations

Ji Seung Ryu et al. Biomed Eng Lett. 2025 Jun 6;15(5):809-830. doi: 10.1007/s13534-025-00484-6. eCollection 2025 Sep.

Abstract

Foundation models, including large language models and vision-language models (VLMs), have revolutionized artificial intelligence by enabling efficient, scalable, and multimodal learning across diverse applications. By leveraging advancements in self-supervised and semi-supervised learning, these models integrate computer vision and natural language processing to address complex tasks, such as disease classification, segmentation, cross-modal retrieval, and automated report generation. Their ability to pretrain on vast, uncurated datasets minimizes reliance on annotated data while improving generalization and adaptability for a wide range of downstream tasks. In the medical domain, foundation models address critical challenges by combining information from various medical imaging modalities with textual data from radiology reports and clinical notes. This integration has enabled the development of tools that streamline diagnostic workflows, enhance accuracy, and support robust decision-making. This review provides a systematic examination of recent advancements in medical VLMs from 2022 to 2024, focusing on modality-specific approaches and tailored applications in medical imaging. The key contributions include the creation of a structured taxonomy to categorize existing models, an in-depth analysis of datasets essential for training and evaluation, and a review of practical applications. This review also addresses ongoing challenges and proposes future directions for enhancing the accessibility and impact of foundation models in healthcare.

Supplementary information: The online version contains supplementary material available at 10.1007/s13534-025-00484-6.

Keywords: Deep learning; Foundation model; Medical imaging; Vision-language model.


Figures

Fig. 1
Distribution of foundation models in the medical field. The diagrams provide an analysis of the training datasets used in the reviewed studies. Each subfigure illustrates the distribution of a key aspect: a imaging modalities, b target classifications, c organs of focus, and d data sources. The total number of papers included in the analysis is 61.
Fig. 2
Organization of the review paper. The proposed taxonomy organizes foundation models in the medical field into two broad categories: specific-domain transfer applications, which include X-ray, CT, fundus imaging, MRI, and other medical imaging types, and multi-domain integrated applications, which combine insights across multiple imaging modalities.
Fig. 3
Detailed illustration of model architectures. a The encoder-based cross-modal alignment method employs separate encoders for images and text, aligning their embeddings across modalities to facilitate integration. b In encoder-based multi-modal attention, both image and text inputs are processed within a unified model, using the encoder alone to execute tasks. c Encoder–decoder-based multi-modal integration combines images and text as simultaneous joint inputs to the encoder and adopts a generative approach for decoding outputs. d In another encoder–decoder-based integration approach, text serves as a conditional prompt, directing the generation process through attention-based mechanisms.
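
The encoder-based cross-modal alignment of panel a is conceptually close to CLIP-style contrastive pretraining on paired medical images and report text. The following is a minimal, hypothetical PyTorch sketch of that idea only; the encoders, projection dimension, and symmetric contrastive loss are illustrative assumptions, not the architecture of any specific model surveyed in the review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Toy CLIP-style alignment of image and report embeddings (cf. Fig. 3a)."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder          # e.g. a ViT over scans (assumed)
        self.text_encoder = text_encoder            # e.g. a BERT over reports (assumed)
        self.image_proj = nn.LazyLinear(embed_dim)  # project each modality into a
        self.text_proj = nn.LazyLinear(embed_dim)   # shared embedding space
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature, log(1/0.07)

    def forward(self, images, text_tokens):
        # Separate encoders per modality, then L2-normalized projections.
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        # Similarity of every image to every report in the batch.
        logits = self.logit_scale.exp() * img @ txt.t()
        # Symmetric contrastive loss: matched image-report pairs lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

The remaining panels differ mainly in where fusion happens: instead of aligning two separate embedding spaces, panels b–d feed both modalities into a single encoder or encoder–decoder stack.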
