NPJ Digit Med. 2025 Jun 21;8(1):381. doi: 10.1038/s41746-025-01772-2.

A multimodal visual-language foundation model for computational ophthalmology


Danli Shi et al.

Abstract

Early detection of eye diseases is vital for preventing vision loss. Existing ophthalmic artificial intelligence models focus on single modalities, overlooking multi-view information and struggling with rare diseases due to long-tail distributions. We propose EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images spanning 11 modalities, a subset of which is paired with clinical text. Our novel pretraining strategy combines self-supervised reconstruction, multimodal image contrastive learning, and image-text contrastive learning to capture shared representations across modalities. EyeCLIP demonstrates robust performance across 14 benchmark datasets, excelling in disease classification, visual question answering, and cross-modal retrieval. It also exhibits strong few-shot and zero-shot capabilities, enabling accurate predictions in real-world, long-tail scenarios. EyeCLIP offers significant potential for detecting both ocular and systemic diseases and for bridging gaps in real-world clinical applications.
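For intuition, the pretraining objective described above can be read as a weighted sum of three terms. The following is a minimal sketch, assuming an MAE-style pixel reconstruction loss and symmetric InfoNCE contrastive terms; the loss weights, temperature, and tensor interfaces are illustrative placeholders rather than the paper's exact formulation.

```python
# Minimal sketch of a three-part pretraining objective in the spirit of EyeCLIP:
# self-supervised reconstruction + multi-examination (image-image) contrastive
# learning + image-text contrastive learning. Weights and temperature are
# illustrative assumptions, not the published hyperparameters.
import torch
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def pretraining_loss(pixels, recon, exam_a_emb, exam_b_emb, img_emb, txt_emb,
                     w_recon=1.0, w_img=1.0, w_txt=1.0):
    """Weighted sum of reconstruction and the two contrastive objectives (weights are placeholders)."""
    loss_recon = F.mse_loss(recon, pixels)           # self-supervised reconstruction
    loss_img = info_nce(exam_a_emb, exam_b_emb)      # two examinations of the same patient
    loss_txt = info_nce(img_emb, txt_emb)            # image paired with report text
    return w_recon * loss_recon + w_img * loss_img + w_txt * loss_txt
```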


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Study diagram.
a Using an extensive multimodal database spanning nine provinces in China, we matched multi-examination images from the same patient and cleaned the medical reports with a keyword mapping dictionary of medical terminology to generate hierarchical keyword text labels. b EyeCLIP was pretrained using self-supervised reconstruction, multi-examination contrastive learning, and hierarchical text-image contrastive learning to fully leverage real-world multi-examination clinical data. c Downstream multi-country datasets for EyeCLIP validation, covering zero-shot, few-shot, and supervised finetuning scenarios. d Radar plot outlining the performance of EyeCLIP and baseline models across downstream tasks. EyeCLIP significantly outperforms the baseline models across diverse tasks, including zero-shot classification, multimodal retrieval, visual question answering (VQA), and supervised systemic disease prediction.
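The keyword-mapping step in panel a can be pictured as a dictionary lookup over report text. A minimal sketch follows; the dictionary entries and the two-level (category, finding) hierarchy are hypothetical examples, not the paper's actual terminology resource.

```python
# Hypothetical sketch of mapping free-text reports to hierarchical keyword labels.
# The terminology dictionary and hierarchy below are invented for illustration.
KEYWORD_MAP = {
    "microaneurysm": ("retina", "diabetic retinopathy"),
    "drusen": ("retina", "age-related macular degeneration"),
    "cup-to-disc": ("optic nerve", "glaucoma suspect"),
}


def extract_hierarchical_labels(report: str) -> list[tuple[str, str]]:
    """Return (category, finding) pairs whose keyword appears in the report."""
    text = report.lower()
    return [label for keyword, label in KEYWORD_MAP.items() if keyword in text]


print(extract_hierarchical_labels("Scattered microaneurysms and macular drusen noted."))
# [('retina', 'diabetic retinopathy'), ('retina', 'age-related macular degeneration')]
```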
Fig. 2. Zero-shot performance on downstream ocular diseases datasets.
a AUROC. b AUPR. Error bars represent 95% confidence intervals, and the centers correspond to the computed value of each metric. EyeCLIP achieved significantly better zero-shot performance than the other models on both AUROC and AUPR. AUROC = area under the receiver operating characteristic curve, AUPR = area under the precision-recall curve. EyeCLIP outperforms the second-best model, FLAIR, a pretrained vision-language model for universal retinal fundus image understanding. Notably, FLAIR was pretrained on public datasets, with its performance evaluated through internal validation. In contrast, EyeCLIP, which was not trained on these public datasets, demonstrated its performance through external validation, highlighting its strong generalizability.
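Zero-shot classification of this kind is typically run CLIP-style: class names become text prompts, and each image is scored by its similarity to the prompt embeddings. The sketch below assumes generic encode_image/encode_text methods, a tokenizer, and a simple prompt template, none of which are EyeCLIP's published interface; metrics are computed with scikit-learn.

```python
# Sketch of CLIP-style zero-shot scoring and the AUROC/AUPR metrics reported in Fig. 2.
# encode_image / encode_text, the tokenizer, and the prompt template are assumptions.
import torch
import torch.nn.functional as F
from sklearn.metrics import average_precision_score, roc_auc_score


@torch.no_grad()
def zero_shot_scores(model, images, class_names, tokenizer):
    prompts = [f"a fundus photograph of {name}" for name in class_names]  # hypothetical template
    txt = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
    img = F.normalize(model.encode_image(images), dim=-1)
    return (img @ txt.t()).softmax(dim=-1)            # (N, num_classes) class probabilities


def zero_shot_metrics(scores: torch.Tensor, labels_onehot):
    """Macro-averaged AUROC and AUPR over classes, as in Fig. 2a-b."""
    y_score = scores.cpu().numpy()
    auroc = roc_auc_score(labels_onehot, y_score, average="macro")
    aupr = average_precision_score(labels_onehot, y_score, average="macro")
    return auroc, aupr
```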
Fig. 3. Few-shot classification experiments.
We investigated the label efficiency of different pretrained models in a few-shot setting, varying the number of training labels per class (nc = 1, 2, 4, 8, 16) in the APTOS2019 (a), MESSIDOR2 (b), IDRID (c), GLAUCOMA FUNDUS (d), PAPILA (e), JSIEC (f), RETINA (g), OCTDL (h), and OCTID (i) datasets. For each nc, we sampled five different sets of training examples and trained a weakly supervised model. Boxes indicate quartile values, and whiskers extend to data points within 1.5× the interquartile range. EyeCLIP achieves significantly better performance (in terms of the mean AUROC over five runs) than the other encoders across training-set sizes and across all datasets. AUROC = area under the receiver operating characteristic curve. AUPR results can be found in Supplementary Fig. 1.
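The few-shot protocol reduces to drawing nc labeled examples per class for each of five seeds and fitting a lightweight classifier on frozen embeddings. The sketch below uses a logistic-regression probe as a stand-in; the paper's exact few-shot head and training details may differ.

```python
# Sketch of the few-shot sampling/evaluation loop in Fig. 3, using a logistic-regression
# probe on frozen image embeddings (an assumption; not necessarily the paper's head).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def sample_few_shot(labels: np.ndarray, n_per_class: int, seed: int) -> np.ndarray:
    """Indices of n_per_class examples per class (assumes each class has enough samples)."""
    rng = np.random.default_rng(seed)
    picks = [rng.choice(np.where(labels == c)[0], size=n_per_class, replace=False)
             for c in np.unique(labels)]
    return np.concatenate(picks)


def few_shot_auroc(train_emb, train_labels, test_emb, test_labels, n_per_class):
    """Mean/std AUROC over five sampled training sets, mirroring the five runs per nc."""
    scores = []
    for seed in range(5):
        idx = sample_few_shot(train_labels, n_per_class, seed)
        clf = LogisticRegression(max_iter=1000).fit(train_emb[idx], train_labels[idx])
        prob = clf.predict_proba(test_emb)
        y_score = prob[:, 1] if prob.shape[1] == 2 else prob   # binary vs. multiclass
        scores.append(roc_auc_score(test_labels, y_score, multi_class="ovr"))
    return float(np.mean(scores)), float(np.std(scores))
```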
Fig. 4. Performance of EyeCLIP across ocular, systemic, and rare disease prediction tasks.
a Supervised full-data finetuning on ocular disease tasks. EyeCLIP is on par with the second-best model, RETFound, on APTOS2019, MESSIDOR2, and OCTID (P > 0.05), and surpasses all models on the other eight datasets. b Supervised full-data finetuning on systemic disease prediction. EyeCLIP surpasses all other models (P < 0.05). c Few-shot finetuning on rare disease classification. EyeCLIP surpasses all other models (P < 0.05). Boxes indicate quartile values, and whiskers extend to data points within 1.5× the interquartile range. Detailed statistics can be found in Supplementary Tables 4 and 5. AUROC = area under the receiver operating characteristic curve, AUPR = area under the precision-recall curve.
Fig. 5. Zero-shot multimodal retrieval performance.
a Model comparison on two datasets with image-text pairs, AngioReport and Retina Image Bank. Similarity in the embedding space was computed between the query image and all text samples in the database, and the top-K most similar texts were retrieved. We report Recall@K for K ∈ {1, 5, 10} and the mean recall, which averages over K. We compared different models on text-to-image (1st column), image-to-image (2nd column), and image-to-text (3rd column) retrieval. EyeCLIP outperforms the other baselines on all retrieval tasks. Error bars indicate 95% confidence intervals. b Schematic illustration of zero-shot cross-modal retrieval. c, d Examples of top-1 retrieved images from the Retina Image Bank. More examples can be found in Supplementary Fig. 3.
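Recall@K for this kind of retrieval reduces to ranking the database by cosine similarity to each query and checking whether the paired item lands in the top K. A minimal sketch, assuming query i is paired with database item i:

```python
# Sketch of Recall@K / mean recall as reported in Fig. 5a. Assumes a one-to-one
# pairing where query i corresponds to database item i.
import torch
import torch.nn.functional as F


def recall_at_k(query_emb: torch.Tensor, db_emb: torch.Tensor, ks=(1, 5, 10)) -> dict:
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(db_emb, dim=-1)
    sim = q @ d.t()                                    # (num_queries, db_size) cosine similarity
    ranks = sim.argsort(dim=-1, descending=True)       # database indices, best match first
    target = torch.arange(q.size(0), device=q.device).unsqueeze(1)  # paired item per query
    recalls = {f"R@{k}": (ranks[:, :k] == target).any(dim=1).float().mean().item() for k in ks}
    recalls["mean"] = sum(recalls[f"R@{k}"] for k in ks) / len(ks)
    return recalls
```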

