NPJ Digit Med. 2025 Jun 21;8(1):381. doi: 10.1038/s41746-025-01772-2.

A multimodal visual-language foundation model for computational ophthalmology


Danli Shi et al.

Abstract

Early detection of eye diseases is vital for preventing vision loss. Existing ophthalmic artificial intelligence models focus on single modalities, overlooking multi-view information and struggling with rare diseases due to long-tail distributions. We propose EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images spanning 11 modalities, a subset of which is paired with clinical text. Our novel pretraining strategy combines self-supervised reconstruction, multimodal image contrastive learning, and image-text contrastive learning to capture shared representations across modalities. EyeCLIP demonstrates robust performance across 14 benchmark datasets, excelling in disease classification, visual question answering, and cross-modal retrieval. It also exhibits strong few-shot and zero-shot capabilities, enabling accurate predictions in real-world, long-tail scenarios. EyeCLIP offers significant potential for detecting both ocular and systemic diseases and for bridging gaps in real-world clinical applications.
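For intuition, the pretraining objective described above can be read as a weighted sum of three terms. The following is a minimal sketch, assuming an MAE-style pixel reconstruction loss and symmetric InfoNCE contrastive terms; the loss weights, temperature, and tensor interfaces are illustrative placeholders rather than the paper's exact formulation.

```python
# Minimal sketch of a three-part pretraining objective in the spirit of EyeCLIP:
# self-supervised reconstruction + multi-examination (image-image) contrastive
# learning + image-text contrastive learning. Weights and temperature are
# illustrative assumptions, not the published hyperparameters.
import torch
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def pretraining_loss(pixels, recon, exam_a_emb, exam_b_emb, img_emb, txt_emb,
                     w_recon=1.0, w_img=1.0, w_txt=1.0):
    """Weighted sum of reconstruction and the two contrastive objectives (weights are placeholders)."""
    loss_recon = F.mse_loss(recon, pixels)           # self-supervised reconstruction
    loss_img = info_nce(exam_a_emb, exam_b_emb)      # two examinations of the same patient
    loss_txt = info_nce(img_emb, txt_emb)            # image paired with report text
    return w_recon * loss_recon + w_img * loss_img + w_txt * loss_txt
```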


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Study diagram.
a Using an extensive multimodal database spanning nine provinces in China, we matched multi-examination images from the same patient and cleaned the medical reports with a keyword mapping dictionary of medical terminology to generate hierarchical keyword text labels. b EyeCLIP was pretrained using self-supervised reconstruction, multi-examination contrastive learning, and hierarchical text-image contrastive learning to fully leverage real-world multi-examination clinical data. c Downstream multi-country datasets for EyeCLIP validation, covering zero-shot, few-shot, and supervised finetuning scenarios. d Radar plot outlining the performance of EyeCLIP and baseline models across downstream tasks. EyeCLIP significantly outperforms the baseline models across diverse tasks, including zero-shot classification, multimodal retrieval, visual question answering (VQA), and supervised systemic disease prediction.
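The keyword-mapping step in panel a can be pictured as a dictionary lookup over report text. A minimal sketch follows; the dictionary entries and the two-level (category, finding) hierarchy are hypothetical examples, not the paper's actual terminology resource.

```python
# Hypothetical sketch of mapping free-text reports to hierarchical keyword labels.
# The terminology dictionary and hierarchy below are invented for illustration.
KEYWORD_MAP = {
    "microaneurysm": ("retina", "diabetic retinopathy"),
    "drusen": ("retina", "age-related macular degeneration"),
    "cup-to-disc": ("optic nerve", "glaucoma suspect"),
}


def extract_hierarchical_labels(report: str) -> list[tuple[str, str]]:
    """Return (category, finding) pairs whose keyword appears in the report."""
    text = report.lower()
    return [label for keyword, label in KEYWORD_MAP.items() if keyword in text]


print(extract_hierarchical_labels("Scattered microaneurysms and macular drusen noted."))
# [('retina', 'diabetic retinopathy'), ('retina', 'age-related macular degeneration')]
```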
Fig. 2. Zero-shot performance on downstream ocular diseases datasets.
a AUROC. b AUPR. Error bars represent 95% confidence intervals, and the centers correspond to the computed value of each metric. EyeCLIP achieved significantly better zero-shot performance than the other models on both AUROC and AUPR. AUROC = area under the receiver operating characteristic curve, AUPR = area under the precision-recall curve. EyeCLIP outperforms the second-best model, FLAIR, a pretrained vision-language model for universal retinal fundus image understanding. Notably, FLAIR was pretrained on public datasets, with its performance evaluated through internal validation. In contrast, EyeCLIP, which was not trained on these public datasets, demonstrated its performance through external validation, highlighting its strong generalizability.
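Zero-shot classification of this kind is typically run CLIP-style: class names become text prompts, and each image is scored by its similarity to the prompt embeddings. The sketch below assumes generic encode_image/encode_text methods, a tokenizer, and a simple prompt template, none of which are EyeCLIP's published interface; metrics are computed with scikit-learn.

```python
# Sketch of CLIP-style zero-shot scoring and the AUROC/AUPR metrics reported in Fig. 2.
# encode_image / encode_text, the tokenizer, and the prompt template are assumptions.
import torch
import torch.nn.functional as F
from sklearn.metrics import average_precision_score, roc_auc_score


@torch.no_grad()
def zero_shot_scores(model, images, class_names, tokenizer):
    prompts = [f"a fundus photograph of {name}" for name in class_names]  # hypothetical template
    txt = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
    img = F.normalize(model.encode_image(images), dim=-1)
    return (img @ txt.t()).softmax(dim=-1)            # (N, num_classes) class probabilities


def zero_shot_metrics(scores: torch.Tensor, labels_onehot):
    """Macro-averaged AUROC and AUPR over classes, as in Fig. 2a-b."""
    y_score = scores.cpu().numpy()
    auroc = roc_auc_score(labels_onehot, y_score, average="macro")
    aupr = average_precision_score(labels_onehot, y_score, average="macro")
    return auroc, aupr
```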
Fig. 3. Few-shot classification experiments.
We investigated the label efficiency of different pretrained models in a few-shot setting, varying the number of training labels per class (nc = 1, 2, 4, 8, 16) in the APTOS2019 (a), MESSIDOR2 (b), IDRID (c), GLAUCOMA FUNDUS (d), PAPILA (e), JSIEC (f), RETINA (g), OCTDL (h), and OCTID (i) datasets. For each nc, we sampled five different sets of training examples and trained a weakly supervised model. Boxes indicate quartile values, and whiskers extend to data points within 1.5× the interquartile range. EyeCLIP achieves significantly better performance (in terms of the mean AUROC over five runs) than the other encoders across training-set sizes and across all datasets. AUROC = area under the receiver operating characteristic curve. AUPR results can be found in Supplementary Fig. 1.
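The few-shot protocol reduces to drawing nc labeled examples per class for each of five seeds and fitting a lightweight classifier on frozen embeddings. The sketch below uses a logistic-regression probe as a stand-in; the paper's exact few-shot head and training details may differ.

```python
# Sketch of the few-shot sampling/evaluation loop in Fig. 3, using a logistic-regression
# probe on frozen image embeddings (an assumption; not necessarily the paper's head).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def sample_few_shot(labels: np.ndarray, n_per_class: int, seed: int) -> np.ndarray:
    """Indices of n_per_class examples per class (assumes each class has enough samples)."""
    rng = np.random.default_rng(seed)
    picks = [rng.choice(np.where(labels == c)[0], size=n_per_class, replace=False)
             for c in np.unique(labels)]
    return np.concatenate(picks)


def few_shot_auroc(train_emb, train_labels, test_emb, test_labels, n_per_class):
    """Mean/std AUROC over five sampled training sets, mirroring the five runs per nc."""
    scores = []
    for seed in range(5):
        idx = sample_few_shot(train_labels, n_per_class, seed)
        clf = LogisticRegression(max_iter=1000).fit(train_emb[idx], train_labels[idx])
        prob = clf.predict_proba(test_emb)
        y_score = prob[:, 1] if prob.shape[1] == 2 else prob   # binary vs. multiclass
        scores.append(roc_auc_score(test_labels, y_score, multi_class="ovr"))
    return float(np.mean(scores)), float(np.std(scores))
```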
Fig. 4. Performance of EyeCLIP across ocular, systemic, and rare disease prediction tasks.
a Supervised full-data finetuning on ocular disease tasks. EyeCLIP is on par with the second-best model, RETFound, on APTOS2019, MESSIDOR2, and OCTID (P > 0.05), and surpasses all models on the other eight datasets. b Supervised full-data finetuning on systemic disease prediction. EyeCLIP surpasses all other models (P < 0.05). c Few-shot finetuning on rare disease classification. EyeCLIP surpasses all other models (P < 0.05). Boxes indicate quartile values, and whiskers extend to data points within 1.5× the interquartile range. Detailed statistics can be found in Supplementary Tables 4 and 5. AUROC = area under the receiver operating characteristic curve, AUPR = area under the precision-recall curve.
Fig. 5. Zero-shot multimodal retrieval performance.
a Model comparison on two datasets with image-text pairs, AngioReport and Retina Image Bank. Similarity in the embedding space was computed between the query image and all text samples in the database, and the top-K most similar texts were retrieved. We report Recall@K for K ∈ {1, 5, 10} and the mean recall, which averages over K. We compared different models on text-to-image (1st column), image-to-image (2nd column), and image-to-text (3rd column) retrieval. EyeCLIP outperforms the other baselines on all retrieval tasks. Error bars indicate 95% confidence intervals. b Schematic illustration of zero-shot cross-modal retrieval. c, d Examples of top-1 retrieved images from the Retina Image Bank. More examples can be found in Supplementary Fig. 3.
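Recall@K for this kind of retrieval reduces to ranking the database by cosine similarity to each query and checking whether the paired item lands in the top K. A minimal sketch, assuming query i is paired with database item i:

```python
# Sketch of Recall@K / mean recall as reported in Fig. 5a. Assumes a one-to-one
# pairing where query i corresponds to database item i.
import torch
import torch.nn.functional as F


def recall_at_k(query_emb: torch.Tensor, db_emb: torch.Tensor, ks=(1, 5, 10)) -> dict:
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(db_emb, dim=-1)
    sim = q @ d.t()                                    # (num_queries, db_size) cosine similarity
    ranks = sim.argsort(dim=-1, descending=True)       # database indices, best match first
    target = torch.arange(q.size(0), device=q.device).unsqueeze(1)  # paired item per query
    recalls = {f"R@{k}": (ranks[:, :k] == target).any(dim=1).float().mean().item() for k in ks}
    recalls["mean"] = sum(recalls[f"R@{k}"] for k in ks) / len(ks)
    return recalls
```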

