Comparative analysis of generic vision-language models in detecting and diagnosing inherited retinal diseases using fundus photographs
- PMID: 41057716
- DOI: 10.1038/s41433-025-04013-8
Abstract
Background: To evaluate the clinical applicability of three generic vision-large-language models (VLLMs) - OpenAI's GPT-4omni (GPT-4o), GPT-4V(ision) and Google's Gemini - in detecting and diagnosing inherited retinal diseases (IRDs) using fundus photographs.
Methods: This head-to-head comparative study curated 60 ultra-widefield (UWF) fundus images of 30 IRD patients from the National University Hospital, Singapore. Ten open-source UWF fundus images of normal eyes were included for comparison. The 70 fundus images were analysed by the three VLLMs using standardised prompts to generate descriptions of 10 specified retinal features and to provide clinical insights. Each VLLM received 2100 scores for its feature descriptions (70 images × 10 features × 3 graders), rated by three blinded consultant-level graders on a three-point scale (0 = poor, 1 = borderline, 2 = good). Clinical insights, including disease detection, diagnosis and pathological gene inference, were evaluated against clinical ground truth.
Results: GPT-4o achieved the highest mean quality score for feature description (1.64 [0.697], mean [SEM]), outperforming GPT-4V (1.57 [0.738]) and Gemini (1.46 [0.800]; both p < 0.001). All models demonstrated high detection accuracy (≥81.4%), but Gemini incorrectly classified all normal fundus images as IRD. GPT-4o (65.7%) outperformed GPT-4V (50%) and Gemini (60%) in diagnostic accuracy. Gene inference precision remained low (≤20.3%) across all models. High concordance was observed in all models between feature descriptions and diagnoses (≥97.1%) and between diagnoses and clinical recommendations (100%).
Conclusions: GPT-4o and GPT-4V demonstrated promising potential for detecting IRDs from fundus photographs, with good feature-extraction capability and high detection accuracy. Gemini struggled, misclassifying normal fundus images as diseased. All three VLLMs require further refinement to improve diagnostic accuracy and gene inference.
© 2025. The Author(s), under exclusive licence to The Royal College of Ophthalmologists.
Conflict of interest statement
Competing interests: The authors declare no competing interests.