Eye (Lond). 2025 Oct 7. doi: 10.1038/s41433-025-04013-8. Online ahead of print.

Comparative analysis of generic vision-language models in detecting and diagnosing inherited retinal diseases using fundus photographs


Xiang Meng et al. Eye (Lond).

Abstract

Background: To evaluate the clinical applicability of three generic Vision-Large-Language Models (VLLMs) - OpenAI's GPT-4omni (GPT-4o) and GPT-4V(ision), and Google's Gemini - in detecting and diagnosing inherited retinal diseases (IRDs) using fundus photographs.

Methods: This head-to-head comparative study curated 60 ultra-widefield (UWF) fundus images from 30 IRD patients at the National University Hospital, Singapore. Ten open-source UWF fundus images of normal eyes were included for comparison. The 70 fundus images were analysed by the three VLLMs using standardised prompts to generate descriptions of 10 specified retinal features and to provide clinical insights. Each VLLM received 2100 scores for its descriptions across the ten features, rated by three blinded consultant-level graders on a three-point scale (0 = poor, 1 = borderline, 2 = good). Clinical insights, including disease detection, diagnosis and pathological gene inference, were evaluated against clinical ground truth.
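The figure of 2100 scores per model follows directly from the study design described above: 70 images, 10 specified retinal features per image, and 3 blinded graders. A minimal sketch of that arithmetic (all counts taken from the Methods):

```python
# Grading workload per VLLM, per the Methods section.
n_ird_images = 60      # UWF images from 30 IRD patients
n_normal_images = 10   # open-source normal UWF images
n_features = 10        # specified retinal features described per image
n_graders = 3          # blinded consultant-level graders

n_images = n_ird_images + n_normal_images          # 70 images in total
scores_per_model = n_images * n_features * n_graders

print(scores_per_model)  # 2100, matching the count reported in the Methods
```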

Results: GPT-4o achieved the highest mean quality score in feature description (1.64 [0.697], mean [SEM]), outperforming GPT-4V (1.57 [0.738]) and Gemini (1.46 [0.800]; both p < 0.001). All models demonstrated high detection accuracy (≥81.4%), but Gemini incorrectly classified all normal fundus images as IRD. GPT-4o (65.7%) outperformed GPT-4V (50%) and Gemini (60%) in diagnostic accuracy. Gene inference precision remained low (≤20.3%) across all models. High concordance was observed across all models between feature descriptions and diagnoses (≥97.1%), and between diagnoses and clinical recommendations (100%).

Conclusions: GPT-4o and GPT-4V demonstrated promising potential in detecting IRDs from fundus photographs, with good feature-extraction capabilities and high detection accuracy. Gemini, however, misclassified normal fundus images as IRD. All three VLLMs require further refinement to improve diagnostic accuracy and gene inference.


Conflict of interest statement

Competing interests: The authors declare no competing interests.
