Vision-Language Models for Feature Detection of Macular Diseases on Optical Coherence Tomography

Fares Antaki et al. JAMA Ophthalmol. 2024 Jun 1;142(6):573-576. doi: 10.1001/jamaophthalmol.2024.1165.
Abstract

Importance: Vision-language models (VLMs) are a novel artificial intelligence technology capable of processing image and text inputs. While demonstrating strong generalist capabilities, their performance in ophthalmology has not been extensively studied.

Objective: To assess the performance of the Gemini Pro VLM in expert-level tasks for macular diseases from optical coherence tomography (OCT) scans.

Design, setting, and participants: This was a cross-sectional diagnostic accuracy study evaluating a generalist VLM on ophthalmology-specific tasks using the open-source Optical Coherence Tomography Image Database. The dataset included OCT B-scans from 50 unique patients: healthy individuals and those with macular hole, diabetic macular edema, central serous chorioretinopathy, and age-related macular degeneration. Each OCT scan was labeled for 10 key pathological features, referral recommendations, and treatments. The images were captured using a Cirrus high definition OCT machine (Carl Zeiss Meditec) at Sankara Nethralaya Eye Hospital, Chennai, India, and the dataset was published in December 2018. Image acquisition dates were not specified.
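
For readers who want to reproduce this kind of evaluation, each labeled scan can be represented as a small record pairing the image with its expert annotations. The sketch below (Python) is illustrative only; the field names and example feature names are assumptions, not the dataset's published schema.

    from dataclasses import dataclass, field

    # Hypothetical record for one labeled OCT B-scan. Field names and the
    # example features are illustrative, not the dataset's actual schema.
    @dataclass
    class LabeledScan:
        image_path: str
        diagnosis: str                 # e.g., "normal", "macular hole", "DME", "CSC", "AMD"
        features: dict[str, bool] = field(default_factory=dict)  # 10 pathological features, present/absent
        referral: str = ""             # expert referral recommendation
        treatment: str = ""            # expert treatment recommendation

    example = LabeledScan(
        image_path="octid/scan_001.png",   # hypothetical file path
        diagnosis="macular hole",
        features={"subretinal fluid": False, "pigment epithelial detachment": False},
        referral="urgent referral",        # hypothetical label wording
        treatment="surgical management",   # hypothetical label wording
    )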

Exposures: Gemini Pro, using a standard prompt to extract structured responses on December 15, 2023.
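
The abstract does not reproduce the prompt itself; the Figure indicates that responses were requested in JSON. Below is a minimal sketch of how such a structured query could be issued with the google-generativeai Python SDK. The model name, prompt wording, and JSON schema are assumptions for illustration, not the study's actual protocol.

    import json

    import google.generativeai as genai   # pip install google-generativeai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-pro-vision")   # assumed model name

    PROMPT = (
        "You are shown a macular OCT B-scan. Return a JSON object with "
        "'features' (each listed pathological feature mapped to true/false), "
        "'diagnosis', 'referral', and 'treatment'. Respond with JSON only."
    )

    def query_scan(image_path: str) -> dict:
        """Send one OCT image with the prompt and parse the JSON reply."""
        image = Image.open(image_path)
        response = model.generate_content([PROMPT, image])
        # The model may wrap its JSON in Markdown fences; strip them before parsing.
        text = response.text.strip().removeprefix("```json").removesuffix("```").strip()
        return json.loads(text)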

Main outcomes and measures: The primary outcome was the comparison of model responses against expert labels, with F1 scores calculated for each pathological feature. Secondary outcomes included accuracy in diagnosis, referral urgency, and treatment recommendation. The model's internal concordance was evaluated by measuring the alignment between referral and treatment recommendations, independent of diagnostic accuracy.
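
Per-feature F1 scores follow from the true positives, false positives, and false negatives obtained by comparing the model's binary feature calls with the expert labels (the Figure legend refers to TP and TN counts). A minimal sketch, using the hypothetical record layout above:

    def f1_score(tp: int, fp: int, fn: int) -> float:
        """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    def per_feature_f1(predictions: list[dict[str, bool]],
                       labels: list[dict[str, bool]],
                       feature: str) -> float:
        # Count true positives, false positives, and false negatives for one feature.
        tp = sum(p[feature] and l[feature] for p, l in zip(predictions, labels))
        fp = sum(p[feature] and not l[feature] for p, l in zip(predictions, labels))
        fn = sum(not p[feature] and l[feature] for p, l in zip(predictions, labels))
        return f1_score(tp, fp, fn)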

Results: The mean F1 score was 10.7% (95% CI, 2.4-19.2). Measurable F1 scores were obtained for macular hole (36.4%; 95% CI, 0-71.4), pigment epithelial detachment (26.1%; 95% CI, 0-46.2), subretinal hyperreflective material (24.0%; 95% CI, 0-45.2), and subretinal fluid (20.0%; 95% CI, 0-45.5). A correct diagnosis was achieved in 17 of 50 cases (34%; 95% CI, 22-48). Referral recommendations varied: 28 of 50 were correct (56%; 95% CI, 42-70), 10 of 50 were overcautious (20%; 95% CI, 10-32), and 12 of 50 were undercautious (24%; 95% CI, 12-36). Referral and treatment concordance were very high, with 48 of 50 (96%; 95% CI, 90-100) and 48 of 49 (98%; 95% CI, 94-100) correct answers, respectively.
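
The abstract does not state how these 95% CIs were computed. For the proportion outcomes, a Wilson score interval is one standard choice and, for the diagnostic accuracy of 17/50, yields roughly 22% to 48%, consistent with the reported interval; this is an assumption about the method, not a statement of what the authors used.

    from math import sqrt

    def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """Wilson score 95% confidence interval for a binomial proportion."""
        p = successes / n
        centre = p + z * z / (2 * n)
        margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        denom = 1 + z * z / n
        return (centre - margin) / denom, (centre + margin) / denom

    lo, hi = wilson_ci(17, 50)   # diagnostic accuracy: 17 of 50 correct
    print(f"{17 / 50:.0%} (95% CI, {lo:.0%}-{hi:.0%})")   # ~34% (22%-48%)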

Conclusions and relevance: In this study, a generalist VLM demonstrated limited vision capabilities for feature detection and management of macular disease. However, it showed low self-contradiction, suggesting strong language capabilities. As VLMs continue to improve, validating their performance on large benchmarking datasets will help ascertain their potential in ophthalmology.


Conflict of interest statement

Conflict of Interest Disclosures: Dr Chopra reported previous employment at Google. Dr Keane reported consulting fees from Google DeepMind during the conduct of the study, as well as consulting fees from Roche, Novartis, Bayer, Boehringer Ingelheim, and Apellis, and other support from Bitfount (stock options) and Big Picture Medical (equity) outside the submitted work. No other disclosures were reported.

Figures

Figure. Overview of the Prompting Strategy and Response Evaluation
JSON indicates JavaScript Object Notation; OCT, optical coherence tomography; TN, true negative; TP, true positive; VLM, vision-language model.
