Ophthalmol Sci. 2023 May 5;3(4):100324. doi: 10.1016/j.xops.2023.100324. eCollection 2023 Dec.

Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings


Fares Antaki et al. Ophthalmol Sci. 2023.

Abstract

Purpose: Foundation models are a novel type of artificial intelligence algorithm, in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space.

Design: Evaluation of diagnostic test or technology.

Participants: ChatGPT is a publicly available LLM.

Methods: We tested 2 versions of ChatGPT (January 9 "legacy" and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey's test to determine whether there were significant differences between the tested subspecialties. A brief code sketch of this analysis follows below.
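The analysis described above can be illustrated with a minimal sketch (not the authors' code). It assumes a hypothetical table of graded responses with columns for correctness, examination section, cognitive level, and difficulty index, and uses statsmodels for the logistic regression, a likelihood ratio chi-square for a single predictor, and Tukey's test for the post hoc pairwise comparisons.

```python
# Minimal sketch of the described analysis; column names and file are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("responses.csv")  # hypothetical file: one row per graded question

# Logistic regression: do section, cognitive level, and difficulty predict accuracy?
full = smf.logit("correct ~ C(section) + C(cognitive_level) + difficulty",
                 data=df).fit(disp=0)

# Likelihood ratio (LR) chi-square for one predictor, e.g., examination section,
# by comparing the full model with a reduced model that omits it.
reduced = smf.logit("correct ~ C(cognitive_level) + difficulty",
                    data=df).fit(disp=0)
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model
p_value = chi2.sf(lr_stat, df_diff)
print(f"Section: LR = {lr_stat:.2f}, P = {p_value:.3f}")

# Post hoc pairwise comparison of accuracy across examination sections (Tukey's test).
tukey = pairwise_tukeyhsd(endog=df["correct"], groups=df["section"], alpha=0.05)
print(tukey.summary())
```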

Main outcome measures: We reported the accuracy of ChatGPT for each examination section in percentage correct by comparing ChatGPT's outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.
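As an illustration of the primary outcome, the following sketch (again using hypothetical column names) scores ChatGPT's outputs against the answer key and reports the percentage correct overall and per examination section.

```python
# Minimal sketch of the accuracy calculation; column names are hypothetical.
import pandas as pd

df = pd.read_csv("responses.csv")  # hypothetical file of ChatGPT answers and the key
df["correct"] = (df["chatgpt_answer"] == df["key"]).astype(int)

overall = 100 * df["correct"].mean()
by_section = 100 * df.groupby("section")["correct"].mean()
print(f"Overall accuracy: {overall:.1f}%")
print(by_section.round(1))
```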

Results: The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006) followed by question difficulty (LR, 24.05; P < 0.001) were most predictive of ChatGPT's answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.

Conclusion: ChatGPT has encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties.

Financial disclosures: Proprietary or commercial disclosure may be found after the references.

Keywords: Artificial intelligence; ChatGPT; Generative Pretrained Transformer; Medical education; Ophthalmology.


Figures

Figure 1. Alluvial diagram illustrating the distribution of questions across examination sections, cognitive level, and question difficulty. Despite having been generated at random, the Basic and Clinical Science Course (BCSC) and OphthoQuestions test sets have a similar distribution of questions with high and low cognitive levels and similar difficulty.

Figure 2. Bar plot of the accuracy of ChatGPT across examination sections and ChatGPT models for the Basic and Clinical Science Course (BCSC) and OphthoQuestions testing sets. The ChatGPT Plus model accuracy is shown with error bars representing the standard deviation from the 3 experimental runs.
