Cureus. 2023 Jun 22;15(6):e40822. doi: 10.7759/cureus.40822. eCollection 2023 Jun.

Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions


Majid Moshirfar et al. Cureus.

Abstract

Importance: Chat Generative Pre-Trained Transformer (ChatGPT) has shown promising performance in various fields, including medicine, business, and law, but its accuracy on specialty-specific medical questions, particularly in ophthalmology, remains uncertain.

Purpose: This study evaluates the performance of two ChatGPT models (GPT-3.5 and GPT-4) and human professionals in answering ophthalmology questions from the StatPearls question bank, assessing their outcomes and providing insights into the integration of artificial intelligence (AI) technology in ophthalmology.

Methods: ChatGPT's performance was evaluated using 467 ophthalmology questions from the StatPearls question bank. These questions were stratified into 11 subcategories, four difficulty levels, and three generalized anatomical categories. The answer accuracy of GPT-3.5, GPT-4, and human participants was assessed. Statistical analysis was conducted with the Kolmogorov-Smirnov test for normality, one-way analysis of variance (ANOVA) for the statistical significance of GPT-3.5 versus GPT-4 versus human performance, and repeated unpaired two-sample t-tests to compare the means of two groups.

Results: GPT-4 outperformed both GPT-3.5 and human professionals on ophthalmology StatPearls questions, except in the "Lens and Cataract" category. The performance differences were statistically significant overall, with GPT-4 achieving higher accuracy (73.2%) than GPT-3.5 (55.5%, p < 0.001) and humans (58.3%, p < 0.001). Performance varied across difficulty levels (rated one to four), but GPT-4 consistently outperformed both GPT-3.5 and humans on level-two, -three, and -four questions. On level-four questions, human performance significantly exceeded that of GPT-3.5 (p = 0.008).

Conclusion: GPT-4 showed significant performance improvements over GPT-3.5 and human professionals on StatPearls ophthalmology questions. These results highlight the potential of advanced conversational AI systems to serve as important tools in medical education and practice.
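To make the described statistical workflow concrete, the sketch below shows how per-question accuracy for the three groups could be compared with SciPy. This is an illustrative reconstruction, not the authors' analysis code: the per-question score vectors are randomly generated placeholders matched to the reported accuracies, and the binary correct/incorrect coding is an assumption.

```python
# Illustrative sketch of the reported statistical workflow (not the authors' code).
# Per-question scores are placeholders: 1 = correct, 0 = incorrect (assumed coding).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_questions = 467

# Hypothetical per-question accuracy vectors for the three groups.
gpt35 = rng.binomial(1, 0.555, n_questions)   # ~55.5% accuracy
gpt4  = rng.binomial(1, 0.732, n_questions)   # ~73.2% accuracy
human = rng.binomial(1, 0.583, n_questions)   # ~58.3% accuracy

# 1. Kolmogorov-Smirnov test for normality of each group's scores.
for name, scores in [("GPT-3.5", gpt35), ("GPT-4", gpt4), ("Human", human)]:
    ks_stat, ks_p = stats.kstest(scores, "norm", args=(scores.mean(), scores.std()))
    print(f"{name}: KS statistic = {ks_stat:.3f}, p = {ks_p:.3g}")

# 2. One-way ANOVA across GPT-3.5, GPT-4, and human performance.
f_stat, anova_p = stats.f_oneway(gpt35, gpt4, human)
print(f"ANOVA: F = {f_stat:.2f}, p = {anova_p:.3g}")

# 3. Pairwise unpaired two-sample t-tests between groups.
for (a_name, a), (b_name, b) in [(("GPT-4", gpt4), ("GPT-3.5", gpt35)),
                                 (("GPT-4", gpt4), ("Human", human)),
                                 (("Human", human), ("GPT-3.5", gpt35))]:
    t_stat, t_p = stats.ttest_ind(a, b)
    print(f"{a_name} vs {b_name}: t = {t_stat:.2f}, p = {t_p:.3g}")
```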

Keywords: artificial intelligence; chatbot; chatgpt-3.5; chatgpt-4; clinical decision-making; conversational ai; conversational generative pre-trained transformer; cornea; ophthalmology; statpearls.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Comparing the performance of GPT-3.5, GPT-4, and human professionals on StatPearls questions divided into ophthalmology subcategories. * indicates statistical significance.

Figure 2. Comparing the performance of GPT-3.5, GPT-4, and humans on StatPearls questions divided into generalized anatomically based categories. The “anterior segment” included the cornea, cataract, and refractive surgery categories; the “posterior segment” included the retina and vitreous category; the “other” category comprised neuro-ophthalmology, pediatrics, and oculoplastics. Questions from the glaucoma, pathology, and uveitis categories were individually divided among the “anterior,” “posterior,” and “other” categories according to question content. *, ** indicate statistical significance.

Figure 3. Comparing the performance of GPT-3.5, GPT-4, and humans on StatPearls questions divided by difficulty levels. Level 1 indicated the “basic” difficulty level and tested recall; Level 2 indicated “moderate” difficulty and tested the ability to comprehend basic facts; Level 3 was described as “difficult” and tested application, or knowledge use in care; Level 4 was considered an “expert” high-complexity question and tested analysis and evaluation skills. *, **, † indicate statistical significance.
