Cureus. 2024 Sep 17;16(9):e69612. doi: 10.7759/cureus.69612. eCollection 2024 Sep.

Comparison of Gemini Advanced and ChatGPT 4.0's Performances on the Ophthalmology Resident Ophthalmic Knowledge Assessment Program (OKAP) Examination Review Question Banks

Gurnoor S Gill et al. Cureus. 2024.

Abstract

Background: With advancements in natural language processing, tools such as Chat Generative Pre-Trained Transformer (ChatGPT) version 4.0 and Google's Gemini Advanced are increasingly being evaluated for their potential in various medical applications. The objective of this study was to systematically assess the performance of these large language models (LLMs) on both image-based and non-image-based questions within the specialized field of Ophthalmology. We used a review question bank for the Ophthalmic Knowledge Assessment Program (OKAP), used by ophthalmology residents nationally to prepare for the Ophthalmology Board Exam, to assess the accuracy and performance of ChatGPT and Gemini Advanced.

Methodology: A total of 260 randomly generated multiple-choice questions from the OphthoQuestions question bank were run through ChatGPT and Gemini Advanced. A simulated 260-question OKAP examination was created at random from the bank. Question-specific data were analyzed, including overall percent correct, subspecialty accuracy, whether the question was "high yield," difficulty (1-4), and question type (e.g., image, text). To compare the performance of ChatGPT-4.0 and Gemini Advanced across levels of question difficulty, we used the standard deviation of user answer choices to determine each question's difficulty. Statistical analysis was conducted in Google Sheets using two-tailed t-tests with unequal variance to compare the performance of ChatGPT-4.0 and Google's Gemini Advanced across question types, subspecialties, and difficulty levels.

Results: In total, 259 of the 260 questions were included in the study, as one question used a video that ChatGPT could not interpret as of May 1, 2024. For text-only questions, ChatGPT-4.0 correctly answered 57.14% (148/259, p < 0.018) and Gemini Advanced correctly answered 46.72% (121/259, p < 0.018). Both models answered most questions without a prompt and would have received a below-average score on the OKAP. Moreover, 27 questions required a secondary prompt in ChatGPT-4.0 compared to 67 questions in Gemini Advanced. ChatGPT-4.0 answered 68.99% of easier questions (<2 on a scale from 1-4) and 44.96% of harder questions (>2 on a scale from 1-4) correctly, whereas Gemini Advanced answered 49.61% of easier questions and 44.19% of harder questions correctly. There was a statistically significant difference in accuracy between ChatGPT-4.0 and Gemini Advanced for easy (p < 0.0015) but not for hard (p < 0.55) questions. For image-only questions, ChatGPT-4.0 correctly answered 39.58% (19/48, p < 0.013) and Gemini Advanced correctly answered 33.33% (16/48, p < 0.022), a statistically insignificant difference between the two models (p < 0.530). Comparing text-only against image-based questions demonstrated a statistically significant difference in accuracy for both ChatGPT-4.0 (p < 0.013) and Gemini Advanced (p < 0.022).

Conclusions: This study provides evidence that ChatGPT-4.0 performs better than Gemini Advanced on OKAP-style ophthalmic multiple-choice questions, suggesting a potential role for ChatGPT in ophthalmic medical education. While these models show promise in medical education, caution is warranted, as a more detailed evaluation of their reliability is needed.
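For readers who want to see the shape of the comparison described above, the following is a minimal sketch, not the authors' code (the abstract states the analysis was performed in Google Sheets). It assumes per-question correctness has been coded as 1 (correct) or 0 (incorrect) for each model and that a 1-4 difficulty rating has been derived from the standard deviation of examinee answer choices; the file name okap_results.csv and its column names are hypothetical. Welch's test (equal_var=False) corresponds to the two-tailed t-test with unequal variance described in the Methodology.

import pandas as pd
from scipy import stats

# Hypothetical per-question results table with columns:
#   chatgpt_correct (0/1), gemini_correct (0/1), difficulty (1-4 scale)
df = pd.read_csv("okap_results.csv")

# Overall accuracy and a two-tailed t-test with unequal variance (Welch's test).
t_stat, p_value = stats.ttest_ind(
    df["chatgpt_correct"], df["gemini_correct"], equal_var=False
)
print(f"ChatGPT-4.0: {df['chatgpt_correct'].mean():.2%}  "
      f"Gemini Advanced: {df['gemini_correct'].mean():.2%}  (p = {p_value:.4f})")

# Stratify by difficulty: "easier" (<2) versus "harder" (>2) on the 1-4 scale.
for label, subset in [("easier", df[df["difficulty"] < 2]),
                      ("harder", df[df["difficulty"] > 2])]:
    t, p = stats.ttest_ind(
        subset["chatgpt_correct"], subset["gemini_correct"], equal_var=False
    )
    print(f"{label}: ChatGPT-4.0 {subset['chatgpt_correct'].mean():.2%} vs "
          f"Gemini Advanced {subset['gemini_correct'].mean():.2%}  (p = {p:.4f})")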

Keywords: artificial intelligence in education; artificial intelligence in medicine; chatgpt; customized medical education; general ophthalmology; large language model (llm); ophthalmology.


Conflict of interest statement

Human subjects: All authors have confirmed that this study did not involve human participants or tissue.
Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.
Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the following:
Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work.
Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work.
Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.

Figures

Figure 1. Histogram of standard deviation in human answers for Ophthalmic Knowledge Assessment Program (OKAP) questions.
Figure 2. Comparison of GPT-4 and Gemini Advanced percentage correct across various question types.
Figure 3. Comparison of GPT-4 and Gemini Advanced percentage correct across question difficulty.
Figure 4. Comparison of GPT-4 and Gemini Advanced percentage correct across ophthalmology subspecialties.
