J Med Educ Curric Dev. 2026 Jan 2;13:23821205251409499. doi: 10.1177/23821205251409499. eCollection 2026 Jan-Dec.

Comparative Performance Evaluation of Large Language Models and Human Teachers in Answering Optometry Questions from Medical Undergraduates


Zijing Huang et al. J Med Educ Curric Dev.

Abstract

Purpose: We aimed to evaluate the performance of 5 large language models (LLMs) and human teachers in answering optometry-related questions raised by medical undergraduates.

Methods: This prospective, comparative study collected 108 questions from 30 students. The questions were sent to the students' teachers for responses and were also entered into 5 LLMs, comprising 2 local models (Mistral-7B and Llama-2-13B) and 3 online models (Claude-3, Gemini-1.0 pro, and GPT-4.0), to generate corresponding answers. All answers were independently rated by 2 optometry experts, blinded to the answer source, for accuracy, completeness, comprehensibility, and overall quality on a 5-point scale. Students completed a 6-item questionnaire on their satisfaction with, and perspectives on, the integration of LLMs.

Results: LLMs responded more quickly and generated more extensive answers compared to humans (P < .001). In terms of overall performance, human teachers ranked fifth among the 6 participants, with scores significantly lower than GPT-4.0 (P < .001), Claude-3 (P < .001), and Gemini-1.0 pro (P < .001). GPT-4.0 received the highest scores for accuracy (3.87/5) and completeness (4.11/5), while Claude-3 excelled in comprehensibility (3.91/5) and overall quality (3.93/5); however, the differences between them were not statistically significant. Online LLMs outperformed both humans and locally deployed LLMs (P < .001). Students agreed that LLMs provided more comprehensive and detailed information (3.80/5), but found human answers easier to understand (4.17/5). They were less supportive of replacing teachers with LLMs for answering questions (2.93/5).
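As a rough illustration of the kind of group comparison behind these P values: the study's figures describe least-significant difference (LSD) post-hoc tests, which amount to unadjusted pairwise t-tests carried out after a significant one-way ANOVA. The sketch below uses entirely synthetic 5-point ratings (the group names and values are placeholders, not the study's data):

```python
# Sketch of an LSD-style post-hoc analysis: a one-way ANOVA across all
# groups, followed by plain (unadjusted) pairwise t-tests if significant.
# All rating values below are synthetic placeholders, NOT the study's data.
from itertools import combinations
from scipy import stats

ratings = {  # hypothetical 5-point expert ratings per responder
    "GPT-4.0":  [4, 4, 5, 3, 4, 4, 5, 4],
    "Claude-3": [4, 3, 4, 4, 5, 4, 3, 4],
    "Teacher":  [3, 2, 3, 3, 2, 4, 3, 3],
}

# Omnibus one-way ANOVA across all groups
f_stat, p_anova = stats.f_oneway(*ratings.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

# LSD proceeds with unadjusted pairwise t-tests only when the ANOVA
# itself is significant.
if p_anova < 0.05:
    for a, b in combinations(ratings, 2):
        t, p = stats.ttest_ind(ratings[a], ratings[b])
        print(f"{a} vs {b}: t={t:.2f}, p={p:.4f}")
```

Because LSD applies no multiplicity correction, it is more liberal than alternatives such as Tukey's HSD; it controls the family-wise error rate only through the preceding ANOVA gate.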

Conclusion: Our findings demonstrate the potential of LLMs to serve as valuable tools in optometry education, particularly in addressing students' real-world questions.

Keywords: answering questions; artificial intelligence; large language models; medical education; optometry.


Conflict of interest statement

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1. Flowchart of the study.

Figure 2. Answering performance of large language models (LLMs) and human teachers. (A-D) Optometry experts' ratings for accuracy (A), completeness (B), comprehensibility (C), and overall assessment (D) of the answers provided by the human teachers and 5 LLMs. Data are presented as mean ± standard deviation and were analyzed using the least-significant difference (LSD) post-hoc test for pairwise multiple comparisons. Differences among the LLMs are not marked in the figure.

Figure 3. Time spent answering questions (A) and word count of the answers (B) from human teachers and 5 LLMs. Data are presented as mean ± standard deviation and were analyzed using the least-significant difference (LSD) post-hoc test for pairwise multiple comparisons. Differences among the LLMs are not marked in the figure.

Figure 4. A 6-item questionnaire on satisfaction and comments regarding the integration of large language models in the optometry course. Scoring results are shown in the right panel, presented as mean ± SD.
