Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3

Fang-Fang Zhao^#¹, Han-Jie He^#¹,², Jia-Jian Liang^#¹, Jingyun Cen³, Yun Wang¹, Hongjie Lin¹, Feifei Chen¹, Tai-Ping Li¹, Jian-Feng Yang¹, Lan Chen¹, Ling-Ping Cen⁴

Affiliations

¹ Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China.
² Shantou University Medical College, Shantou, Guangdong, China.
³ Shaoguan University Medical College, Shaoguan, China.
⁴ Guangdong Provincial Key Laboratory of Medical Immunology and Molecular Diagnostics, School of Medical Technology, Guangdong Medical University, Zhanjiang, China. cenlp@hotmail.com.

^# Contributed equally.

PMID: 39690303
PMCID: PMC11978972 (available on 2026-04-01)
DOI: 10.1038/s41433-024-03545-9

Comparative Study

Fang-Fang Zhao et al. Eye (Lond). 2025 Apr;39(6):1132-1137. doi: 10.1038/s41433-024-03545-9. Epub 2024 Dec 17.

Abstract

Background/objective: This study evaluated the accuracy, comprehensiveness, and readability of responses generated by four large language models (LLMs), ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Claude 3, to clinical questions about uveitis, using repeated grading by uveitis specialists.

Methods: Twenty-seven clinical uveitis questions were presented individually to each of the four LLMs: ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Claude 3. Three experienced uveitis specialists independently assessed the accuracy of each response on a three-point scale across three rounds, with a 48-hour wash-out interval between rounds. The final accuracy rating for each response ('Excellent', 'Marginal', or 'Deficient') was determined by majority consensus. Comprehensiveness was evaluated on a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was measured with the Flesch-Kincaid Grade Level formula. Differences among the LLMs were tested statistically, with significance set at p < 0.05.
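The two mechanical steps named in the Methods, majority consensus and the Flesch-Kincaid Grade Level, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the syllable counter is a crude vowel-group heuristic (dedicated readability tools use more careful rules), and returning `None` when no rating has a majority is an assumption, since the abstract does not say how ties were handled or how the three rounds were combined. The grade-level formula itself is the standard one: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59.

```python
import re
from collections import Counter


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid Grade Level formula applied to plain text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def majority_rating(ratings):
    """Return the rating chosen by more than half of the graders, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None


if __name__ == "__main__":
    print(majority_rating(["Excellent", "Excellent", "Marginal"]))
    print(flesch_kincaid_grade("Uveitis is inflammation inside the eye. It can blur vision."))
```

A lower grade level indicates text that is easier to read, which is the sense in which readability is compared across models in the Results.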

Results: Claude 3 and ChatGPT-4.0 demonstrated significantly higher accuracy than Gemini (p < 0.001). Claude 3 had the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT-4.0 (88.9%). ChatGPT-3.5, Claude 3, and ChatGPT-4.0 had no responses rated 'Deficient', whereas 14.8% of Gemini's responses were rated 'Deficient' (p = 0.014). ChatGPT-4.0 and Claude 3 were each more comprehensive than Gemini (p = 0.008 and p = 0.042, respectively). Gemini showed significantly better readability than ChatGPT-3.5, Claude 3, and ChatGPT-4.0 (p < 0.001). Gemini's responses also contained fewer words, letter characters, and sentences than those of ChatGPT-3.5 and Claude 3.
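Because each model answered 27 questions, the reported percentages can be read back as whole-number counts. The short check below is an inference from the abstract, not data taken from the paper: the counts 26, 24, and 4 are assumed values chosen because they reproduce the reported figures to one decimal place.

```python
TOTAL_QUESTIONS = 27  # number of questions stated in the Methods


def percent(count: int, total: int = TOTAL_QUESTIONS) -> float:
    """Share of responses as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Inferred counts (assumption): chosen to match the percentages in the Results.
inferred_counts = {
    "Claude 3 rated Excellent": 26,
    "ChatGPT-4.0 rated Excellent": 24,
    "Gemini rated Deficient": 4,
}

for label, count in inferred_counts.items():
    print(f"{label}: {count}/{TOTAL_QUESTIONS} = {percent(count)}%")
```

With only 27 questions, one response shifts a percentage by about 3.7 points, which is worth keeping in mind when comparing the models.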

Conclusions: Claude 3 and ChatGPT-4.0 provided more accurate and more comprehensive information about uveitis than Gemini. Both models show promise as tools for improving patients' understanding of, and involvement in, their uveitis care.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

