Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3
- PMID: 39690303
- PMCID: PMC11978972
- DOI: 10.1038/s41433-024-03545-9
Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3
Abstract
Background/objective: This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by various Large Language Models (LLMs) (ChatGPT-3.5, Gemini, Claude 3, and GPT-4.0) in the clinical context of uveitis, utilizing a meticulous grading methodology.
Methods: Twenty-seven clinical uveitis questions were presented individually to four Large Language Models (LLMs): ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined through a majority consensus approach. Comprehensiveness was evaluated using a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to discern significant differences among LLMs, employing a significance threshold of p < 0.05.
Results: Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy compared to Gemini (p < 0.001). Claude 3 also showed the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated as 'Deficient', unlike Gemini (14.8%) (p = 0.014). ChatGPT 4 exhibited greater comprehensiveness compared to Gemini (p = 0.008), and Claude 3 showed higher comprehensiveness compared to Gemini (p = 0.042). Gemini showed significantly better readability compared to ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001). Gemini also had fewer words, letter characters, and sentences compared to ChatGPT 3.5 and Claude 3.
Conclusions: Our study highlights the outstanding performance of Claude 3 and ChatGPT 4 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 emerge as pivotal tools in improving patient understanding and involvement in their uveitis healthcare journey.
© 2024. The Author(s), under exclusive licence to The Royal College of Ophthalmologists.
Conflict of interest statement
Competing interests: The authors declare no competing interests.
References
-
- Samant RM, Bachute MR, Gite S, Kotecha K. Framework for deep learning-based language models using multi-task learning in natural language understanding: a systematic literature review and future directions. IEEE Access. 2022;10:17078–97.
-
- Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI Chatbot for medicine. N Engl J Med. 2023;388:1233–9. - PubMed
-
- Biswas S, Davies LN, Sheppard AL, Logan NS, Wolffsohn JS. Utility of artificial intelligence‐based large language models in ophthalmic care. Ophthalmic Physiol Opt. 2024;44:641–71. - PubMed
Publication types
MeSH terms
LinkOut - more resources
Full Text Sources
