AI in clinical decision-making: ChatGPT-4 vs. Llama2 for otolaryngology cases
- PMID: 40220179
- DOI: 10.1007/s00405-025-09371-3
Abstract
Purpose: To evaluate the diagnostic accuracy, the appropriateness of additional examination recommendations, and the consistency of the therapeutic regimens proposed by ChatGPT-4 and Llama2 for real otolaryngology cases.
Methods: A prospective controlled study was conducted on 98 anonymized otolaryngology cases. Clinical information was entered into ChatGPT-4 and Llama2 to obtain primary diagnoses, additional examination recommendations, and treatment strategies. Two independent otolaryngologists rated the AI outputs with the Artificial Intelligence Performance Instrument (AIPI), assessing diagnostic accuracy, appropriateness of the recommended examinations, and adequacy of treatment. Statistical comparisons were conducted between the AI systems and the expert decisions. Interrater reliability was evaluated with kappa statistics.
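To make the interrater-reliability step concrete, the following is a minimal Python sketch of how Cohen's kappa between the two otolaryngologists' AIPI ratings could be computed. The rating vectors are hypothetical illustrations, not the study's data, and the abstract does not state which software the authors used.

# Minimal sketch of the interrater-reliability computation described above.
# Hypothetical data: 1 = "appropriate", 0 = "inappropriate" ratings given by
# the two independent otolaryngologists to the same ten AI outputs.
from sklearn.metrics import cohen_kappa_score

rater_1 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # otolaryngologist 1
rater_2 = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]  # otolaryngologist 2

kappa = cohen_kappa_score(rater_1, rater_2)  # chance-corrected agreement
print(f"Cohen's kappa = {kappa:.2f}")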
Results: ChatGPT-4 reached a correct diagnosis in 82% of cases, outperforming Llama2 (76%). For additional examinations, ChatGPT-4 suggested relevant and appropriate tests in 88% of cases, versus 83% for Llama2. Treatment recommendations were appropriate in 80% of cases for ChatGPT-4 and 72% for Llama2. Both systems occasionally suggested inappropriate tests. Interrater reliability for AIPI scores was high (kappa = 0.85).
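Because both models were evaluated on the same 98 cases, the two accuracy figures are paired measurements. One standard test for such a paired difference is McNemar's test, sketched below with illustrative counts chosen to be consistent with the reported 82% and 76% accuracies; the abstract does not specify which statistical test the authors actually applied.

# Hedged sketch: McNemar's test on paired correct/incorrect outcomes.
# The 2x2 table counts are invented to match the reported accuracies
# (80/98 ~ 82% for ChatGPT-4, 74/98 ~ 76% for Llama2); not the study's raw data.
from statsmodels.stats.contingency_tables import mcnemar

#                    Llama2 correct | Llama2 wrong
# ChatGPT-4 correct        70       |     10
# ChatGPT-4 wrong           4       |     14
table = [[70, 10],
         [4, 14]]

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"McNemar p-value = {result.pvalue:.3f}")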
Conclusion: ChatGPT-4 and Llama2 show substantial potential as clinical decision-support tools in otolaryngology, with ChatGPT-4 performing better. However, the occasional non-relevant recommendations indicate that further refinement and human oversight are needed to ensure safe application in clinical practice.
Keywords: AI; Artificial intelligence; ChatGPT-4; Clinical decision making; Llama2.
© 2025. The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
Conflict of interest statement
Declarations. Research involving human participants and/or animals: Human participants. Informed consent: Obtained from all patients. Prior presentation: This work has not been previously presented at any meeting or conference. Conflicts of interest: The authors declare no conflicts of interest. The author Jerome R. Lechien was not involved in the peer review process of this article.
Similar articles
- Validity and reliability of an instrument evaluating the performance of intelligent chatbot: the Artificial Intelligence Performance Instrument (AIPI). Eur Arch Otorhinolaryngol. 2024 Apr;281(4):2063-2079. doi: 10.1007/s00405-023-08219-y. Epub 2023 Sep 12. PMID: 37698703
- Comparative Analysis of Diagnostic Performance: Differential Diagnosis Lists by LLaMA3 Versus LLaMA2 for Case Reports. JMIR Form Res. 2024 Nov 19;8:e64844. doi: 10.2196/64844. PMID: 39561356. Free PMC article.
- Evaluating advanced AI reasoning models: ChatGPT-4.0 and DeepSeek-R1 diagnostic performance in otolaryngology: a comparative analysis. Am J Otolaryngol. 2025 Jul-Aug;46(4):104667. doi: 10.1016/j.amjoto.2025.104667. Epub 2025 May 10. PMID: 40367837
- Applications of ChatGPT in Otolaryngology-Head Neck Surgery: A State of the Art Review. Otolaryngol Head Neck Surg. 2024 Sep;171(3):667-677. doi: 10.1002/ohn.807. Epub 2024 May 8. PMID: 38716790. Review.
- Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations. Surg Obes Relat Dis. 2024 Jul;20(7):603-608. doi: 10.1016/j.soard.2024.03.011. Epub 2024 Mar 24. PMID: 38644078. Review.