Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot
- PMID: 40921907
- DOI: 10.1007/s11657-025-01587-4
Abstract
This study assesses how well AI models answer guideline-based clinical questions on postmenopausal osteoporosis. ChatGPT-4o produced the most appropriate responses, highlighting the potential of AI to support clinical decision-making and improve patient care in osteoporosis management.
Purpose: The rise of artificial intelligence (AI) offers the potential for assisting clinical decisions. This study aims to assess the accuracy of various artificial intelligence models in providing recommendations for the diagnosis and treatment of postmenopausal osteoporosis.
Methods: Using questions from the 2020 American Association of Clinical Endocrinologists (AACE) guidelines for osteoporosis, AI models including ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Gemini, Gemini Advanced, and Copilot were prompted. Responses were classified as accurate if they did not contradict the clinical guidelines. Two additional categories, over-conclusive and insufficient, were created to further evaluate responses. Over-conclusive was designated if AI models provided recommendations not specified in the guidelines, while insufficient indicated a failure to provide relevant information included in the guidelines. Chi-square tests were employed to compare categorical outcomes among different AI models.
Results: A total of 42 clinical questions were evaluated. ChatGPT-4o achieved an accuracy of 88%, ChatGPT-3.5 57.1%, ChatGPT-4.0 64.3%, Gemini 45.2%, Gemini Advanced 57.1%, and Copilot 47.6% (p < 0.001).
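As a rough illustration of the chi-square comparison named in the Methods, the sketch below rebuilds an accurate versus not-accurate contingency table from the reported percentages and computes the Pearson statistic by hand. The counts are an assumption: the abstract gives only percentages, so each count is taken as the rounded value of percentage × 42. The abstract also does not say whether the test was run on two categories or on all three (accurate, over-conclusive, insufficient), so this two-category version is not necessarily the authors' exact analysis.

```python
# Sketch only: counts are reconstructed from the reported percentages
# (assumption: count = round(pct / 100 * 42)); raw counts are not in the abstract.
N_QUESTIONS = 42

reported_pct = {
    "ChatGPT-4o": 88.0,
    "ChatGPT-3.5": 57.1,
    "ChatGPT-4.0": 64.3,
    "Gemini": 45.2,
    "Gemini Advanced": 57.1,
    "Copilot": 47.6,
}


def build_table(pcts, n):
    """Return one [accurate, not_accurate] row per model."""
    rows = []
    for pct in pcts.values():
        accurate = round(pct / 100 * n)
        rows.append([accurate, n - accurate])
    return rows


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


table = build_table(reported_pct, N_QUESTIONS)
stat, dof = chi_square_statistic(table)

# Tabulated critical value of the chi-square distribution for
# 5 degrees of freedom at alpha = 0.001.
CRITICAL_0_001_DF5 = 20.515
significant = stat > CRITICAL_0_001_DF5

print(f"chi-square = {stat:.2f}, dof = {dof}, p < 0.001: {significant}")
```

Under these reconstructed counts the statistic comes to roughly 20.9 on 5 degrees of freedom, which exceeds the 20.515 critical value and so is consistent with the reported p < 0.001.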
Conclusions: Response accuracy varied significantly across the AI models, with ChatGPT-4o demonstrating the highest accuracy. Further research is necessary to explore the broader applicability of AI in medical domains.
Keywords: Artificial intelligence; ChatGPT; Postmenopausal osteoporosis.
© 2025. The Author(s), under exclusive licence to the International Osteoporosis Foundation and the Bone Health and Osteoporosis Foundation.
Conflict of interest statement
Declarations. Conflicts of interest: None.