Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot

Chun-Ru Lin^#¹, Yi-Jun Chen^#², Po-An Tsai³, Wen-Yuan Hsieh⁴ ⁵, Sung Huang Laurent Tsai¹ ⁶ ⁷ ⁸ ⁹, Tsai-Sheng Fu¹, Po-Liang Lai¹, Jau-Yuan Chen¹⁰ ¹¹

Affiliations

¹ Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Linkou Branch, No. 5, Fuxing Street, Guishan District, Taoyuan City, 333, Taiwan.
² Department of Anesthesiology, Chang Gung Memorial Hospital, Taoyuan City, Taiwan.
³ Department of Internal Medicine, Chang Gung Memorial Hospital Keelung Branch, Keelung, Taiwan.
⁴ Department of Medical Education, Chang Gung Memorial Hospital, Linkou Branch, No. 5, Fuxing Street, Guishan District, Taoyuan City, 333, Taiwan.
⁵ Department of Emergency Medicine, Dalin Tzu Chi Hospital, Buddhist Tzu Chi Medical Foundation, Chiayi, Taiwan.
⁶ Department of Orthopedics, Taipei Medical University Hospital, Taipei, Taiwan.
⁷ Department of Orthopaedics, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan.
⁸ Department of Biomedical Engineering, National Taiwan University, Taipei, Taiwan.
⁹ Graduate Institute of Clinical Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan.
¹⁰ Department of Family Medicine, Chang-Gung Memorial Hospital, Linkou Branch, Taoyuan City, Taiwan. welins@cgmh.org.tw.
¹¹ College of Medicine, Chang-Gung University, Taoyuan City, Taiwan. welins@cgmh.org.tw.

^# Contributed equally.

PMID: 40921907
DOI: 10.1007/s11657-025-01587-4

Comparative Study

Chun-Ru Lin et al. Arch Osteoporos. 2025 Sep 8;20(1):120. doi: 10.1007/s11657-025-01587-4.

Abstract

This study assesses how well AI models answer guideline-based clinical questions on postmenopausal osteoporosis. ChatGPT-4o produced the most appropriate responses, highlighting the potential of AI to support clinical decision-making and improve patient care in osteoporosis management.

Purpose: The rise of artificial intelligence (AI) offers the potential to assist clinical decision-making. This study aims to assess the accuracy of several AI models in providing recommendations for the diagnosis and treatment of postmenopausal osteoporosis.

Methods: Questions drawn from the 2020 American Association of Clinical Endocrinologists (AACE) guidelines for osteoporosis were used to prompt six AI models: ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Gemini, Gemini Advanced, and Copilot. Responses were classified as accurate if they did not contradict the clinical guidelines. Two additional categories, over-conclusive and insufficient, were created to further evaluate responses. A response was designated over-conclusive if the model provided recommendations not specified in the guidelines, and insufficient if it failed to provide relevant information included in the guidelines. Chi-square tests were used to compare categorical outcomes among the AI models.

Results: A total of 42 clinical questions were evaluated. ChatGPT-4o achieved an accuracy of 88%, ChatGPT-3.5 57.1%, ChatGPT-4.0 64.3%, Gemini 45.2%, Gemini Advanced 57.1%, and Copilot 47.6% (p < 0.001).
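As a rough illustration of the chi-square comparison, the reported percentages can be converted back to whole-number counts out of 42 and tested as a 6 x 2 table (accurate versus not accurate). This is only a sketch under stated assumptions: the counts are reconstructed from rounded percentages, the two-category split is a simplification that ignores the over-conclusive and insufficient categories, and the helper names are hypothetical. It is not the authors' analysis and may not match the test behind the reported p-value.

```python
# Reported accuracy (%) per model, taken from the Results paragraph.
N_QUESTIONS = 42
reported_accuracy_pct = {
    "ChatGPT-4o": 88.0,
    "ChatGPT-3.5": 57.1,
    "ChatGPT-4.0": 64.3,
    "Gemini": 45.2,
    "Gemini Advanced": 57.1,
    "Copilot": 47.6,
}


def implied_counts(pct_by_model, n):
    """Assumption: each percentage is a rounded count out of n questions."""
    return {model: round(pct / 100 * n) for model, pct in pct_by_model.items()}


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


accurate = implied_counts(reported_accuracy_pct, N_QUESTIONS)
# One row per model: [accurate, not accurate] (simplified two-category split).
table = [[k, N_QUESTIONS - k] for k in accurate.values()]
chi2, dof = chi_square_independence(table)

# Standard table value: chi-square critical point for alpha = 0.001, df = 5.
CRITICAL_0_001_DF5 = 20.515

for model, k in accurate.items():
    print(f"{model}: {k}/{N_QUESTIONS} accurate")
print(f"chi-square = {chi2:.2f}, df = {dof}")
print("exceeds 0.001 critical value:", chi2 > CRITICAL_0_001_DF5)
```

Under these assumptions the statistic falls just above the 0.001 critical value for 5 degrees of freedom, which is in line with the reported p < 0.001. A library routine such as SciPy's `chi2_contingency` would return an exact p-value for the same table.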

Conclusions: The study reveals significant variation in response accuracy across AI models, with ChatGPT-4o demonstrating the highest accuracy. Further research is necessary to explore the broader applicability of AI in other medical domains.

Keywords: Artificial intelligence; ChatGPT; Postmenopausal osteoporosis.

Conflict of interest statement

Declarations. Conflicts of interest: None.
