Diabetes Care. 2025 Feb 1;48(2):185-192.

doi: 10.2337/dc24-1067.

Large Language Model GPT-4 Compared to Endocrinologist Responses on Initial Choice of Glucose-Lowering Medication Under Conditions of Clinical Uncertainty

James H Flory¹, Jessica S Ancker², Scott Y H Kim³, Gilad Kuperman¹, Aleksandr Petrov¹, Andrew Vickers¹

Affiliations

¹ Memorial Sloan Kettering Cancer Center, New York, NY.
² Vanderbilt University Medical Center, Nashville, TN.
³ National Institutes of Health, Bethesda, MD.

PMID: 39250109
PMCID: PMC11770168
DOI: 10.2337/dc24-1067

James H Flory et al. Diabetes Care. 2025.

Abstract

Objective: To explore how the commercially available large language model (LLM) GPT-4 compares to endocrinologists in answering medical questions for which the best answer is uncertain.

Research design and methods: This study compared responses from GPT-4 to responses from 31 endocrinologists using hypothetical clinical vignettes focused on diabetes, specifically examining the prescription of metformin versus alternative treatments. The primary outcome was the choice between metformin and other treatments.

Results: With a simple prompt, GPT-4 chose metformin in 12% (95% CI 7.9-17%) of responses, compared with 31% (95% CI 23-39%) of endocrinologist responses. After modifying the prompt to encourage metformin use, the selection of metformin by GPT-4 increased to 25% (95% CI 22-28%). GPT-4 rarely selected metformin in patients with impaired kidney function, or a history of gastrointestinal distress (2.9% of responses, 95% CI 1.4-5.5%). In contrast, endocrinologists often prescribed metformin even in patients with a history of gastrointestinal distress (21% of responses, 95% CI 12-36%). GPT-4 responses showed low variability on repeated runs except at intermediate levels of kidney function.
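
The results above are reported as proportions with 95% confidence intervals. As a rough illustration of how such an interval can be computed, the sketch below uses the Wilson score method. The abstract does not state which interval method or which response counts were used, and the study may have accounted for repeated responses per vignette or respondent. The function name `wilson_interval` and the counts in the example are therefore hypothetical and are not taken from the study.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    This ignores any clustering of responses, which is an assumption of
    this sketch and not a statement about the study's analysis.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")

    p_hat = successes / total
    z2_over_n = z * z / total
    denom = 1.0 + z2_over_n
    center = (p_hat + z2_over_n / 2.0) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total)
    )
    return center - half_width, center + half_width


# Hypothetical counts for illustration only (not study data):
# 12 metformin selections out of 100 responses.
lower, upper = wilson_interval(12, 100)
print(f"{lower:.3f} to {upper:.3f}")
```

The Wilson interval is used here because it behaves better than the simple normal approximation for proportions near 0, such as the 2.9% figure above; other methods would give somewhat different bounds.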

Conclusions: In clinical scenarios with no single right answer, GPT-4's responses were reasonable, but differed from endocrinologists' responses in clinically important ways. Value judgments are needed to determine when these differences should be addressed by adjusting the model. We recommend against reliance on LLM output until it is shown to align not just with clinical guidelines but also with patient and clinician preferences, or it demonstrates improvement in clinical outcomes over standard of care.

Conflict of interest statement

Duality of Interest. No potential conflicts of interest relevant to this article were reported.

Figures

**Figure 1**
Prompt structure and example prompt and response. Italicized text denotes nudge toward metformin use that was included in the primary prompt.

**Figure 2**
Rate of metformin prescribing by eGFR and respondent type. Solid line represents endocrinologist responses; dashed line is original GPT-4 prompt; dash-dotted line is GPT-4 prompt with default to metformin.

**Figure 3**
Univariable associations between vignette characteristics and metformin selection, by respondent type. Open dots denote GPT-4 response; solid dots denote endocrinologist response. Lines denote 95% CIs. ORs for age and eGFR represent the association with a 10-year or 10 mL/min/1.73 m² increase in the value of that parameter, respectively. Estimates are also given in Supplementary Table 6.
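
The Figure 3 caption reports odds ratios per 10-unit increase in age or eGFR. In a logistic regression, an odds ratio for a larger increment is the per-unit odds ratio raised to the power of that increment. The sketch below shows that conversion; the function name `odds_ratio_per_increment` and the coefficient value are hypothetical, and nothing here reproduces the study's estimates.

```python
import math


def odds_ratio_per_increment(beta_per_unit: float, increment: float = 10.0) -> float:
    """Convert a per-unit log-odds coefficient into an odds ratio
    for an `increment`-unit increase in the predictor."""
    return math.exp(beta_per_unit * increment)


# Hypothetical coefficient for illustration only (not a study estimate):
# a per-unit odds ratio of 1.02 corresponds to a log-odds of ln(1.02).
beta = math.log(1.02)
or_per_10 = odds_ratio_per_increment(beta)
print(round(or_per_10, 3))
```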

Comment in

Large Language Models in Diabetes Management: The Need for Human and Artificial Intelligence Collaboration.
Pavon JM, Schlientz D, Maciejewski ML, Economou-Zavlanos N, Lee RH. Diabetes Care. 2025 Feb 1;48(2):182-184. doi: 10.2337/dci24-0079. PMID: 39841968.
