Accuracy and Reliability of Chatbot Responses to Physician Questions

Rachel S Goodman et al. JAMA Netw Open. 2023 Oct 2;6(10):e2336483. doi: 10.1001/jamanetworkopen.2023.36483.
Abstract

Importance: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency.

Objective: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence-generated medical information.

Design, setting, and participants: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023.
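As a non-authoritative illustration of the comparisons described above (not the authors' analysis code), a minimal Python sketch using SciPy with invented Likert accuracy scores might look like this:

```python
# Hypothetical sketch of the statistical comparisons described above.
# All score arrays are invented for illustration; they are not study data.
from scipy.stats import kruskal, mannwhitneyu

# 6-point Likert accuracy scores, grouped by physician-assigned difficulty
easy   = [6, 5, 6, 4, 6, 5]
medium = [5, 6, 4, 5, 6, 3]
hard   = [4, 5, 6, 3, 5, 4]

# Kruskal-Wallis test across the three difficulty groups
h_stat, p_difficulty = kruskal(easy, medium, hard)

# Mann-Whitney U test comparing binary vs descriptive question scores
binary      = [6, 5, 6, 4, 6]
descriptive = [5, 4, 6, 3, 5]
u_stat, p_format = mannwhitneyu(binary, descriptive, alternative="two-sided")

print(f"Kruskal-Wallis P = {p_difficulty:.3f}; Mann-Whitney U P = {p_format:.3f}")
```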

Main outcomes and measures: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses.

Results: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores, 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score, 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), was regenerated and rescored using version 4 with improvement (mean [SD] accuracy score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002).
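For readers reproducing this reporting format (median with IQR, mean with SD) on their own data, a minimal NumPy sketch, with scores invented for illustration, might be:

```python
# Hypothetical sketch of the descriptive summaries reported above
# (median with IQR, mean with SD); the scores below are invented.
import numpy as np

scores = np.array([6, 5, 6, 4, 3, 6, 5, 2, 6, 5], dtype=float)

median = np.median(scores)
q1, q3 = np.percentile(scores, [25, 75])   # IQR bounds
mean   = scores.mean()
sd     = scores.std(ddof=1)                # sample standard deviation

print(f"median {median:.1f} (IQR, {q1:.1f}-{q3:.1f}); mean (SD) {mean:.1f} ({sd:.1f})")
```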

Conclusions and relevance: In this cross-sectional study, the chatbot generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and to validate these findings.


Conflict of interest statement

Conflict of Interest Disclosures: Ms Goodman reported receiving research support from the SCRIPS Foundation and the Burroughs Wellcome Fund. Dr Stone reported receiving grants from the American Academy of Allergy, Asthma, and Immunology Foundation Faculty Development Award during the conduct of the study. Dr Finn reported being on the advisory boards of Eyepoint, Apellis, Allergan, Alimera, and Iveric bio and consulting for and being on the advisory board of Genentech outside the submitted work. Dr Horst reported receiving personal fees from AbbVie, Takeda, BMS, and Janssen outside the submitted work. Dr Agarwal reported receiving personal fees from the American Society of Clinical Oncology (ASCO) (honoraria for faculty speaker at ASCO Advantage Course), the National Comprehensive Cancer Network (NCCN) (honoraria for faculty speaker at NCCN annual conference), the Great Debates and Updates in Gastrointestinal Malignancies (honoraria for faculty speaker), and OncLive (honoraria for faculty speaker) outside the submitted work. Dr Osmundson reported receiving grants from the National Institutes of Health (NIH) unrelated to the present study and grants from Varian Medical Systems outside the submitted work. Dr Chambless reported receiving consulting fees from Integra outside the submitted work. Dr Osterman reported receiving grants from Microsoft, IBM, and GE Healthcare outside the submitted work. Dr Wheless reported receiving grants from the Department of Veterans Affairs during the conduct of the study. Dr Johnson reported receiving grants from BMS and Incyte outside the submitted work and being on the advisory boards of BMS, Catalyst, Merck, Iovance, Novartis, and Pfizer. No other disclosures were reported.

Figures

Figure 1. Methods
AI indicates artificial intelligence. aD.B.J. and L.E.W. scored 2 separate data sets of melanoma and immunotherapy and common conditions questions. bRegenerated answers were created 8 to 17 days after initial answers. cRegenerated answers were created 90 days after initial answers.
Figure 2. Accuracy of Chatbot-Generated Answers
Accuracy of artificial intelligence answers from multispecialty questions (A-C [P < .01 for panel C]) or all questions (multispecialty, melanoma and immunotherapy, and common medical conditions; D-F [P = .03 for panel E]). A, Among all descriptive questions in the multispecialty analysis, median accuracy scores were 5.0 (IQR, 3.0-6.0) (mean [SD] score, 4.9 [1.5]) for easy, 5.0 (IQR, 3.0-6.0) (mean [SD] score, 4.4 [1.9]) for medium, and 5.0 (IQR, 3.0-6.0) (mean [SD] score, 4.1 [1.8]) for hard questions (P = .70 determined by the Kruskal-Wallis test). B, Among all binary questions in the multispecialty analysis, median accuracy scores were 6.0 (IQR, 5.0-6.0) (mean [SD] score, 4.9 [1.8]) for easy, 4.0 (IQR, 3.0-6.0) (mean [SD] score, 4.3 [1.6]) for medium, and 5.0 (IQR, 1.0-6.0) (mean [SD] score, 4.2 [1.8]) for hard answers (P = .10 determined by the Kruskal-Wallis test). C, Of 36 questions with accuracy scores of 2 or lower, 34 were requeried or regraded 8 to 17 days later. The median accuracy score for original questions was 2.0 (IQR, 1.0-2.0) (mean [SD] score, 1.6 [0.5]) compared with 4.0 (IQR, 2.0-5.3) (mean [SD] score, 3.9 [1.8]) for rescored answers (P < .01 determined by the Wilcoxon signed rank test). D, Among all descriptive questions, median accuracy scores were 5.3 (IQR, 3.0-6.0) (mean [SD] score, 4.8 [1.5]) for easy, 5.5 (IQR, 3.3-6.0) (mean [SD] score, 4.7 [1.7]) for medium, and 5.0 (IQR, 3.6-6.0) (mean [SD] score, 4.5 [1.6]) for hard questions (P = .40 determined by the Kruskal-Wallis test). E, Among all binary questions, median accuracy scores were 6.0 (IQR, 5.0-6.0) (mean [SD] score, 5.3 [1.5]) for easy, 5.5 (IQR, 3.4-6.0) (mean [SD] score, 4.6 [1.6]) for medium, and 5.5 (IQR, 4.0-6.0) (mean [SD] score, 4.8 [1.6]) for hard questions, a significant difference among groups (P = .03 determined by the Kruskal-Wallis test). F, Median accuracy scores were 5.0 (IQR, 3.4-6.0) (mean [SD] score, 4.7 [1.6]) for all descriptive questions and 6.0 (IQR, 4.0-6.0) (mean [SD] score, 4.9 [1.6]) for all binary questions (P = .07 determined by the Mann-Whitney U test).
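As a hedged sketch of the paired original-versus-requeried comparison in panel C (not the study code; all scores invented), the Wilcoxon signed rank test could be run as:

```python
# Hypothetical sketch of the paired comparison in panel C: original vs
# requeried accuracy scores for the same questions (values invented).
from scipy.stats import wilcoxon

original  = [1, 2, 2, 1, 2, 1, 2, 2]
requeried = [3, 4, 2, 5, 4, 2, 6, 3]

stat, p_value = wilcoxon(original, requeried)
print(f"Wilcoxon signed rank P = {p_value:.3f}")
```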

Comment in

  • doi: 10.1001/jamanetworkopen.2023.35924
