Comparative Study

J Med Internet Res. 2024 Nov 4;26:e60291.

doi: 10.2196/60291.

Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study

Jonathan Yi-Shin Yau¹, Soheil Saadat², Edmund Hsu², Linda Suk-Ling Murphy³, Jennifer S Roh⁴, Jeffrey Suchard², Antonio Tapia², Warren Wiechmann², Mark I Langdorf²

Affiliations

¹ College of Natural and Agricultural Sciences, University of California - Riverside, Riverside, CA, United States.
² Department of Emergency Medicine, University of California - Irvine, Orange, CA, United States.
³ Reference Department, University of California - Irvine Libraries, Irvine, CA, United States.
⁴ Department of Emergency Medicine, Harbor-UCLA Medical Center, University of California - Los Angeles, Torrance, CA, United States.

PMID: 39496149
PMCID: PMC11574488
DOI: 10.2196/60291


Abstract

Background: Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice.

Objective: We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions.

Methods: We created 10 emergency care questions that we fed into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty on 8 domains: percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to each of the 10 questions from reputable and scholarly emergency medical references; these were compiled by an EM resident physician. To assess the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response, taken from the readability statistics embedded in Microsoft Word. Differences between chatbots were determined by the chi-square test.
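As an illustration of the readability measure named above, the following is a minimal sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The study took its scores from Microsoft Word; the function names `count_syllables` and `flesch_kincaid_grade`, the sentence splitter, and the vowel-group syllable heuristic here are assumptions made for illustration, so results may differ from Word's.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups (a heuristic, not Word's method)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" as non-syllabic when other vowel groups exist.
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level using the standard published coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * (len(words) / len(sentences))
        + 11.8 * (syllables / len(words))
        - 15.59
    )


if __name__ == "__main__":
    # Hypothetical sample text, not taken from any chatbot response in the study.
    sample = "Call for help right away. Do not move the person."
    print(round(flesch_kincaid_grade(sample), 2))
```

Longer sentences and longer words both raise the grade level, which is why dense clinical phrasing tends to score above the reading level of many patients.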

Results: The 4 chatbots' responses to the 10 clinical questions were each scored across 8 domains by 5 EM faculty, yielding 400 assessments for each chatbot. Together, the 4 chatbots had the best performance in clarity and understandability (both 85%), intermediate performance in accuracy and completeness (both 50%), and poor performance (10%) for source relevance and reliability (mostly unreported). Dangerous information appeared in 5% to 35% of chatbot responses, with no statistically significant difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claude AI had similar performances across 6 out of 8 domains. Only Bing AI performed better with more identified or relevant sources (40%; the others had 0%-10%). The Flesch-Kincaid Grade Level was 7.7-8.9 for all chatbots except ChatGPT (10.8); all of these levels were too advanced for average emergency patients. Responses included both dangerous (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise).
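To make the between-chatbot comparison concrete, here is a minimal sketch of a Pearson chi-square statistic on a chatbots-by-outcome contingency table. The abstract does not describe how the authors set up their test, so the table layout is an assumption, the function name `chi_square_statistic` is hypothetical, and the counts in the usage example are invented placeholders rather than study data.

```python
from typing import List, Tuple


def chi_square_statistic(table: List[List[float]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


if __name__ == "__main__":
    # HYPOTHETICAL counts for illustration only (not from the study):
    # rows = 4 chatbots, columns = [flagged as dangerous, not flagged].
    hypothetical = [[2, 8], [1, 9], [3, 7], [2, 8]]
    stat, dof = chi_square_statistic(hypothetical)
    print(f"chi-square = {stat:.3f}, df = {dof}")
```

Converting the statistic to a P value requires the chi-square distribution with the given degrees of freedom; a statistics package such as SciPy (`scipy.stats.chi2_contingency`) handles that step. With expected counts this small, an exact test may be more appropriate than the chi-square approximation.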

Conclusions: AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance. Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. We strongly recommend proper medical consultation to prevent potential adverse outcomes.

Keywords: AI; artificial intelligence; chatbot; chatbots; consumer health information; emergency care information; generative AI; health care; literacy; medical consultation; misinformation; natural language processing; patient education.

©Jonathan Yi-Shin Yau, Soheil Saadat, Edmund Hsu, Linda Suk-Ling Murphy, Jennifer S Roh, Jeffrey Suchard, Antonio Tapia, Warren Wiechmann, Mark I Langdorf. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 04.11.2024.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
Performance of all 4 chatbots in aggregate across 8 different domains, showing the point estimate of the prevalence of the highest or best score in that domain, with the 95% CI.

**Figure 2**
Comparison of the 4 chatbots’ performance against each other in the 8 domains.

**Figure 3**
Comparison of condition-specific performance across different domains.

**Figure 4**
Comparison of safety and dangerousness across different medical conditions.

**Figure 5**
Comparison of factual accuracy across different medical conditions.

**Figure 6**
Comparison between 4 chatbots for the domain of source reliability.

**Figure 7**
Comparison of mean and 95% CI (error bars) for reading level per Microsoft Word Flesch-Kincaid Grade Level (FKGL) score for 4 chatbots.
