"Doctor ChatGPT, Can You Help Me?" The Patient's Perspective: Cross-Sectional Study

Jonas Armbruster et al. J Med Internet Res. 2024 Oct 1;26:e58831. doi: 10.2196/58831.
Abstract

Background: Artificial intelligence and the language models derived from it, such as ChatGPT, offer immense possibilities, particularly in the field of medicine. It is already evident that ChatGPT can provide adequate and, in some cases, expert-level responses to health-related queries and advice for patients. However, it is currently unknown how patients perceive these capabilities, whether they can derive benefit from them, and whether they can detect potential risks such as harmful suggestions.

Objective: This study aims to clarify whether patients can obtain useful and safe health care advice from an artificial intelligence chatbot assistant.

Methods: This cross-sectional study was conducted using 100 publicly available health-related questions from 5 medical specialties (trauma, general surgery, otolaryngology, pediatrics, and internal medicine), drawn from a web-based platform for patients. Responses generated by ChatGPT-4.0 and preexisting responses from an expert panel (EP) of experienced physicians on the same platform were compiled into 10 sets of 10 questions each. Patients performed a blinded evaluation of empathy and usefulness (assessed through the question "Would this answer have helped you?") on a scale from 1 to 5. As a control, the evaluation was also performed by 3 physicians in each respective medical specialty, who were additionally asked to rate each response's correctness and potential for harm.
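
For illustration, the generation and blinding steps could be scripted along the following lines (Python). This is a minimal sketch rather than the authors' published code: the use of the OpenAI Python client, the model identifier "gpt-4" (a stand-in for ChatGPT-4.0 as named above), and the data structures for the questions and preexisting EP answers are all assumptions.

import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(question_text):
    """Submit a patient question verbatim and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for "ChatGPT-4.0" as named in the study
        messages=[{"role": "user", "content": question_text}],
    )
    return response.choices[0].message.content

def build_blinded_sets(questions, ep_answers, n_sets=10, per_set=10):
    """Pair each question with its EP and ChatGPT answers in random order,
    hide the source labels, and split the questions into fixed-size sets
    for blinded rating."""
    entries = []
    for q in questions:  # q is a hypothetical dict like {"id": 1, "text": "..."}
        answers = [("EP", ep_answers[q["id"]]),
                   ("ChatGPT", generate_answer(q["text"]))]
        random.shuffle(answers)  # randomize answer order per question
        entries.append({
            "question": q["text"],
            "answers": [text for _, text in answers],  # shown to raters
            "sources": [src for src, _ in answers],    # kept for unblinding
        })
    random.shuffle(entries)
    return [entries[i * per_set:(i + 1) * per_set] for i in range(n_sets)]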

Results: In total, 200 sets of questions were submitted by 64 patients (mean age 45.7, SD 15.9 years; 29/64, 45.3% male), yielding 2000 evaluated answers each for ChatGPT and the EP. ChatGPT scored higher in terms of empathy (4.18 vs 2.7; P<.001) and usefulness (4.04 vs 2.98; P<.001). Subanalysis revealed a small bias, with women giving higher empathy ratings than men (4.46 vs 4.14; P=.049). Ratings of ChatGPT were high regardless of participant age. The same highly significant results were observed in the evaluations by the respective specialist physicians, and ChatGPT significantly outperformed the EP in correctness (4.51 vs 3.55; P<.001). For potentially harmful responses from ChatGPT, specialists rated usefulness (3.93 vs 4.59) and correctness (4.62 vs 3.84) significantly lower (P<.001); this was not the case among patients.
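
The group comparisons and the age correlation reported above can be illustrated with standard tools, as in the following sketch. The abstract does not name the statistical tests used, so the Mann-Whitney U test (a common choice for ordinal 1-to-5 ratings) is an assumption here, and the rating vectors below are simulated placeholders rather than study data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder rating vectors: one empathy score per evaluated answer
# (2000 each, matching the counts reported above).
chatgpt_empathy = rng.integers(3, 6, size=2000).astype(float)
ep_empathy = rng.integers(1, 5, size=2000).astype(float)

# Nonparametric two-sample comparison of the ordinal 1-5 ratings.
u_stat, p_value = stats.mannwhitneyu(chatgpt_empathy, ep_empathy,
                                     alternative="two-sided")
print(f"ChatGPT mean {chatgpt_empathy.mean():.2f} vs "
      f"EP mean {ep_empathy.mean():.2f}, P = {p_value:.3g}")

# Age relationship, analogous to the Pearson correlations in Figure 7.
ages = rng.normal(45.7, 15.9, size=2000)  # cohort mean and SD from above
r, p_age = stats.pearsonr(ages, chatgpt_empathy)
print(f"Pearson r = {r:.3f}, P = {p_age:.3g}")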

Conclusions: The results indicate that ChatGPT is capable of supporting patients in health-related queries better than physicians, at least in terms of written advice through a web-based platform. In this study, ChatGPT's responses had a lower percentage of potentially harmful advice than the web-based EP. However, it is crucial to note that this finding is based on a specific study design and may not generalize to all health care settings. Alarmingly, patients are not able to independently recognize these potential dangers.

Keywords: AI; ChatGPT; LLM; artificial intelligence; chatbot; chatbots; empathy; large language models; patient education; patient information; patient perceptions.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Study workflow. (A) Identification of 100 patient questions, 20 questions per specialty. (B + C) Collection of existing responses from a web-based EP (B) and generation of new responses from ChatGPT (C). (D) Building a database of anonymized questions and responses. (E + F) Assembly of specialty-specific packages for physicians (E) and mixed packages for patients (F). (G + H) Data collection: patients rated responses for empathy and usefulness, while physicians provided feedback encompassing empathy, usefulness, correctness, and potential harm. ENT: otolaryngology; EP: expert panel; GS: general surgery; Internal: internal medicine; Ped: pediatrics; trauma: traumatology.
Figure 2
Rating of ChatGPT versus EP by specialists in their respective fields—combined specialties. (A) Empathy. (B) Usefulness. (C) Correctness. (D) Potential harm. EP: expert panel.
Figure 3
Rating of ChatGPT by specialists in their respective fields—specialties separated. (A) Empathy. (B) Usefulness. (C) Correctness. (D) Potential harm. P values of the Bonferroni post hoc test were >0.99 each, except empathy ENT versus Internal (P=.826). ENT: otolaryngology; GS: general surgery; Internal: internal medicine; NS: not significant; Ped: pediatrics; trauma: traumatology.
Figure 4
Rating of ChatGPT versus EP by patients—combined specialties. (A) Empathy. (B) Usefulness. EP: expert panel.
Figure 5
Rating of ChatGPT by patients—specialties separated. (A) Empathy. (B) Usefulness. P values of the Bonferroni post hoc test were >0.99 each. ENT: otolaryngology; GS: general surgery; Internal: internal medicine; NS: not significant; Ped: pediatrics; trauma: traumatology.
Figure 6
Rating of ChatGPT by patients—gender separated. (A) Empathy. (B) Usefulness.
Figure 7
Rating of ChatGPT by patients in correlation with age. (A) Empathy, Pearson correlation: –0.067. (B) Usefulness, Pearson correlation: –0.109.
Figure 8
Rating of ChatGPT by physicians and patients—potentially harmful and nonharmful advice separated. (A) Empathy—patients. (B) Usefulness—patients. (C) Empathy—physicians. (D) Usefulness—physicians. (E) Correctness—physicians. Δ indicates differences of means.
