Doctor Versus Artificial Intelligence: Patient and Physician Evaluation of Large Language Model Responses to Rheumatology Patient Questions in a Cross-Sectional Study
- PMID: 37902018
- DOI: 10.1002/art.42737
Abstract
Objective: To assess the quality of large language model (LLM) chatbot-generated versus physician-generated responses to patient-generated rheumatology questions.
Methods: We conducted a single-center cross-sectional survey of rheumatology patients (n = 17) in Edmonton, Alberta, Canada. Patients rated LLM chatbot-generated and physician-generated responses for comprehensiveness and readability, and four rheumatologists rated the responses for comprehensiveness, readability, and accuracy, all on a Likert scale from 1 to 10 (1 being poor, 10 being excellent).
Results: Patient ratings did not differ significantly between artificial intelligence (AI)-generated and physician-generated responses for comprehensiveness (mean ± SD 7.12 ± 0.99 vs 7.52 ± 1.16; P = 0.1962) or readability (7.90 ± 0.90 vs 7.80 ± 0.75; P = 0.5905). Rheumatologists rated AI responses significantly lower than physician responses on comprehensiveness (AI 5.52 ± 2.13 vs physician 8.76 ± 1.07; P < 0.0001), readability (AI 7.85 ± 0.92 vs physician 8.75 ± 0.57; P = 0.0003), and accuracy (AI 6.48 ± 2.07 vs physician 9.08 ± 0.64; P < 0.0001). The proportion of preference for AI-generated over physician-generated responses was 0.45 ± 0.18 among patients and 0.15 ± 0.08 among physicians (P = 0.0106). After learning that one answer for each question was AI generated, patients correctly identified the AI-generated answers at a lower proportion than physicians did (0.49 ± 0.26 vs 0.97 ± 0.04; P = 0.0183). The average word count was 69.10 ± 25.35 words for AI answers, compared with 98.83 ± 34.58 words for physician-generated responses (P = 0.0008).
Conclusion: Rheumatology patients rated AI-generated responses to patient questions similarly to physician-generated responses in terms of comprehensiveness, readability, and overall preference. However, rheumatologists rated AI responses significantly poorer than physician-generated responses, suggesting that LLM chatbot responses are inferior to physician responses, a difference that patients may not be aware of.
© 2023 The Authors. Arthritis & Rheumatology published by Wiley Periodicals LLC on behalf of American College of Rheumatology.
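To put the reported rating gaps on a common scale, the means and SDs in the Results can be converted into a standardized mean difference. The sketch below is illustrative only: the abstract does not report effect sizes or name the statistical test used, so the helper `cohens_d` is a hypothetical name, and the calculation assumes equal group sizes and treats the two sets of ratings as independent, which may not match the study's actual (possibly paired) design. It does not reproduce the reported P values.

```python
import math


def cohens_d(mean_ai, sd_ai, mean_md, sd_md):
    """Standardized difference (physician minus AI).

    Assumption: equal group sizes and independent groups, so the pooled SD
    is the root mean square of the two SDs. This is not stated in the abstract.
    """
    pooled_sd = math.sqrt((sd_ai ** 2 + sd_md ** 2) / 2)
    return (mean_md - mean_ai) / pooled_sd


# Means and SDs copied from the Results: (AI mean, AI SD, physician mean, physician SD).
reported = {
    "patients, comprehensiveness": (7.12, 0.99, 7.52, 1.16),
    "patients, readability": (7.90, 0.90, 7.80, 0.75),
    "rheumatologists, comprehensiveness": (5.52, 2.13, 8.76, 1.07),
    "rheumatologists, readability": (7.85, 0.92, 8.75, 0.57),
    "rheumatologists, accuracy": (6.48, 2.07, 9.08, 0.64),
}

if __name__ == "__main__":
    for label, stats in reported.items():
        print(f"{label}: d = {cohens_d(*stats):.2f}")
```

Positive values indicate higher ratings for physician responses. Because the inputs are rounded summary statistics, the output should be read as a rough approximation, not as a reanalysis of the study data.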