Arthritis Rheumatol. 2024 Mar;76(3):479-484. doi: 10.1002/art.42737. Epub 2024 Jan 18.

Doctor Versus Artificial Intelligence: Patient and Physician Evaluation of Large Language Model Responses to Rheumatology Patient Questions in a Cross-Sectional Study

Carrie Ye et al. Arthritis Rheumatol. 2024 Mar.

Abstract

Objective: To assess the quality of large language model (LLM) chatbot versus physician-generated responses to patient-generated rheumatology questions.

Methods: We conducted a single-center cross-sectional survey of rheumatology patients (n = 17) in Edmonton, Alberta, Canada. Patients rated LLM chatbot versus physician-generated responses for comprehensiveness and readability; four rheumatologists rated the same responses for comprehensiveness, readability, and accuracy. All ratings used a Likert scale from 1 to 10 (1 = poor, 10 = excellent).

Results: Patients rated no significant difference between artificial intelligence (AI)- and physician-generated responses in comprehensiveness (mean 7.12 ± SD 0.99 vs 7.52 ± 1.16; P = 0.1962) or readability (7.90 ± 0.90 vs 7.80 ± 0.75; P = 0.5905). Rheumatologists rated AI responses significantly poorer than physician responses on comprehensiveness (AI 5.52 ± 2.13 vs physician 8.76 ± 1.07; P < 0.0001), readability (AI 7.85 ± 0.92 vs physician 8.75 ± 0.57; P = 0.0003), and accuracy (AI 6.48 ± 2.07 vs physician 9.08 ± 0.64; P < 0.0001). The proportion of preference for AI- versus physician-generated responses was 0.45 ± 0.18 among patients and 0.15 ± 0.08 among physicians (P = 0.0106). After learning that one answer to each question was AI generated, patients correctly identified AI-generated answers at a lower proportion than physicians (0.49 ± 0.26 vs 0.97 ± 0.04; P = 0.0183). AI answers averaged 69.10 ± 25.35 words, compared with 98.83 ± 34.58 words for physician-generated responses (P = 0.0008).
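The abstract does not name the statistical test behind these P values. As a minimal sketch of one plausible analysis, assuming paired comparisons of per-question mean ratings with a paired t-test, the following Python snippet illustrates the calculation; the rating arrays are hypothetical placeholders, not study data.

    # Minimal sketch: paired t-test on per-question mean ratings.
    # ASSUMPTION: the study's actual test is not stated in the abstract;
    # the ratings below are hypothetical placeholders, not study data.
    import numpy as np
    from scipy import stats

    # Hypothetical mean comprehensiveness ratings, one value per question,
    # for the AI-generated and physician-generated answer to that question.
    ai_ratings = np.array([7.0, 6.5, 8.0, 7.5, 6.0, 7.8, 7.2, 6.9])
    md_ratings = np.array([7.5, 7.0, 8.2, 7.8, 7.1, 8.0, 7.6, 7.3])

    t_stat, p_value = stats.ttest_rel(ai_ratings, md_ratings)
    print(f"mean AI {ai_ratings.mean():.2f} ± {ai_ratings.std(ddof=1):.2f}")
    print(f"mean MD {md_ratings.mean():.2f} ± {md_ratings.std(ddof=1):.2f}")
    print(f"paired t = {t_stat:.3f}, P = {p_value:.4f}")

A paired design is assumed because each patient question received both an AI and a physician answer; an unpaired two-sample test would instead use stats.ttest_ind.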

Conclusion: Rheumatology patients rated AI-generated responses to patient questions similarly to physician-generated responses in terms of comprehensiveness, readability, and overall preference. However, rheumatologists rated AI responses significantly poorer than physician-generated responses, suggesting that LLM chatbot responses are inferior to physician responses, a difference that patients may not be aware of.

