BJUI Compass. 2024 Apr 3;5(5):438-444. doi: 10.1002/bco2.359. eCollection 2024 May.

Urology consultants versus large language models: Potentials and hazards for medical advice in urology

Johanna Eckrich et al. BJUI Compass. 2024.

Erratum in

  • Erratum. [No authors listed] BJUI Compass. 2024 Dec 30;5(12):1324-1329. doi: 10.1002/bco2.482. eCollection 2024 Dec. PMID: 39744071. Free PMC article.

Abstract

Background: Current interest in large language models (LLMs) is likely to increase their use for medical advice. Although LLMs offer considerable potential, they also pose misinformation hazards.

Objective: This study evaluates three LLMs on urology-themed, case-based clinical questions by comparing the quality of their answers with answers provided by urology consultants.

Methods: Forty-five case-based questions were answered by consultants and LLMs (ChatGPT 3.5, ChatGPT 4, Bard). Answers were rated blindly by four consultants on a six-step Likert scale in the categories 'medical adequacy', 'conciseness', 'coherence' and 'comprehensibility'. Possible misinformation hazards were identified, a modified Turing test was included, and character counts were compared.

Results: The consultants received higher ratings in every category. The LLMs' overall performance in the language-focused categories (coherence and comprehensibility) was relatively high, but their medical adequacy was significantly poorer than that of the consultants. Possible misinformation hazards were identified in 2.8% to 18.9% of LLM-generated answers, compared with <1% of consultants' answers. LLM answers were also less concise and had a higher character count. Among the individual LLMs, ChatGPT 4 performed best in medical adequacy (p < 0.0001) and coherence (p = 0.001), whereas Bard received the lowest ratings. In the modified Turing test, responses were correctly attributed to their source with 98% accuracy for LLMs and 99% for consultants.

Conclusions: The quality of the consultants' answers was superior to that of the LLMs in all categories. LLM answers achieved high scores in the language-focused categories; however, their poorer medical accuracy creates potential misinformation hazards from LLM 'consultations'. Further investigation of newer model generations is necessary.

Keywords: Bard; ChatGPT; artificial intelligence (AI); chatbots; digital health; global health; large language models (LLMs); low‐ and middle‐income countries (LMICs); telehealth; telemedicine; urology.

Conflict of interest statement

No third‐party funding was used for the design of the study; the collection, analysis and interpretation of data; or the writing of the manuscript. The authors declare no competing interests.

Figures

FIGURE 1
The number of characters per answer by urology consultants and large language models (LLMs; ChatGPT 3.5, ChatGPT 4, Bard) for all evaluated categories. Data are shown as a scatter dot plot, with each point representing an absolute value. Grey horizontal line = median. The non-parametric Mann–Whitney test was used to compare the character counts for individual LLMs with those of the urology consultants (**** = p < 0.0001).
FIGURE 2
Comparison between urology consultants and large language models (LLMs; ChatGPT 3.5, ChatGPT 4, Bard) for all evaluated categories: cumulative ratings for medical adequacy (A), conciseness (B), coherence (C) and comprehensibility (D). Data are shown as a scatter dot plot, with each point representing an absolute value. Grey horizontal line = median. The non-parametric Mann–Whitney test was used to compare the ratings for individual LLMs with those of the urology consultants (**** = p < 0.0001; ** = p < 0.01; * = p < 0.05).
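
As an illustrative aside (not taken from the paper), the group comparison named in the figure legends, a non-parametric Mann–Whitney test applied to the Likert ratings, could be sketched in Python as follows; the rating values are hypothetical placeholders.

    # Minimal sketch, assuming hypothetical six-step Likert ratings for one
    # category (e.g. medical adequacy); not the authors' analysis code.
    from scipy.stats import mannwhitneyu

    consultant_ratings = [6, 5, 6, 5, 6, 6, 5, 4, 6, 5]  # hypothetical data
    llm_ratings = [4, 3, 5, 4, 3, 4, 2, 4, 3, 4]          # hypothetical data

    # Two-sided non-parametric comparison of the two rating distributions.
    statistic, p_value = mannwhitneyu(consultant_ratings, llm_ratings,
                                      alternative="two-sided")
    print(f"Mann-Whitney U = {statistic}, p = {p_value:.4f}")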

