ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions

Christoph Raphael Buhr¹,², Harry Smith³, Tilman Huppertz¹, Katharina Bahr-Hamm¹, Christoph Matthias¹, Andrew Blaikie², Tom Kelsey³, Sebastian Kuhn⁴, Jonas Eckrich¹

Affiliations

¹ Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany.
² School of Medicine, University of St Andrews, St Andrews, United Kingdom.
³ School of Computer Science, University of St Andrews, St Andrews, United Kingdom.
⁴ Institute of Digital Medicine, Philipps-University Marburg and University Hospital of Giessen and Marburg, Marburg, Germany.

PMID: 38051578
PMCID: PMC10731554
DOI: 10.2196/49183

Christoph Raphael Buhr et al. JMIR Med Educ. 2023 Dec 5;9:e49183.

Abstract

Background: Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources. As a result, LLMs are increasingly "consulted" about personal medical symptoms.

Objective: This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers.

Methods: We used 41 case-based questions from established ORL textbooks and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. In a blinded setting, they also identified whether each answer was written by an ORL consultant or by ChatGPT. Additionally, the character counts of the answers were compared. Because the technology is evolving rapidly, responses generated by ChatGPT 3 and ChatGPT 4 were also compared to give an insight into the evolving potential of LLMs.

Results: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT's scores were higher in the semantic categories (conciseness, coherence, and comprehensibility) than in medical adequacy. ORL consultants correctly identified ChatGPT as the source in 98.4% (121/123) of cases. ChatGPT's answers had a significantly higher character count than those of the ORL consultants (P<.001). The comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as better coherence of the answers provided. In contrast, neither conciseness (P=.06) nor comprehensibility (P=.08) improved significantly, despite a significant increase in the mean number of characters of 52.5% ((1470-964)/964; P<.001).
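The two percentages quoted in the Results can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the study: the helper names are hypothetical, and it assumes that 964 and 1470 are the mean character counts for ChatGPT 3 and ChatGPT 4, respectively, as the expression in the abstract suggests.

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def relative_increase_percent(old: float, new: float) -> float:
    """Return the relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Figures taken from the abstract.
# Assumption: 964 = mean characters (ChatGPT 3), 1470 = mean characters (ChatGPT 4).
identification_rate = percent(121, 123)
character_increase = relative_increase_percent(964, 1470)

print(f"Correct source identification: {identification_rate:.1f}%")
print(f"Increase in mean character count: {character_increase:.1f}%")
```

Rounded to one decimal place, both values agree with the 98.4% and 52.5% reported in the abstract.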

Conclusions: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared to ORL consultants' answers. LLMs have potential as augmentative tools for medical care, but their "consultation" for medical problems carries a high risk of misinformation as their high semantic quality may mask contextual deficits.

Keywords: AI; ChatGPT; LLM; LLMs; ORL; artificial intelligence; chatbot; chatbots; digital health; global health; language model; large language models; low- and middle-income countries; otorhinolaryngology; telehealth; telemedicine.

©Christoph Raphael Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich. Originally published in JMIR Medical Education (https://mededu.jmir.org), 05.12.2023.

Conflict of interest statement

Conflicts of Interest: SK is the founder and shareholder of MED.digital.

Figures

**Figure 1**
Workflow of the study. ORL: otorhinolaryngology.

**Figure 2**
Comparison between ORL consultants and the LLM (ChatGPT) for all evaluated categories. Data are shown as a scatter dot plot, with each point representing an absolute value (the bar width reflects the number of individual values). Horizontal lines represent the mean (95% CI). The nonparametric Mann-Whitney U test was used to compare the 2 groups. Cumulative results of ratings for (A) medical adequacy, (B) conciseness, (C) coherence, and (D) comprehensibility. ORL: otorhinolaryngology. ****P<.001.

**Figure 3**
The number of characters per answer used by ORL consultants and ChatGPT. Data are shown as a scatter dot plot, with each point representing an absolute value. Horizontal lines represent the median. The nonparametric Mann-Whitney U test was used to compare the 2 groups. ORL: otorhinolaryngology. ****P<.001.

**Figure 4**
Comparison between LLMs (ChatGPT 3 vs ChatGPT 4) for all evaluated categories. Data are shown as a scatter dot plot, with each point representing an absolute value (the bar width reflects the number of individual values). Horizontal lines represent the mean (95% CI). The nonparametric Mann-Whitney U test was used to compare the 2 groups. Cumulative results of ratings for (A) medical adequacy, (B) conciseness, (C) coherence, and (D) comprehensibility. ns: not significantly different. *P<.05; **P<.01.

**Figure 5**
The number of characters used by ChatGPT 3 and ChatGPT 4. Data are shown as a scatter dot plot, with each point representing an absolute value. Horizontal lines represent the mean. The Welch 2-tailed t test was used to compare the 2 groups. ***P<.001.
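The figure captions name the Mann-Whitney U test as the method for comparing rating distributions. The sketch below illustrates how such a test works on ordinal ratings, using only the Python standard library. It is an illustration under stated assumptions, not the authors' analysis: the ratings are made-up placeholder values (not study data), the function names are hypothetical, and the P value uses the normal approximation with a tie correction and no continuity correction, which may differ from the software the authors used.

```python
import math
from collections import Counter


def midranks(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y, two-sided P) using a tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    n = n1 + n2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, u2, 1.0  # all observations identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, u2, p_two_sided


# Hypothetical ratings on a 6-point scale, invented purely for illustration.
group_a = [6, 5, 6, 5, 6, 4]
group_b = [3, 4, 2, 4, 3, 5]

u_a, u_b, p_value = mann_whitney_u(group_a, group_b)
print(f"U_a = {u_a}, U_b = {u_b}, P = {p_value:.4f}")
```

Midranks are used because Likert ratings contain many ties. For small samples an exact test is generally preferable to the normal approximation.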
