JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.

Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models

Tassallah Abdullahi et al. JMIR Med Educ. 2024.

Abstract

Background: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains.

Objective: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance.

Methods: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks.
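
As a rough illustration of the majority voting step described above, the Python sketch below queries a model once per prompt variant and keeps the most frequent answer. The query_llm helper, prompt wording, and answer normalization are assumptions for illustration, not the authors' implementation.

    from collections import Counter

    def query_llm(prompt: str) -> str:
        # Placeholder for a call to the model under evaluation (e.g., GPT-4, ChatGPT-3.5, or Bard).
        raise NotImplementedError("Connect this to an actual LLM API.")

    def diagnose_with_majority_vote(case_description: str, prompt_templates: list[str]) -> str:
        # Steps 1-2: prompt the model with each distinct prompt variant and collect its responses.
        answers = [
            query_llm(template.format(case=case_description)).strip().lower()
            for template in prompt_templates
        ]
        # Step 3: majority voting -- the most consistent response becomes the final answer.
        most_common_answer, _count = Counter(answers).most_common(1)[0]
        return most_common_answer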

Results: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with minimum margins of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. In the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents in the moderately often misdiagnosed cases category, with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of the other LLMs. On the Medical Information Mart for Intensive Care-III data set, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs.

Conclusions: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.

Keywords: AI assistance; Bard; ChatGPT 3.5; GPT-4; MedAlpaca; artificial intelligence; clinical decision support; complex diagnosis; complex diseases; consistency; language model; medical education; medical training; natural language processing; prediction model; prompt engineering; rare diseases; reliability.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Our proposed method contains the following steps: (1) prompt a language model using a distinct set of prompts, (2) obtain diverse responses, and (3) choose the most consistent response as the final answer (majority voting).
Figure 2
Results of the diagnostic case challenge collection data set comparing prompt strategies. OpenAI GPT-4 outperformed all other models, achieving the highest score in all 30 cases using the majority voting approach. Furthermore, all large language models except MedAlpaca outperformed the human consensus (denoted by a black dashed line) across all cases, regardless of the difficulty, using at least 1 prompt approach. GPT-4: generative pretrained transformer-4.
Figure 3
Results of the Medical Information Mart for Intensive Care-III data set across prompt strategies. Approach 1 (open-ended prompt) proved challenging for all the large language models compared with approach 2 (multiple-choice prompt) and approach 3 (ranking prompt).
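
For concreteness, the three prompt approaches compared above might look roughly like the templates below. The exact wording used in the study is not given in this abstract, so these strings are illustrative assumptions only.

    # Illustrative templates for the three prompt approaches; wording is assumed, not taken from the paper.
    OPEN_ENDED_PROMPT = (
        "Read the following case and state the most likely diagnosis.\n"
        "Case: {case}"
    )
    MULTIPLE_CHOICE_PROMPT = (
        "Read the following case and select the most likely diagnosis from the options.\n"
        "Case: {case}\nOptions: {options}"
    )
    RANKING_PROMPT = (
        "Read the following case and rank the candidate diagnoses from most to least likely.\n"
        "Case: {case}\nCandidates: {options}"
    )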
