Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study
- PMID: 37606976
- PMCID: PMC10481210
- DOI: 10.2196/48659
Abstract
Background: Large language model (LLM)-based artificial intelligence chatbots apply the power of large training data sets to successive, related tasks, as opposed to the single-ask tasks for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated.
Objective: This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.
Methods: We entered all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the factors contributing to ChatGPT's performance on clinical tasks.
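The accuracy metric described above (proportion of correct responses with a 95% CI) can be sketched as follows. This is a minimal illustration assuming a normal-approximation (Wald) interval; the function name and the counts used in the example are illustrative, not the study's raw data.

```python
import math

def accuracy_with_ci(num_correct, num_total, z=1.96):
    """Proportion correct with a Wald (normal-approximation) 95% CI."""
    p = num_correct / num_total
    se = math.sqrt(p * (1 - p) / num_total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Illustrative counts only (not the study's raw scoring data):
p, lo, hi = accuracy_with_ci(430, 600)
print(f"accuracy={p:.1%}, 95% CI {lo:.1%}-{hi:.1%}")
```

Note that for proportions near 0 or 1, an exact or Wilson interval would be preferable to the Wald approximation used here.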
Results: ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types.
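The question-type comparisons reported above (negative β for differential diagnosis and management relative to general medical knowledge) come from a linear regression on question-type indicators. A minimal sketch of that design, using synthetic per-question outcomes rather than the study's data, with effect sizes shaped loosely like the reported betas:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # synthetic questions; 1 = scored correct, 0 = incorrect

# Mutually exclusive question-type indicators relative to a
# "general medical knowledge" baseline (assumed encoding).
is_ddx = rng.integers(0, 2, n)                   # differential diagnosis
is_mgmt = (1 - is_ddx) * rng.integers(0, 2, n)   # clinical management
base = 0.75 - 0.158 * is_ddx - 0.074 * is_mgmt   # illustrative effects
y = (rng.random(n) < base).astype(float)

# OLS: intercept recovers baseline accuracy; coefficients are the
# per-type differences (the betas).
X = np.column_stack([np.ones(n), is_ddx, is_mgmt])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [intercept, ddx effect, mgmt effect]
```

With mutually exclusive indicators, the fitted coefficients are simply the per-group mean accuracies minus the baseline mean; the study's P values would come from the standard errors of these coefficients.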
Conclusions: ChatGPT achieves impressive accuracy in clinical decision-making, with its performance strengthening as more clinical information becomes available to it. In particular, ChatGPT demonstrates greater accuracy in final diagnosis than in initial differential diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.
Keywords: AI; ChatGPT; GPT; Generative Pre-trained Transformer; LLMs; accuracy; artificial intelligence; chatbot; clinical decision support; clinical vignettes; decision-making; development; large language models; usability; utility.
©Arya Rao, Michael Pang, John Kim, Meghana Kamineni, Winston Lie, Anoop K Prasad, Adam Landman, Keith Dreyer, Marc D Succi. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 22.08.2023.
Conflict of interest statement
Conflicts of Interest: None declared.
Update of
- Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow. medRxiv [Preprint]. 2023 Feb 26. doi: 10.1101/2023.02.21.23285886. PMID: 36865204.