Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis
- PMID: 38462064
- DOI: 10.1016/j.jbi.2024.104620
Abstract
Objective: Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and provide direction for future research.
Methods: An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was "ChatGPT," without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies were excluded. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and ChatGPT's performance. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO (CRD42023456327).
Results: A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall pooled accuracy of 56% (95% CI: 51%-60%, I² = 87%) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. Per our proposed evaluation framework, many studies failed to report methodological details such as the date of inquiry, the version of ChatGPT, and inter-rater consistency.
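The pooled accuracy and I² statistic above come from a random-effects meta-analysis, in which each study's accuracy is weighted by the inverse of its within-study variance plus an estimated between-study variance. A minimal sketch of this kind of pooling, using the common DerSimonian-Laird estimator with hypothetical study data (not the 17 studies analyzed in the review):

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooling via the DerSimonian-Laird estimator.

    effects: per-study estimates (e.g., accuracy proportions)
    variances: their within-study variances
    Returns (pooled estimate, 95% CI, I-squared).
    """
    w = [1.0 / v for v in variances]
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    # Cochran's Q statistic measures observed heterogeneity
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                   # between-study variance
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0   # I^2 as a fraction
    # Re-weight with between-study variance included
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2

# Hypothetical accuracies and sample sizes for illustration only
accs = [0.44, 0.58, 0.62, 0.51]
ns = [100, 120, 80, 150]
variances = [p * (1 - p) / n for p, n in zip(accs, ns)]
pooled, ci, i2 = dersimonian_laird(accs, variances)
```

An I² near 87%, as reported here, indicates that most of the observed variation across studies reflects genuine between-study differences (question source, ChatGPT version, scoring criteria) rather than sampling error, which is why a random-effects model is appropriate.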
Conclusion: This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of study designs and insufficient reporting may affect the reliability of the results. Our proposed evaluation framework provides guidance for future study design and transparent reporting of LLMs in responding to medical questions.
Keywords: ChatGPT; Evaluation; Large language model; Medicine.
Copyright © 2024 Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Similar articles
- Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam. Int J Nurs Stud. 2024 May;153:104717. doi: 10.1016/j.ijnurstu.2024.104717. PMID: 38401366
- Application of Large Language Models in Medical Training Evaluation-Using ChatGPT as a Standardized Patient: Multimetric Assessment. J Med Internet Res. 2025 Jan 1;27:e59435. doi: 10.2196/59435. PMID: 39742453
- ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions. JMIR Med Educ. 2023 Dec 5;9:e49183. doi: 10.2196/49183. PMID: 38051578
- Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis. J Biomed Sci. 2025 May 7;32(1):45. doi: 10.1186/s12929-025-01131-z. PMID: 40335969
- Can ChatGPT-3.5 Pass a Medical Exam? A Systematic Review of ChatGPT's Performance in Academic Testing. J Med Educ Curric Dev. 2024 Mar 13;11:23821205241238641. doi: 10.1177/23821205241238641. PMID: 38487300
Cited by
- Evaluating a large language model's ability to answer clinicians' requests for evidence summaries. J Med Libr Assoc. 2025 Jan 14;113(1):65-77. doi: 10.5195/jmla.2025.1985. PMID: 39975503
- Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study. Front Digit Health. 2025 Jun 27;7:1574287. doi: 10.3389/fdgth.2025.1574287. PMID: 40657647
- Perceptions and Attitudes of Chinese Oncologists Toward Endorsing AI-Driven Chatbots for Health Information Seeking Among Patients with Cancer: Phenomenological Qualitative Study. J Med Internet Res. 2025 Jul 23;27:e71418. doi: 10.2196/71418. PMID: 40699917
- Assessing the Accuracy, Completeness and Safety of ChatGPT-4o Responses on Pressure Injuries in Infants: Clinical Applications and Future Implications. Nurs Rep. 2025 Apr 14;15(4):130. doi: 10.3390/nursrep15040130. PMID: 40333050
- Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Sci Rep. 2025 Mar 26;15(1):10426. doi: 10.1038/s41598-025-95233-1. PMID: 40140500