Evaluating a large language model's ability to answer clinicians' requests for evidence summaries
- PMID: 39975503
- PMCID: PMC11835037
- DOI: 10.5195/jmla.2025.1985
Evaluating a large language model's ability to answer clinicians' requests for evidence summaries
Abstract
Objective: This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.
Methods: Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat.
Results: Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated.
Conclusions: Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
Keywords: Artificial Intelligence; Biomedical Informatics; Evidence Synthesis; Generative AI; Information Science; LLMs; Large Language Models; Library Science.
Copyright © 2025 Mallory N. Blasingame, Taneya Y. Koonce, Annette M. Williams, Dario A. Giuse, Jing Su, Poppy A. Krump, Nunzia Bettinsoli Giuse.
Figures
Update of
-
Evaluating a Large Language Model's Ability to Answer Clinicians' Requests for Evidence Summaries.medRxiv [Preprint]. 2024 May 3:2024.05.01.24306691. doi: 10.1101/2024.05.01.24306691. medRxiv. 2024. Update in: J Med Libr Assoc. 2025 Jan 14;113(1):65-77. doi: 10.5195/jmla.2025.1985. PMID: 38746273 Free PMC article. Updated. Preprint.
Similar articles
-
Evaluating a Large Language Model's Ability to Answer Clinicians' Requests for Evidence Summaries.medRxiv [Preprint]. 2024 May 3:2024.05.01.24306691. doi: 10.1101/2024.05.01.24306691. medRxiv. 2024. Update in: J Med Libr Assoc. 2025 Jan 14;113(1):65-77. doi: 10.5195/jmla.2025.1985. PMID: 38746273 Free PMC article. Updated. Preprint.
-
Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655. J Med Internet Res. 2024. PMID: 38630520 Free PMC article.
-
Integrating PICO principles into generative artificial intelligence prompt engineering to enhance information retrieval for medical librarians.J Med Libr Assoc. 2025 Apr 18;113(2):184-188. doi: 10.5195/jmla.2025.2022. J Med Libr Assoc. 2025. PMID: 40342302 Free PMC article.
-
Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions.Surg Obes Relat Dis. 2024 Jul;20(7):609-613. doi: 10.1016/j.soard.2024.04.014. Epub 2024 May 8. Surg Obes Relat Dis. 2024. PMID: 38782611 Review.
-
The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis.J Med Internet Res. 2024 Nov 5;26:e56532. doi: 10.2196/56532. J Med Internet Res. 2024. PMID: 39499913 Free PMC article.
Cited by
-
The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review.J Am Med Inform Assoc. 2025 Jun 1;32(6):1071-1086. doi: 10.1093/jamia/ocaf063. J Am Med Inform Assoc. 2025. PMID: 40332983 Free PMC article.
References
-
- OpenAI. Introducing ChatGPT [Internet]. OpenAI; 2022. Nov 30 [cited 2024 Apr 25]. <https://openai.com/blog/chatgpt>.
-
- Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023. Jun 1;183(6):589–96. DOI: 10.1001/jamainternmed.2023.1838 - DOI - PMC - PubMed
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources