An evaluation framework for clinical use of large language models in patient interaction tasks
- PMID: 39747685
- DOI: 10.1038/s41591-024-03328-5
Abstract
The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.
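The agent-based setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Case`, `patient_agent`, `run_conversation`, `grade` and `scripted_doctor` are hypothetical, the case data are toy values invented for illustration, and both the simulated patient and the grader are simple rule-based stand-ins for what would, in a real evaluation, be LLM agents supplemented by expert review.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (speaker, text)


@dataclass
class Case:
    vignette: Dict[str, str]  # topic keyword -> what the patient would say (toy data)
    diagnosis: str            # reference diagnosis for grading


def patient_agent(case: Case, question: str) -> str:
    """Hypothetical simulated patient: reveals a fact only when asked about it."""
    q = question.lower()
    for topic, answer in case.vignette.items():
        if topic in q:
            return answer
    return "I'm not sure."


def run_conversation(case: Case,
                     doctor: Callable[[List[Message]], str],
                     max_turns: int = 5) -> List[Message]:
    """Alternate doctor and patient turns until a final diagnosis or the turn limit."""
    transcript: List[Message] = []
    for _ in range(max_turns):
        reply = doctor(transcript)
        transcript.append(("doctor", reply))
        if reply.lower().startswith("final diagnosis:"):
            break
        transcript.append(("patient", patient_agent(case, reply)))
    return transcript


def grade(case: Case, transcript: List[Message]) -> bool:
    """Hypothetical grader: case-insensitive exact match on the final diagnosis."""
    if not transcript:
        return False
    speaker, text = transcript[-1]
    if speaker != "doctor" or not text.lower().startswith("final diagnosis:"):
        return False
    return text.split(":", 1)[1].strip().lower() == case.diagnosis.lower()


def scripted_doctor(transcript: List[Message]) -> str:
    """Placeholder for the clinical LLM under test: two fixed questions, then a diagnosis."""
    questions = ["Can you describe the rash?", "Do you have any itch?"]
    asked = sum(1 for speaker, _ in transcript if speaker == "doctor")
    if asked < len(questions):
        return questions[asked]
    return "Final diagnosis: contact dermatitis"


case = Case(
    vignette={"rash": "It is red and on my wrist.", "itch": "Yes, it itches a lot."},
    diagnosis="Contact dermatitis",
)
transcript = run_conversation(case, scripted_doctor)
correct = grade(case, transcript)
```

In a fuller setup, `scripted_doctor` would be replaced by a call to the model being evaluated, and accuracy would be averaged over many cases and specialties.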
© 2025. The Author(s), under exclusive licence to Springer Nature America, Inc.
Conflict of interest statement
Competing interests: R.D. reports receiving personal fees from DWA, Pfizer, L’Oreal and VisualDx, and stock options from MDAlgorithms and Revea, outside the submitted work, and has a patent pending for TrueImage. D.I.S. is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant for Appiell Inc. and LuminDx and an investigator for AbbVie and Sanofi. E.M.V.A. serves as an advisor to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research and Serinus Bio. E.M.V.A. receives research support from Novartis, Bristol Myers Squibb, Sanofi and NextPoint. E.M.V.A. holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio and Syapse. E.M.V.A. has filed for institutional patents on chromatin mutations and immunotherapy response and methods for clinical interpretation and provides intermittent legal consulting on patents to Foley & Hoag. E.M.V.A. also serves on the editorial board of Science Advances. The other authors declare no competing interests.
Ethics Declaration: The CRAFT-MD framework is designed to enable faster evaluation of LLMs for leading clinical conversations and to uncover limitations that can guide future model development. These LLMs could enhance clinical workflows by engaging in preliminary conversations with patients, collecting and summarizing relevant medical information and presenting it to doctors before patient visits, potentially improving the effectiveness of doctor–patient interactions. Because they can lead dynamic conversations, these LLMs could be more effective than pre-visit questionnaires. However, this will require not only developing more capable LLMs but also making them more fault-tolerant and cognizant of appropriate empathetic behavior.