An evaluation framework for clinical use of large language models in patient interaction tasks

Shreya Johri^#¹, Jaehwan Jeong^#^{1,2}, Benjamin A Tran³, Daniel I Schlessinger⁴, Shannon Wongvibulsin⁵, Leandra A Barnes⁶, Hong-Yu Zhou¹, Zhuo Ran Cai⁶, Eliezer M Van Allen⁷, David Kim⁸, Roxana Daneshjou^{9,10}, Pranav Rajpurkar¹¹

Affiliations

¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
² Department of Computer Science, Stanford University, Stanford, CA, USA.
³ Department of Dermatology, Medstar Georgetown University Hospital/Washington Hospital Center, Washington, DC, USA.
⁴ Department of Dermatology, Northwestern University, Chicago, IL, USA.
⁵ Division of Dermatology, David Geffen School of Medicine at the University of California, Los Angeles, Los Angeles, CA, USA.
⁶ Department of Dermatology, Stanford University, Stanford, CA, USA.
⁷ Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA.
⁸ Department of Emergency Medicine, Stanford University, Stanford, CA, USA.
⁹ Department of Dermatology, Stanford University, Stanford, CA, USA. roxanad@stanford.edu.
¹⁰ Department of Biomedical Data Science, Stanford University, Stanford, CA, USA. roxanad@stanford.edu.
¹¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. pranav_rajpurkar@hms.harvard.edu.

^# Contributed equally.

PMID: 39747685
DOI: 10.1038/s41591-024-03328-5

Shreya Johri et al. Nat Med. 2025 Jan;31(1):77-86. doi: 10.1038/s41591-024-03328-5. Epub 2025 Jan 2.

Abstract

The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.

Conflict of interest statement

Competing interests: R.D. reports receiving personal fees from DWA, personal fees from Pfizer, personal fees from L’Oreal, personal fees from VisualDx and stock options from MDAlgorithms and Revea outside the submitted work and has a patent for TrueImage pending. D.I.S. is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant for Appiell Inc. and LuminDx and an investigator for AbbVie and Sanofi. E.M.V.A. serves as an advisor to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research and Serinus Bio. E.M.V.A. provides research support to Novartis, Bristol Myers Squibb, Sanofi and NextPoint. E.M.V.A. holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio and Syapse. E.M.V.A. has filed for institutional patents on chromatin mutations and immunotherapy response and methods for clinical interpretation and provides intermittent legal consulting on patents to Foley & Hoag. E.M.V.A. also serves on the editorial board of Science Advances. The other authors declare no competing interests.

Ethics Declaration: The CRAFT-MD framework is designed to enable faster evaluation of LLMs for leading clinical conversations and to uncover limitations to guide future model development. These LLMs could enhance clinical workflows by engaging in preliminary conversations with patients, collecting and summarizing relevant medical information and presenting these data to doctors before patient visits, potentially improving the effectiveness of doctor–patient interactions. These LLMs could be more effective than pre-visit questionnaires, given their ability to lead dynamic conversations. However, this will require not only developing more capable LLMs but also making them more fault tolerant and cognizant of appropriate empathetic behavior.
