NPJ Digit Med. 2024 Sep 28;7(1):258. doi: 10.1038/s41746-024-01258-7.

A framework for human evaluation of large language models in healthcare derived from literature review


Thomas Yu Chow Tam et al. NPJ Digit Med. 2024.

Abstract

With generative artificial intelligence (GenAI), particularly large language models (LLMs), continuing to make inroads in healthcare, assessing LLMs with human evaluations is essential to ensuring safety and effectiveness. This study reviews the existing literature on human evaluation methodologies for LLMs in healthcare across various medical specialties, addressing factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, evaluation process, and type of statistical analysis. Our literature review of 142 studies shows gaps in the reliability, generalizability, and applicability of current human evaluation practices. To overcome these significant obstacles to healthcare LLM development and deployment, we propose QUEST, a comprehensive and practical framework for human evaluation of LLMs covering three phases of workflow: Planning, Implementation and Adjudication, and Scoring and Review. QUEST is designed around five proposed evaluation principles: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.


Conflict of interest statement

P.M. has ownership and equity in BrainX, LLC; Y.W. has ownership and equity in BonafideNLP, LLC; and S.V. has ownership and equity in Kvatchii, Ltd., READE.ai, Inc., and ThetaRho, Inc. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Healthcare applications of LLMs.
The reviewed studies showcased a diverse range of healthcare applications for LLMs from bench to bedside and beyond, each aiming to enhance different aspects of patient care and clinical practice, biomedical and health sciences research, and education.
Fig. 2
Fig. 2. Top 10 medical specialties.
The literature review revealed a diverse range of medical specialties leveraging LLMs, with Radiology as the leading specialty. Urology and General Surgery also emerged as prominent specialties, along with Plastic Surgery, Otolaryngology, Ophthalmology, Orthopedic Surgery, and Psychiatry, while the remaining specialties had fewer than five articles each. This distribution highlights broad interest in and exploration of LLMs across medical domains, indicating the potential for transformative impact in multiple areas of healthcare and the corresponding need for comprehensive human evaluation in these areas.
Fig. 3
Fig. 3. Number of evaluation samples.
The left panel shows the distribution of sample sizes across all studies, while the right panel depicts the distribution for studies with 1–100 samples.
Fig. 4
Fig. 4. Number of human evaluators.
The left panel shows the distribution of the number of human evaluators across all studies, while the right panel depicts the distribution for studies with 1–20 human evaluators.
Fig. 5
Fig. 5. Number of evaluation samples vs. number of human evaluators.
An inverse relationship is observed between evaluation sample size and the number of human evaluators across the reviewed studies. This suggests a practical challenge: recruiting a large number of evaluators who have the capacity and/or capability to evaluate a large quantity of samples.
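For illustration, such an inverse relationship can be quantified with a rank correlation over the per-study (sample size, evaluator count) pairs. The following is a minimal sketch; the numbers are placeholders, not data extracted from the review:

    # Sketch: quantify the inverse relationship between evaluation
    # sample size and number of evaluators with Spearman's rank
    # correlation. The lists below are placeholder values only.
    from scipy.stats import spearmanr

    sample_sizes = [10, 25, 50, 100, 200, 500, 1000]  # per-study sample counts
    n_evaluators = [15, 12, 8, 6, 4, 3, 2]            # per-study evaluator counts

    rho, p = spearmanr(sample_sizes, n_evaluators)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # rho < 0 indicates an inverse trend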
Fig. 6
Fig. 6. Decision tree for choosing the right statistical test based on type of assessment and metric.
The choice of statistical test depends on the type of data and the evaluation objectives of each study. Parametric tests such as the t-test and ANOVA are appropriate when the data are normally distributed and the goal is to compare group means. Non-parametric tests such as the Mann–Whitney U test and the Kruskal–Wallis test are used when the data do not meet normality assumptions, providing robust alternatives for comparing medians or distributions of ordinal or non-normally distributed data. Chi-square and Fisher's exact tests are suitable for categorical data, assessing associations or goodness-of-fit between observed and expected frequencies, for example when evaluating the fit between LLM-generated medical evidence and clinical guidelines. Measures such as Cohen's kappa and the intraclass correlation coefficient (ICC) assess inter-rater reliability, ensuring that agreement between evaluators is not due to chance.
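A minimal sketch of this decision logic using scipy follows; the function name, the Shapiro–Wilk normality check, and the small-count heuristic for Fisher's exact test are illustrative choices, not prescriptions from the paper:

    # Sketch of the Fig. 6 decision tree for choosing a statistical test.
    from scipy import stats

    def choose_and_run_test(groups, data_type="continuous"):
        """Select and run a significance test per the Fig. 6 decision tree.

        groups: for categorical data, a 2x2 contingency table (list of rows);
                otherwise, a list of per-group score arrays.
        """
        if data_type == "categorical":
            table = groups
            # Fisher's exact test (2x2 tables only) when counts are small,
            # otherwise the chi-square test of association.
            if min(min(row) for row in table) < 5:
                return stats.fisher_exact(table)
            return stats.chi2_contingency(table)
        # Continuous/ordinal scores: parametric tests only if roughly normal.
        normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)
        if normal:
            return stats.ttest_ind(*groups) if len(groups) == 2 else stats.f_oneway(*groups)
        return stats.mannwhitneyu(*groups) if len(groups) == 2 else stats.kruskal(*groups)

Inter-rater reliability is computed separately from significance testing; for example, Cohen's kappa on paired evaluator labels is available as sklearn.metrics.cohen_kappa_score.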
Fig. 7
Fig. 7. The proposed QUEST human evaluation framework, delineating the multi-stage process for evaluating healthcare-related LLMs.
Derived from our literature review, the QUEST Human Evaluation Framework is a comprehensive, standardized framework for assessing LLMs in healthcare applications. It adheres to the QUEST dimensions and is designed for broad adoption by the community. It comprises three phases: Planning, Implementation and Adjudication, and Scoring and Review.
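As a concrete (hypothetical) encoding, the framework's fixed vocabulary, i.e., the five QUEST dimensions and the three workflow phases, can be pinned down in code so that rubric items and workflow steps reference it consistently; the identifiers below are ours, not the paper's:

    # Hypothetical encoding of the QUEST vocabulary described above;
    # class and variable names are illustrative, not defined by the paper.
    from enum import Enum

    class QuestDimension(Enum):
        QUALITY_OF_INFORMATION = "Quality of Information"
        UNDERSTANDING_AND_REASONING = "Understanding and Reasoning"
        EXPRESSION_STYLE_AND_PERSONA = "Expression Style and Persona"
        SAFETY_AND_HARM = "Safety and Harm"
        TRUST_AND_CONFIDENCE = "Trust and Confidence"

    PHASES = (
        "Planning",
        "Implementation and Adjudication",
        "Scoring and Review",
    )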
Fig. 8
Fig. 8. A visual summary of the application of QUEST human evaluation framework in the cases of ED clinical note summarization and ED triage decision support in a healthcare system.
Two use cases, clinical note summarization and patient triage, are provided as examples to showcase the applicability of the QUEST Human Evaluation Framework to different applications in a healthcare system. A detailed summary is provided for each step in the three phases: Planning, Implementation and Adjudication, and Scoring and Review.
Fig. 9
Fig. 9. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the article screening and identification process.
The initial search yielded 795 articles after applying language and publication-year filters. Exclusion criteria were set to omit article types irrelevant to our research aims, resulting in 688 potentially relevant articles. To ensure a focus on LLMs in healthcare, articles underwent a two-stage screening process. The first stage involved title and abstract screening to identify articles explicitly discussing human evaluation of LLMs within healthcare contexts. The second stage involved full-text review, emphasizing methodological detail, particularly regarding human evaluation of LLMs and applicability to healthcare. Due to accessibility issues, 42 articles were excluded, resulting in a final selection of 142 articles for the comprehensive literature review.
