NPJ Digit Med. 2024 Sep 28;7(1):258. doi: 10.1038/s41746-024-01258-7.

A framework for human evaluation of large language models in healthcare derived from literature review


Thomas Yu Chow Tam et al. NPJ Digit Med. 2024.

Abstract

With generative artificial intelligence (GenAI), particularly large language models (LLMs), continuing to make inroads in healthcare, assessing LLMs with human evaluations is essential to ensuring safety and effectiveness. This study reviews the existing literature on human evaluation methodologies for LLMs in healthcare across various medical specialties, addressing factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, evaluation process, and type of statistical analysis. Our literature review of 142 studies shows gaps in the reliability, generalizability, and applicability of current human evaluation practices. To overcome these significant obstacles to healthcare LLM development and deployment, we propose QUEST, a comprehensive and practical framework for human evaluation of LLMs covering three phases of workflow: Planning, Implementation and Adjudication, and Scoring and Review. QUEST is designed around five proposed evaluation principles: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.


Conflict of interest statement

P.M. has ownership and equity in BrainX, LLC; Y.W. has ownership and equity in BonafideNLP, LLC; and S.V. has ownership and equity in Kvatchii, Ltd., READE.ai, Inc., and ThetaRho, Inc. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Healthcare applications of LLMs.
The reviewed studies showcased a diverse range of healthcare applications for LLMs from bench to bedside and beyond, each aiming to enhance different aspects of patient care and clinical practice, biomedical and health sciences research, and education.
Fig. 2
Fig. 2. Top 10 medical specialties.
The literature review revealed a diverse range of medical specialties leveraging LLMs, with Radiology as the leading specialty. Urology and General Surgery also emerged as prominent specialties, along with Plastic Surgery, Otolaryngology, Ophthalmology, Orthopedic Surgery, and Psychiatry, while the remaining specialties had fewer than five articles each. This distribution highlights broad interest in and exploration of LLMs across medical domains, indicating the potential for transformative impact in multiple areas of healthcare and the corresponding need for comprehensive human evaluation in these areas.
Fig. 3
Fig. 3. Number of evaluation samples.
The left panel shows the distribution of sample sizes across all studies, while the right panel depicts the distribution for studies with 1–100 samples.
Fig. 4
Fig. 4. Number of human evaluators.
The left panel shows the distribution of the number of human evaluators across all studies, while the right panel depicts the distribution for studies with 1–20 human evaluators.
Fig. 5
Fig. 5. Number of evaluation samples vs. number of human evaluators.
An inverse relationship is observed between evaluation sample size and the number of human evaluators across the reviewed studies. This suggests a practical challenge: recruiting a large number of evaluators who have the capacity and/or capability to evaluate a large quantity of samples.
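For illustration, such an inverse relationship can be quantified with a rank correlation over the per-study (sample size, evaluator count) pairs. The following is a minimal sketch; the numbers are placeholders, not data extracted from the review:

    # Sketch: quantify the inverse relationship between evaluation
    # sample size and number of evaluators with Spearman's rank
    # correlation. The lists below are placeholder values only.
    from scipy.stats import spearmanr

    sample_sizes = [10, 25, 50, 100, 200, 500, 1000]  # per-study sample counts
    n_evaluators = [15, 12, 8, 6, 4, 3, 2]            # per-study evaluator counts

    rho, p = spearmanr(sample_sizes, n_evaluators)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # rho < 0 indicates an inverse trend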
Fig. 6
Fig. 6. Decision tree for choosing the right statistical test based on type of assessment and metric.
The choice of statistical test depends on the type of data and the evaluation objectives of each study. Parametric tests such as the t-test and ANOVA are appropriate when the data are normally distributed and the goal is to compare group means. Non-parametric tests such as the Mann–Whitney U test and the Kruskal–Wallis test are used when the data do not meet normality assumptions, providing robust alternatives for comparing medians or distributions of ordinal or non-normally distributed data. Chi-square and Fisher's exact tests are suitable for categorical data, assessing associations or goodness-of-fit between observed and expected frequencies, for example when evaluating the fit between LLM-generated medical evidence and clinical guidelines. Measures such as Cohen's kappa and the intraclass correlation coefficient (ICC) assess inter-rater reliability, ensuring that agreement between evaluators is not due to chance.
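A minimal sketch of this decision logic using scipy follows; the function name, the Shapiro–Wilk normality check, and the small-count heuristic for Fisher's exact test are illustrative choices, not prescriptions from the paper:

    # Sketch of the Fig. 6 decision tree for choosing a statistical test.
    from scipy import stats

    def choose_and_run_test(groups, data_type="continuous"):
        """Select and run a significance test per the Fig. 6 decision tree.

        groups: for categorical data, a 2x2 contingency table (list of rows);
                otherwise, a list of per-group score arrays.
        """
        if data_type == "categorical":
            table = groups
            # Fisher's exact test (2x2 tables only) when counts are small,
            # otherwise the chi-square test of association.
            if min(min(row) for row in table) < 5:
                return stats.fisher_exact(table)
            return stats.chi2_contingency(table)
        # Continuous/ordinal scores: parametric tests only if roughly normal.
        normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)
        if normal:
            return stats.ttest_ind(*groups) if len(groups) == 2 else stats.f_oneway(*groups)
        return stats.mannwhitneyu(*groups) if len(groups) == 2 else stats.kruskal(*groups)

Inter-rater reliability is computed separately from significance testing; for example, Cohen's kappa on paired evaluator labels is available as sklearn.metrics.cohen_kappa_score.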
Fig. 7
Fig. 7. The proposed QUEST human evaluation framework, delineating the multi-stage process for evaluating healthcare-related LLMs.
Derived from our literature review, the QUEST Human Evaluation Framework is a comprehensive, standardized framework for assessing LLMs in healthcare applications. It adheres to the QUEST dimensions and is designed for broad adoption by the community. It comprises three phases: Planning, Implementation and Adjudication, and Scoring and Review.
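As a concrete (hypothetical) encoding, the framework's fixed vocabulary, i.e., the five QUEST dimensions and the three workflow phases, can be pinned down in code so that rubric items and workflow steps reference it consistently; the identifiers below are ours, not the paper's:

    # Hypothetical encoding of the QUEST vocabulary described above;
    # class and variable names are illustrative, not defined by the paper.
    from enum import Enum

    class QuestDimension(Enum):
        QUALITY_OF_INFORMATION = "Quality of Information"
        UNDERSTANDING_AND_REASONING = "Understanding and Reasoning"
        EXPRESSION_STYLE_AND_PERSONA = "Expression Style and Persona"
        SAFETY_AND_HARM = "Safety and Harm"
        TRUST_AND_CONFIDENCE = "Trust and Confidence"

    PHASES = (
        "Planning",
        "Implementation and Adjudication",
        "Scoring and Review",
    )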
Fig. 8
Fig. 8. A visual summary of the application of QUEST human evaluation framework in the cases of ED clinical note summarization and ED triage decision support in a healthcare system.
Two use cases, clinical note summarization and patient triage, are provided as examples to showcase the applicability of the QUEST Human Evaluation Framework to different applications in a healthcare system. A detailed summary is provided for each step in the three phases: Planning, Implementation and Adjudication, and Scoring and Review.
Fig. 9
Fig. 9. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the article screening and identification process.
The initial search yielded 795 articles after applying language and publication-year filters. Exclusion criteria were set to omit article types irrelevant to our research aims, resulting in 688 potentially relevant articles. To ensure a focus on LLMs in healthcare, articles underwent a two-stage screening process. The first stage involved title and abstract screening to identify articles explicitly discussing human evaluation of LLMs within healthcare contexts. The second stage involved full-text review, emphasizing methodological detail, particularly regarding human evaluation of LLMs and applicability to healthcare. Due to accessibility issues, 42 articles were excluded, resulting in a final selection of 142 articles for the comprehensive literature review.
