[Preprint]. 2025 May 6:2025.04.22.25326219.
doi: 10.1101/2025.04.22.25326219.

Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge


Emma Croxford et al. medRxiv.

Abstract

Electronic Health Records (EHRs) store vast amounts of clinical information that healthcare providers find difficult to summarize and synthesize into details relevant to their practice. To reduce this cognitive load, generative AI systems built on Large Language Models (LLMs) have emerged that automatically summarize patient records into clear, actionable insights. However, LLM summaries must be precise and free from errors, making evaluation of summary quality necessary. While human experts are the gold standard for evaluation, their involvement is time-consuming and costly. Therefore, we introduce and validate an automated method for evaluating real-world EHR multi-document summaries using an LLM as the evaluator, referred to as LLM-as-a-Judge. Benchmarked against the validated Provider Documentation Summarization Quality Instrument (PDSQI-9) for human evaluation, our LLM-as-a-Judge framework demonstrated strong inter-rater reliability with human evaluators. GPT-o3-mini achieved the highest intraclass correlation coefficient, 0.818 (95% CI 0.772-0.854), with a median score difference of 0 from human evaluators, and completed evaluations in just 22 seconds. Overall, reasoning models excelled in inter-rater reliability, particularly in evaluations requiring advanced reasoning and domain expertise, outperforming non-reasoning models, models trained on the task, and multi-agent workflows. Cross-task validation on the Problem Summarization task similarly confirmed high reliability. By automating high-quality evaluations, a medical LLM-as-a-Judge offers a scalable, efficient solution for rapidly identifying accurate and safe AI-generated summaries in healthcare settings.
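The agreement metric reported above, the intraclass correlation coefficient, can be computed from a subjects-by-raters score matrix via a two-way ANOVA decomposition. The sketch below is an illustrative, dependency-free implementation of the ICC(2,1) form (two-way random effects, absolute agreement, single rater); the paper does not specify which ICC variant was used, so this particular form is an assumption.

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows, one per subject (e.g. summary being rated),
    each row holding one score per rater (e.g. human evaluator, LLM judge).
    """
    n = len(scores)           # number of subjects
    k = len(scores[0])        # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((scores[i][j] - grand) ** 2
                   for i in range(n) for j in range(k))
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Standard ICC(2,1) formula (Shrout & Fleiss).
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```

With perfect rater agreement the function returns 1.0; values near 0.8, like the 0.818 reported for GPT-o3-mini, indicate strong agreement between the LLM judge and human evaluators.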


Conflict of interest statement

The authors have no competing interests to declare.

Figures

Figure 1: Study Overview
Five distinct training strategies for large language models using the PDSQI-9 instrument were evaluated. The experiments comprised expert-driven prompt engineering, supervised fine-tuning, direct preference optimization, and multi-agent architectures, representing the LLM-as-a-Judge framework for clinical summarization.
Figure 2: Prompt Development Overview
Figure 3: Absolute Differences between Human Evaluator and LLM-as-a-Judge Scores
Absolute score differences between the human evaluators and either GPT-4o or GPT-o3-mini serving as the LLM-as-a-Judge, across each attribute of the PDSQI-9 instrument.
Figure 4: System Messages for LLM Agents in Multi-Agent Workflows
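The workflow outlined in the figures centers on prompting an LLM to score a summary on each quality attribute and parsing a structured response. The sketch below is a minimal, hypothetical illustration of that pattern; the prompt wording, attribute names, and JSON schema are assumptions for illustration, not the paper's actual system messages or the PDSQI-9 instrument's text.

```python
import json

def build_judge_prompt(attribute: str, source_notes: str, summary: str) -> str:
    """Assemble a hypothetical LLM-as-a-Judge prompt for one quality attribute."""
    return (
        "You are a clinical documentation expert. Rate the summary below on "
        f"the attribute '{attribute}' using a 1-5 Likert scale.\n\n"
        f"Source notes:\n{source_notes}\n\nSummary:\n{summary}\n\n"
        'Respond with JSON only: {"score": <1-5>, "rationale": "<brief>"}'
    )

def parse_judge_response(text: str) -> int:
    """Extract and validate the numeric score from the judge's JSON reply."""
    score = json.loads(text)["score"]
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score
```

In practice the prompt would be sent to the judge model (e.g. GPT-o3-mini) once per attribute, and the parsed scores compared against human PDSQI-9 ratings to compute agreement statistics such as the ICC.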

