Comparative Study
BMJ Open. 2025 Sep 4;15(9):e099301.
doi: 10.1136/bmjopen-2025-099301.

Quality and efficiency of integrating customised large language model-generated summaries versus physician-written summaries: a validation study

Rosanne C Schoonbeek et al. BMJ Open. 2025.

Abstract

Objectives: To compare the quality and time efficiency of physician-written summaries with customised large language model (LLM)-generated medical summaries integrated into the electronic health record (EHR) in a non-English clinical environment.

Design: Cross-sectional non-inferiority validation study.

Setting: Tertiary academic hospital.

Participants: 52 physicians from 8 specialties at a large Dutch academic hospital participated, either in writing summaries (n=42) or evaluating them (n=10).

Interventions: Physician writers wrote summaries of 50 patient records. LLM-generated summaries were created for the same records using an EHR-integrated LLM. An independent, blinded panel of physician evaluators compared physician-written summaries to LLM-generated summaries.

Primary and secondary outcome measures: Primary outcome measures were completeness, correctness and conciseness (on a 5-point Likert scale). Secondary outcomes were preference and trust, and time to generate either the physician-written or LLM-generated summary.

Results: The completeness and correctness of LLM-generated summaries did not differ significantly from those of physician-written summaries. However, LLM summaries were less concise (3.0 vs 3.5, p=0.001). Overall evaluation scores were similar (3.4 vs 3.3, p=0.373), with 57% of evaluators preferring LLM-generated summaries. Trust in both summary types was comparable, and interobserver reliability was excellent (intraclass correlation coefficient 0.975). Physicians took an average of 7 min per summary, while the LLM completed the same task in 15.7 s.
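
To put the reported timings in perspective, the short sketch below works out the implied speed-up and the writing time saved across the 50 records. It uses only the figures quoted in the Results (7 min per physician summary, 15.7 s per LLM summary, 50 records). The function and constant names are illustrative, the 7 min figure is a rounded average, and the calculation assumes no additional physician time for reviewing the LLM output, which the abstract does not quantify.

```python
# Figures taken from the abstract; names are illustrative, not from the study.
PHYSICIAN_SECONDS = 7 * 60   # "an average of 7 min per summary" (rounded)
LLM_SECONDS = 15.7           # reported LLM generation time per summary
N_RECORDS = 50               # patient records summarised in the study


def speedup(manual_s: float, automated_s: float) -> float:
    """Return how many times faster the automated process is."""
    return manual_s / automated_s


def minutes_saved(manual_s: float, automated_s: float, n_items: int) -> float:
    """Return total minutes saved over n_items.

    Assumes no extra review time for the automated output.
    """
    return (manual_s - automated_s) * n_items / 60


ratio = speedup(PHYSICIAN_SECONDS, LLM_SECONDS)
saved = minutes_saved(PHYSICIAN_SECONDS, LLM_SECONDS, N_RECORDS)

print(f"Speed-up: about {ratio:.1f}x")
print(f"Writing time saved over {N_RECORDS} records: about {saved:.0f} min")
```

Under these assumptions, the LLM is roughly 27 times faster per summary. Any time physicians spend checking generated summaries would reduce the net saving.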

Conclusions: LLM-generated summaries are comparable to physician-written summaries in completeness and correctness, although slightly less concise. With a clear time-saving benefit, LLMs could help reduce clinicians' administrative burden without compromising summary quality.

Keywords: Artificial Intelligence; Electronic Health Records; Physicians.

Conflict of interest statement

Competing interests: None declared.

Figures

Figure 1. The non-inferiority study design. Online supplemental material includes further explanation on the numbers. LLM, large language model.

Figure 2. Example of summaries (left panel) and their corresponding evaluation by the physician evaluators (right panel). In red: mistake by physician; in green: additional valuable information in physician summary. Translated from Dutch to English for illustration purposes; original Dutch text available (online supplemental material). AI, artificial intelligence.

Figure 3. Recognition, preference and trust infographic. AI, artificial intelligence.
