J Med Internet Res. 2025 Mar 21;27:e67967. doi: 10.2196/67967.

Large Language Model-Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions: Development and Validation Study

Verity Schaye et al. J Med Internet Res. 2025.

Abstract

Background: Clinical reasoning (CR) is an essential skill, yet physicians often receive limited feedback on it. Artificial intelligence holds promise to fill this gap.

Objective: We report the development of a named entity recognition (NER), logic-based assessment and large language model (LLM)-based assessments of CR documentation in the electronic health record across 2 institutions (New York University Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]).

Methods: The note corpus consisted of internal medicine resident admission notes (retrospective set: July 2020-December 2021, n=700 NYU and n=450 UC notes; prospective validation set: July 2023-December 2023, n=155 NYU and n=92 UC notes). Clinicians rated CR documentation quality in each note using a previously validated tool (Revised-IDEA) on 3-point scales across 2 domains: differential diagnosis (D0, D1, and D2) and explanation of reasoning (EA0, EA1, and EA2). At NYU, the retrospective set was annotated for NER for 5 entities (diagnosis, diagnostic category, prioritization of diagnosis language, data, and linkage terms). Models were developed using different artificial intelligence approaches: (1) an NER, logic-based model, a large word vector model (scispaCy en_core_sci_lg) with model weights adjusted by backpropagation from the annotations, developed at NYU and externally validated at UC; (2) the NYUTron LLM, an NYU-internal 110 million parameter LLM pretrained on 7.25 million clinical notes, validated only at NYU; and (3) the GatorTron LLM, an open-source 345 million parameter LLM pretrained on 82 billion words of clinical text, fine-tuned on the NYU retrospective set, then externally validated and further fine-tuned at UC. Model performance was assessed in the prospective sets with F1-scores for the NER, logic-based model and with area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs.
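As a minimal, illustrative sketch (not the authors' code), the evaluation described above can be expressed with scikit-learn: per-class F1-scores for the NER, logic-based model's discrete D predictions, and AUROC/AUPRC for each binary LLM classifier's predicted probabilities. All function and variable names below are hypothetical.

from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

def evaluate_ner_logic_model(true_d_scores, predicted_d_scores):
    # Per-class F1 for the NER, logic-based D model (D0, D1, D2).
    return {
        f"D{label}_f1": f1_score(true_d_scores, predicted_d_scores,
                                 labels=[label], average="macro")
        for label in (0, 1, 2)
    }

def evaluate_llm_classifier(true_binary_labels, predicted_probabilities):
    # AUROC and AUPRC for one binary LLM classifier (eg, D2 vs not D2).
    return {
        "auroc": roc_auc_score(true_binary_labels, predicted_probabilities),
        "auprc": average_precision_score(true_binary_labels, predicted_probabilities),
    }

# Hypothetical usage on a prospective validation set:
# print(evaluate_ner_logic_model(y_true_d, y_pred_d))
# print(evaluate_llm_classifier(y_true_d2, p_d2))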

Results: At NYU, the NYUTron LLM performed best: the D0 and D2 models had AUROC/AUPRC of 0.87/0.79 and 0.89/0.86, respectively. The D1, EA0, and EA1 models had insufficient performance for implementation (AUROC range 0.57-0.80; AUPRC range 0.33-0.63). For D1 classification, the approach therefore pivoted to a stepwise classification that takes advantage of the more performant D0 and D2 models. For the EA domain, the approach pivoted to a binary EA2 model (ie, EA2 vs not EA2), which had excellent performance (AUROC/AUPRC 0.85/0.80). At UC, the NER, logic-based model was the best performing D model (F1-scores 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively). The GatorTron LLM performed best for EA2 classification (AUROC/AUPRC 0.75/0.69).
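The stepwise D classification mentioned above is not specified in detail in the abstract; the sketch below shows one plausible reading, in which a note flagged by neither the D0 nor the D2 binary model is assigned D1. The thresholds and the conflict rule are assumptions for illustration, not the authors' published logic.

def stepwise_d_score(p_d0, p_d2, d0_threshold=0.5, d2_threshold=0.5):
    # Combine binary D0 and D2 model probabilities into a 3-level D score.
    is_d0 = p_d0 >= d0_threshold
    is_d2 = p_d2 >= d2_threshold
    if is_d0 and not is_d2:
        return 0
    if is_d2 and not is_d0:
        return 2
    if is_d0 and is_d2:
        # Conflict: defer to the more confident model (assumed rule).
        return 0 if p_d0 >= p_d2 else 2
    return 1  # Neither model fires: intermediate quality (D1).

# Example: low D0 probability and high D2 probability yields a D2 score.
assert stepwise_d_score(0.10, 0.85) == 2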

Conclusions: This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned by implementing these models at distinct institutions support the generalizability of this approach.

Keywords: artificial intelligence; assessment; clinical reasoning; documentation; electronic health record; feedback; large language models.


Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Overview of the development and validation across 2 institutions of the named entity recognition, logic-based assessment and large language model-based assessments of resident clinical reasoning documentation in the electronic health record. AUROC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve; CV: cross-validation; LLM: large language model; NER: named entity recognition; NYU: New York University Grossman School of Medicine; UC: University of Cincinnati College of Medicine.
Figure 2. Example note (modified to protect patient privacy) with human rating of D and EA scores and annotation for named entity recognition of 5 entity types: 3 components of the D score (diagnosis [Dx], diagnostic category [DC], and prioritization of diagnosis language [Prior]) and 2 components of the EA score (data [Data] and linkage terms [Link]).
Figure 3. Large language model performance on prospective note sets classifying differential diagnosis (D0 and D2) in resident admission notes for the best performing D models selected for implementation at New York University Grossman School of Medicine. AUC: area under the curve; FPR: false positive rate; NYU: New York University; PPV: positive predictive value; ROC: receiver operating characteristic; TPR: true positive rate.
Figure 4. Large language model performance on prospective note sets classifying explanation of reasoning (EA2) in resident admission notes for the best performing EA models selected for implementation at New York University Grossman School of Medicine. AUC: area under the curve; FPR: false positive rate; NYU: New York University; PPV: positive predictive value; ROC: receiver operating characteristic; TPR: true positive rate.
Figure 5. Large language model performance on prospective note sets classifying explanation of reasoning (EA2) in resident admission notes for the best performing EA models selected for implementation at University of Cincinnati College of Medicine. AUC: area under the curve; FPR: false positive rate; PPV: positive predictive value; ROC: receiver operating characteristic; TPR: true positive rate; UC: University of Cincinnati.

