Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

Hiba Ahsan et al. Proc Mach Learn Res. 2024 Jun;248:489-505.

Abstract

Unstructured data in Electronic Health Records (EHRs) often contains critical information, complementary to imaging, that could inform radiologists' diagnoses. But the large volume of notes often associated with a patient, together with time constraints, renders manually identifying relevant evidence practically infeasible. In this work we propose and evaluate a zero-shot strategy for using LLMs to efficiently retrieve and summarize unstructured evidence in a patient's EHR relevant to a given query. Our method entails tasking an LLM to infer whether a patient has, or is at risk of, a particular condition on the basis of the associated notes; if so, we ask the model to summarize the supporting evidence. Under expert evaluation, we find that this LLM-based approach provides outputs consistently preferred to those of a pre-LLM information retrieval baseline. Manual evaluation is expensive, so we also propose and validate a method that uses an LLM to evaluate (other) LLM outputs for this task, allowing us to scale up evaluation. Our findings indicate the promise of LLMs as interfaces to EHRs, but also highlight the outstanding challenge posed by "hallucinations". In this setting, however, we show that model confidence in outputs correlates strongly with faithful summaries, offering a practical means to limit confabulations.
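
A minimal sketch of the two-step prompting strategy described above, assuming a generic generate callable that wraps whichever instruction-tuned LLM is used (e.g., Flan-T5 or Mistral-Instruct); the prompt wording and function names are illustrative assumptions, not the authors' exact prompts.

from typing import Callable, Optional

def retrieve_evidence(
    note: str,
    diagnosis: str,
    generate: Callable[[str], str],  # wraps any instruction-tuned LLM
) -> Optional[str]:
    """Return a summary of supporting evidence, or None if the model answers No."""
    # Step 1: ask whether the patient has, or is at risk of, the query diagnosis.
    triage_prompt = (
        f"Patient note:\n{note}\n\n"
        f"Does this patient have, or are they at risk of, {diagnosis}? Answer Yes or No."
    )
    if not generate(triage_prompt).strip().lower().startswith("yes"):
        return None
    # Step 2: only if the answer is Yes, elicit a summary of the supporting evidence.
    evidence_prompt = (
        f"Patient note:\n{note}\n\n"
        f"Summarize the evidence in the note that this patient has, "
        f"or is at risk of, {diagnosis}."
    )
    return generate(evidence_prompt).strip()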

Figures

Figure 6: Screenshot of the evaluation interface showing highlighted evidence.
Figure 1: Proposed prompting strategy to identify and summarize evidence relevant to a given query diagnosis using LLMs. We first ask if the patient has (or is at risk of) a condition, then elicit a summary of supporting evidence if so.
Figure 2: Data sampling flowchart. An instance is a unique (patient, diagnosis) combination.
Figure 3: Evidence generated by the LLMs is more often deemed useful than that retrieved by CBERT. However, on average 9.4% and 4.9% of the evidence generated by Flan-T5 and Mistral-Instruct, respectively, is hallucinated.
Figure 4: Distributions of normalized likelihood for present and hallucinated evidence. The score provides good discrimination of "hallucinated" evidence from present evidence (yielding AUCs of >0.9).
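
A rough sketch of how the normalized likelihood score of Figure 4 could be used to flag likely hallucinations: length-normalize the log-probability the generating model assigned to its own evidence span and apply a cutoff. The threshold value and the means of obtaining per-token log-probabilities are assumptions that depend on the model and serving stack.

import math
from typing import Sequence

def normalized_likelihood(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean per-token probability of a generated evidence span."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def looks_hallucinated(token_logprobs: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag evidence whose normalized likelihood falls below an (assumed) cutoff."""
    return normalized_likelihood(token_logprobs) < threshold
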
Figure 5: Automatic LLM-based evaluation of retrieved evidence. The evaluator LLM: (1) extracts risk factors from the evidence; (2) verifies the presence of each in the note; and (3) validates each present risk factor. The same approach is adopted for evaluating signs of the query diagnosis.
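
A minimal sketch of the automatic LLM-based evaluation summarized in Figure 5, reusing the same kind of generic generate callable as in the earlier sketch. The prompts, line-based parsing, and returned fields are illustrative assumptions rather than the paper's exact implementation.

from typing import Callable, List

def evaluate_evidence(
    note: str,
    diagnosis: str,
    evidence: str,
    generate: Callable[[str], str],
) -> List[dict]:
    # (1) Extract the risk factors (or signs) of the diagnosis mentioned in the evidence.
    raw = generate(
        f"List, one per line, the risk factors for {diagnosis} mentioned in this text:\n{evidence}"
    )
    factors = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    results: List[dict] = []
    for factor in factors:
        # (2) Verify the factor actually appears in the source note (hallucination check).
        present = generate(
            f"Note:\n{note}\n\nDoes the note mention '{factor}'? Answer Yes or No."
        ).strip().lower().startswith("yes")
        # (3) Validate that a present factor is genuinely a risk factor for the diagnosis.
        valid = present and generate(
            f"Is '{factor}' a risk factor for {diagnosis}? Answer Yes or No."
        ).strip().lower().startswith("yes")
        results.append({"factor": factor, "present": present, "valid": valid})
    return results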
