Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

Hiba Ahsan et al. Proc Mach Learn Res. 2024 Jun;248:489-505.

Abstract

Unstructured data in Electronic Health Records (EHRs) often contains critical information, complementary to imaging, that could inform radiologists' diagnoses. But the large volume of notes often associated with a patient, together with time constraints, renders manually identifying relevant evidence practically infeasible. In this work we propose and evaluate a zero-shot strategy for using LLMs to efficiently retrieve and summarize unstructured evidence in a patient's EHR relevant to a given query. Our method entails tasking an LLM to infer whether a patient has, or is at risk of, a particular condition on the basis of the associated notes; if so, we ask the model to summarize the supporting evidence. Under expert evaluation, we find that this LLM-based approach provides outputs consistently preferred to those of a pre-LLM information retrieval baseline. Manual evaluation is expensive, so we also propose and validate a method that uses an LLM to evaluate (other) LLM outputs for this task, allowing us to scale up evaluation. Our findings indicate the promise of LLMs as interfaces to EHRs, but also highlight the outstanding challenge posed by "hallucinations". In this setting, however, we show that model confidence in outputs correlates strongly with faithful summaries, offering a practical means to limit confabulations.
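
A minimal sketch of the two-step prompting strategy described above, assuming a generic generate callable that wraps whichever instruction-tuned LLM is used (e.g., Flan-T5 or Mistral-Instruct); the prompt wording and function names are illustrative assumptions, not the authors' exact prompts.

from typing import Callable, Optional

def retrieve_evidence(
    note: str,
    diagnosis: str,
    generate: Callable[[str], str],  # wraps any instruction-tuned LLM
) -> Optional[str]:
    """Return a summary of supporting evidence, or None if the model answers No."""
    # Step 1: ask whether the patient has, or is at risk of, the query diagnosis.
    triage_prompt = (
        f"Patient note:\n{note}\n\n"
        f"Does this patient have, or are they at risk of, {diagnosis}? Answer Yes or No."
    )
    if not generate(triage_prompt).strip().lower().startswith("yes"):
        return None
    # Step 2: only if the answer is Yes, elicit a summary of the supporting evidence.
    evidence_prompt = (
        f"Patient note:\n{note}\n\n"
        f"Summarize the evidence in the note that this patient has, "
        f"or is at risk of, {diagnosis}."
    )
    return generate(evidence_prompt).strip()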

Figures

Figure 6: Screenshot of the evaluation interface showing highlighted evidence.
Figure 1: Proposed prompting strategy to identify and summarize evidence relevant to a given query diagnosis using LLMs. We first ask if the patient has (or is at risk of) a condition, then elicit a summary of supporting evidence if so.
Figure 2: Data sampling flowchart. An instance is a unique (patient, diagnosis) combination.
Figure 3: Evidence generated by the LLMs is more often deemed useful than that retrieved by CBERT. However, on average 9.4% and 4.9% of the evidence generated by Flan-T5 and Mistral-Instruct, respectively, is hallucinated.
Figure 4: Distributions of normalized likelihood for present and hallucinated evidence. The score provides good discrimination of "hallucinated" evidence from present evidence (yielding AUCs of >0.9).
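
A rough sketch of how the normalized likelihood score of Figure 4 could be used to flag likely hallucinations: length-normalize the log-probability the generating model assigned to its own evidence span and apply a cutoff. The threshold value and the means of obtaining per-token log-probabilities are assumptions that depend on the model and serving stack.

import math
from typing import Sequence

def normalized_likelihood(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean per-token probability of a generated evidence span."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def looks_hallucinated(token_logprobs: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag evidence whose normalized likelihood falls below an (assumed) cutoff."""
    return normalized_likelihood(token_logprobs) < threshold
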
Figure 5: Automatic LLM-based evaluation of retrieved evidence. The evaluator LLM: (1) extracts risk factors from the evidence; (2) verifies the presence of each in the note; and (3) validates each present risk factor. The same approach is adopted for evaluating signs of the query diagnosis.
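
A minimal sketch of the automatic LLM-based evaluation summarized in Figure 5, reusing the same kind of generic generate callable as in the earlier sketch. The prompts, line-based parsing, and returned fields are illustrative assumptions rather than the paper's exact implementation.

from typing import Callable, List

def evaluate_evidence(
    note: str,
    diagnosis: str,
    evidence: str,
    generate: Callable[[str], str],
) -> List[dict]:
    # (1) Extract the risk factors (or signs) of the diagnosis mentioned in the evidence.
    raw = generate(
        f"List, one per line, the risk factors for {diagnosis} mentioned in this text:\n{evidence}"
    )
    factors = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    results: List[dict] = []
    for factor in factors:
        # (2) Verify the factor actually appears in the source note (hallucination check).
        present = generate(
            f"Note:\n{note}\n\nDoes the note mention '{factor}'? Answer Yes or No."
        ).strip().lower().startswith("yes")
        # (3) Validate that a present factor is genuinely a risk factor for the diagnosis.
        valid = present and generate(
            f"Is '{factor}' a risk factor for {diagnosis}? Answer Yes or No."
        ).strip().lower().startswith("yes")
        results.append({"factor": factor, "present": present, "valid": valid})
    return results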
