Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Simone Kresevic et al. NPJ Digit Med. 2024 Apr 23;7(1):102. doi: 10.1038/s41746-024-01091-y.
Abstract

Large language models (LLMs) can potentially transform healthcare, particularly by providing the right information to the right provider at the right time in the hospital workflow. This study investigates the integration of LLMs into healthcare, specifically focusing on improving clinical decision support systems (CDSSs) through accurate interpretation of medical guidelines for chronic Hepatitis C Virus infection management. Utilizing OpenAI's GPT-4 Turbo model, we developed a customized LLM framework that incorporates retrieval augmented generation (RAG) and prompt engineering. Our framework involved converting the guidelines into a structured format that LLMs can process efficiently to produce the most accurate output. An ablation study was conducted to evaluate the impact of different formatting and learning strategies on the LLM's answer generation accuracy. The baseline GPT-4 Turbo model's performance was compared against five experimental setups with increasing levels of complexity: inclusion of in-context guidelines, guideline reformatting, and implementation of few-shot learning. Our primary outcome was the qualitative assessment of accuracy based on expert review, while secondary outcomes included the quantitative measurement of the similarity of LLM-generated responses to expert-provided answers using text-similarity scores. The results showed a significant improvement in accuracy, from 43% to 99% (p < 0.001), when guidelines were provided as context in a coherent corpus of text and non-text sources were converted into text. In addition, few-shot learning did not seem to improve overall accuracy. The study highlights that structured guideline reformatting and advanced prompt engineering (data quality vs. data quantity) can enhance the efficacy of integrating LLMs into CDSSs for guideline delivery.
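The RAG component described above can be illustrated with a minimal sketch. The code below is not the authors' framework: it assumes pre-chunked guideline text, a plain cosine-similarity retriever, and the OpenAI Python client (v1+); the embedding model name and the "gpt-4-turbo" model ID are assumptions chosen for illustration only.

```python
# Minimal RAG sketch (illustrative, not the authors' implementation): rank
# guideline chunks by cosine similarity to the question and pass the top-k
# chunks as context to the chat model.
import math

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts; the embedding model name is an assumption."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def answer(question: str, guideline_chunks: list[str], k: int = 3) -> str:
    """Retrieve the k most relevant guideline chunks and ask the model."""
    chunk_vecs = embed(guideline_chunks)
    q_vec = embed([question])[0]
    ranked = sorted(zip(guideline_chunks, chunk_vecs),
                    key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    context = "\n\n".join(chunk for chunk, _ in ranked[:k])
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # the paper used GPT-4 Turbo; this exact model ID is an assumption
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided guideline excerpts."},
            {"role": "user",
             "content": f"Guideline excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

In this sketch, "guideline reformatting" would correspond to how guideline_chunks are prepared (e.g., converting tables and figures into coherent text before chunking), which the abstract identifies as the main driver of the accuracy gain.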


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Qualitative evaluation of accuracy among all experiments from baseline.
a Accuracy for all questions. b Accuracy only for text-based questions. c Accuracy for table-based questions. d Accuracy for clinical scenario-based questions. Statistical testing is based on pairwise comparison (Chi-Squared Test) between each experimental setting and the baseline.
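To illustrate the pairwise comparison described in the caption, a chi-squared test can be run on a 2x2 table of correct vs. incorrect answer counts for one experimental setting against the baseline. The counts below are invented for illustration (they only roughly match the 43% and 99% accuracies reported in the abstract), and scipy is assumed to be available.

```python
# Illustrative pairwise chi-squared test (correct vs. incorrect answers) of one
# experimental setting against the baseline; counts are invented for illustration.
from scipy.stats import chi2_contingency

#                correct  incorrect
baseline     = [   32,       42]   # roughly 43% accuracy
experiment_5 = [   73,        1]   # roughly 99% accuracy

chi2, p_value, dof, expected = chi2_contingency([baseline, experiment_5])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")  # p < 0.001 indicates a significant difference
```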
Fig. 2. Example of a clinical decision support system integrated with large language models.
When a patient is being evaluated for HCV treatment, the doctor prescribes several tests (laboratory and imaging), whose results are stored in the institutional EHR system. The locally hosted LLM receives a standardized clinical scenario prompt populated with laboratory and imaging values extracted directly from the EHR. The standardized prompt is then sent to the LLM, which has access to the relevant guidelines, to recommend the most appropriate treatment. HCV Hepatitis C virus, EHR electronic health record, RAG retrieval augmented generation, LLM large language model.
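A minimal sketch of how such a standardized clinical scenario prompt could be assembled from EHR-derived values is shown below; the field names and template wording are illustrative assumptions, not the schema used in the paper.

```python
# Illustrative sketch of the standardized clinical-scenario prompt in Fig. 2.
# Field names and template wording are assumptions; in the described workflow
# these values would be pulled from the institutional EHR.
from dataclasses import dataclass


@dataclass
class HcvCase:
    hcv_genotype: str       # e.g. "1b"
    hcv_rna_iu_ml: int      # viral load from laboratory results
    fibrosis_stage: str     # e.g. "F4" from elastography / imaging
    prior_treatment: str    # e.g. "treatment-naive"
    egfr_ml_min: float      # renal function, relevant for regimen choice


def build_scenario_prompt(case: HcvCase) -> str:
    """Render EHR-derived values into a fixed clinical-scenario template."""
    return (
        "A patient is being evaluated for chronic HCV treatment.\n"
        f"Genotype: {case.hcv_genotype}; HCV RNA: {case.hcv_rna_iu_ml} IU/mL; "
        f"Fibrosis stage: {case.fibrosis_stage}; Prior treatment: {case.prior_treatment}; "
        f"eGFR: {case.egfr_ml_min} mL/min.\n"
        "Based on the retrieved guideline excerpts, recommend the most appropriate regimen."
    )


# Example: the rendered prompt would then be passed through the RAG pipeline sketched above.
prompt = build_scenario_prompt(HcvCase("1b", 1_200_000, "F4", "treatment-naive", 85.0))
```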
Fig. 3. Depiction of ablation study experimental settings (Experiment 1 through Experiment 5) to investigate how guideline reformatting, prompt architecture, and few-shot learning impact the accuracy and robustness of LLM outputs.
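As a rough illustration of the few-shot settings in the ablation study, the helper below prepends worked question/answer exemplars to the message list before the actual query; the message layout and exemplar content are assumptions for illustration, not the paper's prompt architecture.

```python
# Sketch of a few-shot variant: worked (question, expert answer) exemplars are
# inserted after the system message and before the actual query.
def with_few_shot(messages: list[dict], exemplars: list[tuple[str, str]]) -> list[dict]:
    shots = []
    for question, expert_answer in exemplars:
        shots.append({"role": "user", "content": question})
        shots.append({"role": "assistant", "content": expert_answer})
    return messages[:1] + shots + messages[1:]  # assumes messages[0] is the system message


# Example usage with the sketches above (exemplar text is hypothetical):
# messages = with_few_shot(base_messages,
#                          [("Which regimens are recommended for genotype 1b without cirrhosis?",
#                            "Per the provided guideline excerpts, ...")])
```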
