Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Simone Kresevic et al. NPJ Digit Med. 2024 Apr 23;7(1):102. doi: 10.1038/s41746-024-01091-y.
Abstract

Large language models (LLMs) can potentially transform healthcare, particularly by providing the right information to the right provider at the right time in the hospital workflow. This study investigates the integration of LLMs into healthcare, specifically focusing on improving clinical decision support systems (CDSSs) through accurate interpretation of medical guidelines for chronic Hepatitis C Virus infection management. Utilizing OpenAI's GPT-4 Turbo model, we developed a customized LLM framework that incorporates retrieval augmented generation (RAG) and prompt engineering. Our framework involved converting the guidelines into a structured format that LLMs can process efficiently to produce the most accurate output. An ablation study was conducted to evaluate the impact of different formatting and learning strategies on the LLM's answer generation accuracy. The baseline GPT-4 Turbo model's performance was compared against five experimental setups with increasing levels of complexity: inclusion of in-context guidelines, guideline reformatting, and implementation of few-shot learning. Our primary outcome was the qualitative assessment of accuracy based on expert review, while secondary outcomes included the quantitative measurement of the similarity of LLM-generated responses to expert-provided answers using text-similarity scores. The results showed a significant improvement in accuracy, from 43% to 99% (p < 0.001), when guidelines were provided as context in a coherent corpus of text and non-text sources were converted into text. In addition, few-shot learning did not seem to improve overall accuracy. The study highlights that structured guideline reformatting and advanced prompt engineering (data quality vs. data quantity) can enhance the efficacy of integrating LLMs into CDSSs for guideline delivery.
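The RAG component described above can be illustrated with a minimal sketch. The code below is not the authors' framework: it assumes pre-chunked guideline text, a plain cosine-similarity retriever, and the OpenAI Python client (v1+); the embedding model name and the "gpt-4-turbo" model ID are assumptions chosen for illustration only.

```python
# Minimal RAG sketch (illustrative, not the authors' implementation): rank
# guideline chunks by cosine similarity to the question and pass the top-k
# chunks as context to the chat model.
import math

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts; the embedding model name is an assumption."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def answer(question: str, guideline_chunks: list[str], k: int = 3) -> str:
    """Retrieve the k most relevant guideline chunks and ask the model."""
    chunk_vecs = embed(guideline_chunks)
    q_vec = embed([question])[0]
    ranked = sorted(zip(guideline_chunks, chunk_vecs),
                    key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    context = "\n\n".join(chunk for chunk, _ in ranked[:k])
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # the paper used GPT-4 Turbo; this exact model ID is an assumption
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided guideline excerpts."},
            {"role": "user",
             "content": f"Guideline excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

In this sketch, "guideline reformatting" would correspond to how guideline_chunks are prepared (e.g., converting tables and figures into coherent text before chunking), which the abstract identifies as the main driver of the accuracy gain.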


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Qualitative evaluation of accuracy among all experiments from baseline.
a Accuracy for all questions. b Accuracy only for text-based questions. c Accuracy for table-based questions. d Accuracy for clinical scenario-based questions. Statistical testing is based on pairwise comparison (Chi-Squared Test) between each experimental setting and the baseline.
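To illustrate the pairwise comparison described in the caption, a chi-squared test can be run on a 2x2 table of correct vs. incorrect answer counts for one experimental setting against the baseline. The counts below are invented for illustration (they only roughly match the 43% and 99% accuracies reported in the abstract), and scipy is assumed to be available.

```python
# Illustrative pairwise chi-squared test (correct vs. incorrect answers) of one
# experimental setting against the baseline; counts are invented for illustration.
from scipy.stats import chi2_contingency

#                correct  incorrect
baseline     = [   32,       42]   # roughly 43% accuracy
experiment_5 = [   73,        1]   # roughly 99% accuracy

chi2, p_value, dof, expected = chi2_contingency([baseline, experiment_5])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")  # p < 0.001 indicates a significant difference
```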
Fig. 2. Example of a clinical decision support system integrated with large language models.
When a patient is being evaluated for HCV treatment, the doctor prescribes several tests (laboratory and imaging), whose results are stored in the institutional EHR system. The locally hosted LLM receives a standardized clinical scenario prompt populated with laboratory and imaging values extracted directly from the EHR. The standardized prompt is then sent to the LLM, which has access to the relevant guidelines, to recommend the most appropriate treatment. HCV Hepatitis C virus, EHR electronic health record, RAG retrieval augmented generation, LLM large language model.
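A minimal sketch of how such a standardized clinical scenario prompt could be assembled from EHR-derived values is shown below; the field names and template wording are illustrative assumptions, not the schema used in the paper.

```python
# Illustrative sketch of the standardized clinical-scenario prompt in Fig. 2.
# Field names and template wording are assumptions; in the described workflow
# these values would be pulled from the institutional EHR.
from dataclasses import dataclass


@dataclass
class HcvCase:
    hcv_genotype: str       # e.g. "1b"
    hcv_rna_iu_ml: int      # viral load from laboratory results
    fibrosis_stage: str     # e.g. "F4" from elastography / imaging
    prior_treatment: str    # e.g. "treatment-naive"
    egfr_ml_min: float      # renal function, relevant for regimen choice


def build_scenario_prompt(case: HcvCase) -> str:
    """Render EHR-derived values into a fixed clinical-scenario template."""
    return (
        "A patient is being evaluated for chronic HCV treatment.\n"
        f"Genotype: {case.hcv_genotype}; HCV RNA: {case.hcv_rna_iu_ml} IU/mL; "
        f"Fibrosis stage: {case.fibrosis_stage}; Prior treatment: {case.prior_treatment}; "
        f"eGFR: {case.egfr_ml_min} mL/min.\n"
        "Based on the retrieved guideline excerpts, recommend the most appropriate regimen."
    )


# Example: the rendered prompt would then be passed through the RAG pipeline sketched above.
prompt = build_scenario_prompt(HcvCase("1b", 1_200_000, "F4", "treatment-naive", 85.0))
```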
Fig. 3. Depiction of ablation study experimental settings (Experiment 1 through Experiment 5) to investigate how guideline reformatting, prompt architecture, and few-shot learning impact the accuracy and robustness of LLM outputs.
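As a rough illustration of the few-shot settings in the ablation study, the helper below prepends worked question/answer exemplars to the message list before the actual query; the message layout and exemplar content are assumptions for illustration, not the paper's prompt architecture.

```python
# Sketch of a few-shot variant: worked (question, expert answer) exemplars are
# inserted after the system message and before the actual query.
def with_few_shot(messages: list[dict], exemplars: list[tuple[str, str]]) -> list[dict]:
    shots = []
    for question, expert_answer in exemplars:
        shots.append({"role": "user", "content": question})
        shots.append({"role": "assistant", "content": expert_answer})
    return messages[:1] + shots + messages[1:]  # assumes messages[0] is the system message


# Example usage with the sketches above (exemplar text is hypothetical):
# messages = with_few_shot(base_messages,
#                          [("Which regimens are recommended for genotype 1b without cirrhosis?",
#                            "Per the provided guideline excerpts, ...")])
```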
