[Preprint]. 2024 Feb 8:2024.02.08.24302376.
doi: 10.1101/2024.02.08.24302376.

Retrieval Augmented Generation Enabled Generative Pre-Trained Transformer 4 (GPT-4) Performance for Clinical Trial Screening

Ozan Unlu et al. medRxiv.

Abstract

Background: Subject screening is a key aspect of all clinical trials; however, it is traditionally a labor-intensive and error-prone task, demanding significant time and resources. With the advent of large language models (LLMs) and related technologies, a paradigm shift in natural language processing capabilities offers a promising avenue for increasing both the quality and efficiency of screening efforts. This study aimed to test whether a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-Trained Transformer 4 (GPT-4) workflow could accurately identify and report on inclusion and exclusion criteria for a clinical trial.

Methods: The Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial aims to recruit patients with symptomatic heart failure. As part of the screening process, a list of potentially eligible patients is created through an electronic health record (EHR) query. Currently, structured data in the EHR can be used to determine only 5 of 6 inclusion and 5 of 17 exclusion criteria. Trained but non-licensed study staff complete manual chart review to determine patient eligibility and record their assessment of the inclusion and exclusion criteria. We obtained the structured assessments completed by the study staff and clinical notes from the past two years and developed a clinical note-based question-answering workflow powered by a RAG architecture and GPT-4 that we named RECTIFIER (RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review). We used notes from 100 patients as a development dataset, 282 patients as a validation dataset, and 1894 patients as a test set. An expert clinician completed a blinded review of patients' charts to answer the eligibility questions and determine the "gold standard" answers. We calculated the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) for each question and screening method. We also performed bootstrapping to calculate the confidence intervals for each statistic.
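The evaluation statistics described above (MCC plus percentile-bootstrap confidence intervals over paired eligibility labels) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the bootstrap resample count, and the seed are assumptions, and labels are encoded as 1 = criterion met, 0 = not met.

```python
import math
import random

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary eligibility labels (1 = criterion met)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample patient indices with replacement,
    recompute the metric on each resample, and take the alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

Resampling whole patients (rather than individual criterion answers) keeps each bootstrap replicate consistent with the unit of screening.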

Results: Both RECTIFIER and study staff answers aligned closely with the expert clinician's answers across criteria, with accuracy ranging from 97.9% to 100% (MCC 0.837 to 1) for RECTIFIER and from 91.7% to 100% (MCC 0.644 to 1) for study staff. RECTIFIER performed better than study staff in determining the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% vs 91.7% and an MCC of 0.924 vs 0.721, respectively. Overall, the sensitivity and specificity for determining eligibility were 92.3% (CI) and 93.9% (CI) for RECTIFIER and 90.1% (CI) and 83.6% (CI) for study staff, respectively.

Conclusion: GPT-4 based solutions have the potential to improve efficiency and reduce costs in clinical trial screening. When incorporating new tools such as RECTIFIER, it is important to consider the potential hazards of automating the screening process and set up appropriate mitigation strategies such as final clinician review before patient engagement.

Keywords: Clinical Trial; Digital health; GPT-4; Generative Pre-Trained Transformer 4; Heart failure; RAG; Retrieval Augmented Generation; Screening.

Figures

Figure 1. The Workflow of the Clinical Note-Based Question Answering System Powered by the RAG Architecture
Workflow of the patient Q&A system leveraging the RAG (Retrieval Augmented Generation) architecture. The workflow consists of the following key steps: 1) clinical note retrieval, 2) note segmentation, 3) vector embeddings, and 4) similarity search, prompting, and generation.
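The four steps in this caption can be sketched in miniature. The snippet below is illustrative only: it substitutes a toy hashing embedding for the paper's (unspecified) embedding model, and the chunk size, overlap, top-k value, and prompt template are all assumptions, not RECTIFIER's actual configuration.

```python
import hashlib
import math

def segment(note, size=40):
    """Step 2: split a retrieved clinical note into overlapping word chunks."""
    words = note.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words), size // 2)] or [note]

def embed(text, dim=64):
    """Step 3: toy hashing embedding (stand-in for a real embedding model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_chunks(chunks, question, k=3):
    """Step 4a: cosine-similarity search for chunks most relevant to a criterion."""
    q = embed(question)
    scored = sorted(chunks, key=lambda c: -sum(a * b for a, b in zip(embed(c), q)))
    return scored[:k]

def build_prompt(chunks, question):
    """Step 4b: assemble retrieved context plus the criterion question for the LLM."""
    context = "\n---\n".join(chunks)
    return (f"Context from patient notes:\n{context}\n\n"
            f"Question: {question}\nAnswer yes or no.")
```

In a production RAG system, `embed` would call an embedding model and the prompt would be sent to GPT-4; only the retrieval-then-prompt structure is carried over from the figure.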
Figure 2. Comparison of Positive Predictive Value (Precision) and Sensitivity (Recall) of RECTIFIER vs Study Staff
Solid lines indicate 95% confidence intervals. AI: Aortic insufficiency; AS: Aortic stenosis; CI: Confidence interval; DM: Diabetes mellitus; HCM: Hypertrophic cardiomyopathy; MCC: Matthews correlation coefficient; PAH: Pulmonary arterial hypertension. * indicates p<0.001.
Figure 3. Performance Metrics of Study Staff and RECTIFIER for Overall Eligibility Determination
A) Performance metrics of RECTIFIER and study staff for determining overall eligibility based on 13 questions in the test set. Solid lines indicate 95% confidence intervals. B) Confusion matrices of RECTIFIER and study staff against expert clinician review for overall eligibility based on 13 questions in the test set. NPV: Negative predictive value; PPV: Positive predictive value. * indicates p<0.001.
