[Preprint]. 2024 Feb 8:2024.02.08.24302376.
doi: 10.1101/2024.02.08.24302376.

Retrieval Augmented Generation Enabled Generative Pre-Trained Transformer 4 (GPT-4) Performance for Clinical Trial Screening

Ozan Unlu et al. medRxiv.

Abstract

Background: Subject screening is a key aspect of all clinical trials; however, it is traditionally a labor-intensive and error-prone task, demanding significant time and resources. With the advent of large language models (LLMs) and related technologies, a paradigm shift in natural language processing capabilities offers a promising avenue for increasing both the quality and efficiency of screening efforts. This study aimed to test whether a Retrieval-Augmented Generation (RAG)-enabled Generative Pre-Trained Transformer 4 (GPT-4) workflow could accurately identify and report on inclusion and exclusion criteria for a clinical trial.

Methods: The Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial aims to recruit patients with symptomatic heart failure. As part of the screening process, a list of potentially eligible patients is created through an electronic health record (EHR) query. Currently, structured data in the EHR can be used to determine only 5 of 6 inclusion and 5 of 17 exclusion criteria. Trained but non-licensed study staff complete manual chart review to determine patient eligibility and record their assessment of the inclusion and exclusion criteria. We obtained the structured assessments completed by the study staff and clinical notes from the past two years and developed a clinical note-based question-answering workflow powered by a RAG architecture and GPT-4 that we named RECTIFIER (RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review). We used notes from 100 patients as a development dataset, 282 patients as a validation dataset, and 1894 patients as a test set. An expert clinician completed a blinded review of patients' charts to answer the eligibility questions and determine the "gold standard" answers. We calculated the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) for each question and screening method. We also performed bootstrapping to calculate the confidence intervals for each statistic.
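The evaluation statistics described above (MCC plus percentile-bootstrap confidence intervals over paired eligibility labels) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the bootstrap resample count, and the seed are assumptions, and labels are encoded as 1 = criterion met, 0 = not met.

```python
import math
import random

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary eligibility labels (1 = criterion met)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample patient indices with replacement,
    recompute the metric on each resample, and take the alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

Resampling whole patients (rather than individual criterion answers) keeps each bootstrap replicate consistent with the unit of screening.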

Results: Both RECTIFIER and study staff answers aligned closely with the expert clinician's answers across criteria, with accuracy ranging from 97.9% to 100% (MCC 0.837 to 1) for RECTIFIER and from 91.7% to 100% (MCC 0.644 to 1) for study staff. RECTIFIER performed better than study staff in determining the inclusion criterion of "symptomatic heart failure," with an accuracy of 97.9% vs 91.7% and an MCC of 0.924 vs 0.721, respectively. Overall, the sensitivity and specificity for determining eligibility were 92.3% (CI) and 93.9% (CI) for RECTIFIER and 90.1% (CI) and 83.6% (CI) for study staff, respectively.

Conclusion: GPT-4 based solutions have the potential to improve efficiency and reduce costs in clinical trial screening. When incorporating new tools such as RECTIFIER, it is important to consider the potential hazards of automating the screening process and set up appropriate mitigation strategies such as final clinician review before patient engagement.

Keywords: Clinical Trial; Digital health; GPT-4; Generative Pre-Trained Transformer 4; Heart failure; RAG; Retrieval Augmented Generation; Screening.

Figures

Figure 1. The Workflow of the Clinical Note-Based Question Answering System Powered by the RAG Architecture
Workflow of the patient Q&A system leveraging the RAG (Retrieval Augmented Generation) architecture. The workflow consists of the following key steps: 1) clinical note retrieval, 2) note segmentation, 3) vector embeddings, and 4) similarity search, prompting, and generation.
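The four steps in this caption can be sketched in miniature. The snippet below is illustrative only: it substitutes a toy hashing embedding for the paper's (unspecified) embedding model, and the chunk size, overlap, top-k value, and prompt template are all assumptions, not RECTIFIER's actual configuration.

```python
import hashlib
import math

def segment(note, size=40):
    """Step 2: split a retrieved clinical note into overlapping word chunks."""
    words = note.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words), size // 2)] or [note]

def embed(text, dim=64):
    """Step 3: toy hashing embedding (stand-in for a real embedding model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_chunks(chunks, question, k=3):
    """Step 4a: cosine-similarity search for chunks most relevant to a criterion."""
    q = embed(question)
    scored = sorted(chunks, key=lambda c: -sum(a * b for a, b in zip(embed(c), q)))
    return scored[:k]

def build_prompt(chunks, question):
    """Step 4b: assemble retrieved context plus the criterion question for the LLM."""
    context = "\n---\n".join(chunks)
    return (f"Context from patient notes:\n{context}\n\n"
            f"Question: {question}\nAnswer yes or no.")
```

In a production RAG system, `embed` would call an embedding model and the prompt would be sent to GPT-4; only the retrieval-then-prompt structure is carried over from the figure.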
Figure 2. Comparison of Positive Predictive Value (Precision) and Sensitivity (Recall) of RECTIFIER vs Study Staff
Solid lines indicate 95% confidence intervals. AI: Aortic insufficiency; AS: Aortic stenosis; CI: Confidence interval; DM: Diabetes mellitus; HCM: Hypertrophic cardiomyopathy; MCC: Matthews correlation coefficient; PAH: Pulmonary arterial hypertension. * indicates p<0.001.
Figure 3. Performance Metrics of Study Staff and RECTIFIER for Overall Eligibility Determination
A) Performance metrics of RECTIFIER and study staff for determining overall eligibility based on 13 questions in the test set. Solid lines indicate 95% confidence intervals. B) Confusion matrices of RECTIFIER and study staff against expert clinician review for overall eligibility based on 13 questions in the test set. NPV: Negative predictive value; PPV: Positive predictive value. * indicates p<0.001.
