Observational Study
Clin Pharmacol Ther. 2020 Sep;108(3):644-652.
doi: 10.1002/cpt.1966. Epub 2020 Jul 18.

An Electronic Health Record Text Mining Tool to Collect Real-World Drug Treatment Outcomes: A Validation Study in Patients With Metastatic Renal Cell Carcinoma

Sylvia A van Laar et al. Clin Pharmacol Ther. 2020 Sep.

Abstract

Real-world evidence can close the inferential gap between marketing authorization studies and clinical practice. However, the current standard for extracting real-world data from electronic health records (EHRs) for treatment evaluation is manual review (MR), which is time-consuming and laborious. Clinical Data Collector (CDC) is a novel natural language processing and text mining software tool for both structured and unstructured EHR data; it improves efficiency by showing only the relevant EHR sections. We investigated CDC as a real-world data (RWD) collection method by applying CDC queries for patient inclusion and information extraction to a cohort of patients with metastatic renal cell carcinoma (RCC) receiving systemic drug treatment. Baseline patient characteristics, disease characteristics, and treatment outcomes were extracted and compared with MR for validation. Using CDC, 100 patients receiving 175 treatments were included, a 99% correspondence with inclusion by MR. Calculated median overall survival was 21.7 months (95% confidence interval (CI) 18.7-24.8) with CDC vs. 21.7 months (95% CI 18.6-24.8) with MR, and progression-free survival was 8.9 months (95% CI 5.4-12.4) vs. 7.6 months (95% CI 5.7-9.4), respectively. The highest F1-scores were found for cancer-related variables (88.1-100), followed by comorbidities (71.5-90.4) and adverse drug events (53.3-74.5); scores varied most for the international metastatic RCC database criteria (51.4-100). Mean data collection time was 12 minutes with CDC vs. 86 minutes with MR. In conclusion, CDC is a promising tool for retrieving RWD from EHRs because it can identify the correct patient population as well as relevant outcome data, such as overall survival and progression-free survival.
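
The F1-scores reported above combine precision and recall into a single agreement measure. The abstract does not state the exact formula the authors used, so the following is only a minimal sketch under two assumptions: that MR serves as the reference standard, and that the score is the standard harmonic mean of precision and recall scaled to 0-100. The function name and the patient identifiers are hypothetical.

```python
def f1_score(tool_positive, reference_positive):
    """Return the F1-score (0-100) of a tool's findings against a reference.

    tool_positive: IDs flagged by the tool (assumed here to be CDC).
    reference_positive: IDs flagged by the reference (assumed here to be MR).
    """
    tool = set(tool_positive)
    reference = set(reference_positive)

    true_pos = len(tool & reference)    # found by both
    false_pos = len(tool - reference)   # found by the tool only
    false_neg = len(reference - tool)   # missed by the tool

    precision = true_pos / (true_pos + false_pos) if tool else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)


# Hypothetical example: patients flagged with a comorbidity by each method.
cdc_flagged = {"p01", "p02", "p03", "p04"}
mr_flagged = {"p02", "p03", "p04", "p05"}
score = f1_score(cdc_flagged, mr_flagged)
```

In this made-up example three of four flagged patients agree in each direction, so precision and recall are both 0.75.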

Conflict of interest statement

The authors declared no competing interests for this work.

Figures

Figure 1
Architecture of the Clinical Data Collector (CDC) on‐premises isolation platform. (a) A copy of the electronic health record (EHR) data is transferred to, stored in, and cleaned in a local MSSQL Server relational database. (b) A natural language processing (NLP) transformation application programming interface (API) pseudonymizes the data. (c) The search engine is compatible with the structure used in the data warehouse. (d) A client in which the user builds queries; the results window in CDC shows only the parts of EHR documents that contain the user-defined criteria. (e) Text mining of keywords and keyword combinations is supported by an online thesaurus. [Colour figure can be viewed at wileyonlinelibrary.com]
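
Steps (d) and (e) of this caption describe showing only the parts of a document that match thesaurus-supported keywords. CDC's actual implementation is not described here, so the sketch below is purely illustrative: the thesaurus entries, the function name `matching_snippets`, and the sentence-level splitting are all assumptions, not the tool's API.

```python
import re

# Hypothetical thesaurus: one concept mapped to the terms that should match it.
THESAURUS = {
    "renal cell carcinoma": ["renal cell carcinoma", "rcc", "kidney cancer"],
}


def matching_snippets(document, concept, thesaurus=THESAURUS):
    """Return only the sentences of `document` that mention `concept`
    or one of its synonyms (case-insensitive, whole-word match)."""
    terms = thesaurus.get(concept, [concept])
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [sentence for sentence in sentences if pattern.search(sentence)]


note = "Patient seen today. History of RCC with lung metastases. No fever."
snippets = matching_snippets(note, "renal cell carcinoma")
```

Returning matching sentences rather than whole documents mirrors the idea that a reviewer reads only the relevant sections.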
Figure 2
Data extraction approach from structured and unstructured data using Clinical Data Collector.
Figure 3
Flowchart of patient inclusion by manual review and by Clinical Data Collector (CDC). The two approaches yielded very similar patient samples, so use of CDC is satisfactory for the intended purpose. DTC, Diagnosis Treatment Combination.
Figure 4
Kaplan–Meier survival plots determined from manual review and Clinical Data Collector data for cabozantinib, everolimus, nivolumab, pazopanib and sunitinib combined. (a) Overall survival, (b) Progression‐free survival. CI, confidence interval.
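
For readers unfamiliar with how a median survival time is read from a Kaplan–Meier curve, the following minimal sketch implements the standard product-limit estimator. It is not the authors' analysis code; the function names and the follow-up data are hypothetical, and confidence intervals are omitted.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: follow-up time per patient (e.g. months).
    events: 1 if the event (death or progression) was observed, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events at time t
        leaving = 0    # everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival drops to 0.5 or below."""
    for t, survival in curve:
        if survival <= 0.5:
            return t
    return None  # median not reached


# Hypothetical follow-up in months; the third patient is censored.
curve = kaplan_meier([2, 4, 6, 8, 10], [1, 1, 0, 1, 1])
median = median_survival(curve)
```

Censored patients leave the risk set without lowering the curve, which is why the estimate differs from a simple proportion of deaths.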
Figure 5
Bland–Altman plots of continuous variables collected using CDC vs. manual review, with mean difference and 95% confidence interval. (a) Length: −0.21 cm (−4.2 to 4.8), (b) Weight: 1.1 kg (−6.6 to 8.7), (c) Age: −0.17 years (−0.27 to 0.24), (d) Estimated glomerular filtration rate (eGFR): 0.22 ml/min/1.73 m² (−5.3 to 5.8), (e) Alanine transaminase (ALAT): 0.19 U/L (−3.2 to 3.6), (f) Aspartate aminotransferase (ASAT): 0.24 (−4.0 to 4.5).
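
A Bland–Altman comparison summarizes the paired differences between two measurement methods. The sketch below computes the mean difference and the conventional 95% limits of agreement (mean ± 1.96 standard deviations of the differences). Whether the intervals in the caption were derived exactly this way is an assumption, and the weights used here are invented for illustration.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (mean difference, lower limit, upper limit) for paired values.

    Limits are the conventional 95% limits of agreement:
    mean difference +/- 1.96 * sample SD of the differences.
    """
    if len(method_a) != len(method_b):
        raise ValueError("paired measurements must have equal length")
    differences = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = mean(differences)
    sd_diff = stdev(differences)
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Hypothetical weights in kg for the same four patients.
cdc_weights = [70, 80, 90, 100]
manual_weights = [69, 81, 88, 100]
mean_diff, lower, upper = bland_altman(cdc_weights, manual_weights)
```

A mean difference near zero with narrow limits suggests the two collection methods can be used interchangeably for that variable.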
