Cancers (Basel). 2023 Mar 20;15(6):1853.
doi: 10.3390/cancers15061853.

Replication of Real-World Evidence in Oncology Using Electronic Health Record Data Extracted by Machine Learning


Corey M Benedum et al. Cancers (Basel).

Abstract

Generating meaningful real-world evidence (RWE) requires unstructured data from electronic health records (EHRs), which are often missing from administrative claims; however, obtaining relevant data from unstructured EHR sources is resource-intensive. In response, researchers are using natural language processing (NLP) with machine learning (ML) techniques (i.e., ML extraction) to extract real-world data (RWD) at scale. This study assessed the quality and fitness-for-use of EHR-derived oncology data curated using NLP with ML, compared with the reference standard of expert abstraction. Using a sample of 186,313 patients with lung cancer from a nationwide EHR-derived de-identified database, we performed a series of replication analyses representative of common retrospective observational research that uses complex EHR-derived data to generate evidence. Eligible patients were selected into biomarker- and treatment-defined cohorts, first with expert-abstracted and then with ML-extracted data. We used the biomarker-defined cohorts to analyze biomarker-associated survival and the treatment-defined cohorts to analyze comparative treatment effectiveness. Across all analyses, the results differed by less than 8% between the data curation methods, and similar conclusions were reached. These results show that high-performance ML-extracted variables, from models trained on expert-abstracted data, can yield results similar to those obtained with abstracted data, unlocking the ability to perform oncology research at scale.
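
The abstract reports that results differed by less than 8% between curation methods but does not say how that difference was computed. As a minimal sketch, assuming a relative difference taken against the expert-abstracted estimate, the comparison might look like the following. The function name and the two hazard ratios are hypothetical and are not values from the study.

```python
def relative_difference_pct(reference: float, comparison: float) -> float:
    """Absolute difference of `comparison` from `reference`, as a percentage
    of `reference` (assumed definition; the abstract does not specify one)."""
    if reference == 0:
        raise ValueError("reference estimate must be non-zero")
    return abs(comparison - reference) / abs(reference) * 100.0


# Hypothetical hazard ratios for illustration only, not study results.
hr_abstracted = 0.80      # estimate from expert-abstracted data (reference)
hr_ml_extracted = 0.84    # estimate from ML-extracted data

diff = relative_difference_pct(hr_abstracted, hr_ml_extracted)
print(f"Relative difference: {diff:.1f}%")
```

Using the abstracted estimate as the denominator mirrors its role as the reference standard in the abstract; a symmetric definition (dividing by the mean of the two estimates) would be an equally plausible reading.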

Keywords: artificial intelligence; cancer; electronic health records; machine learning; natural language processing; oncology; quality; real-world data; real-world evidence.

Conflict of interest statement

At the time of the study, all authors report employment at Flatiron Health, Inc., an independent subsidiary of the Roche Group, and stock ownership in Roche. ME and AC report equity ownership in Flatiron Health, Inc. (initiated before acquisition by Roche in April 2018).

Figures

Figure 1
Conceptual diagram of EHR data curation highlighting approaches to define variables. Abbreviations: ML: machine learning; NLP: natural language processing; RWD: real-world data. (Panel (A)): Unstructured data are reviewed by trained clinical abstractors to collect relevant data from patients’ charts. (Panel (B)): Process for developing models and extracting information from unstructured data sources from the patient’s chart.
Figure 2
Data curation approach for replication analyses. Abbreviations: ML: machine learning.
Figure 3
Results from replication of natural history study. Abbreviations: ML: machine learning; NSCLC: non-small cell lung cancer. (Panel (A)): Kaplan–Meier curves for patients with ROS1-positive and -negative NSCLC by data curation approach. (Panel (B)): Association between ROS1 status and survival by data curation approach.
Figure 4
Results from replication of comparative effectiveness study. Abbreviations: 1L: first line; ALK: anaplastic lymphoma kinase; EGFR: epidermal growth factor receptor; PD-L1: programmed death-ligand 1; ML: machine learning. (Panel (A)): Covariate balance plot, abstracted cohort. (Panel (B)): Covariate balance plot, ML-extracted cohort. (Panel (C)): Distribution of weights stratified by treatment group, abstracted cohort. (Panel (D)): Distribution of weights stratified by treatment group, ML-extracted cohort. (Panel (E)): Effect of treatment group on survival, stratified by data curation approach.
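
The Figure 4 caption refers to covariate balance plots and distributions of weights without naming the weighting scheme. The sketch below assumes inverse-probability-of-treatment weights and the standardized mean difference as the balance metric, which is a common pairing but is not confirmed by this page. All function names and the toy data are hypothetical.

```python
from math import sqrt


def iptw_weights(treated, propensity):
    """Unstabilized inverse-probability-of-treatment weights (assumed scheme)."""
    return [1.0 / p if t else 1.0 / (1.0 - p) for t, p in zip(treated, propensity)]


def weighted_mean(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)


def weighted_var(x, w):
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / sum(w)


def standardized_mean_difference(x, treated, w):
    """Weighted difference in group means divided by the pooled standard deviation."""
    x1 = [xi for xi, t in zip(x, treated) if t]
    w1 = [wi for wi, t in zip(w, treated) if t]
    x0 = [xi for xi, t in zip(x, treated) if not t]
    w0 = [wi for wi, t in zip(w, treated) if not t]
    pooled_sd = sqrt((weighted_var(x1, w1) + weighted_var(x0, w0)) / 2.0)
    if pooled_sd == 0:
        raise ValueError("covariate has zero variance in both groups")
    return (weighted_mean(x1, w1) - weighted_mean(x0, w0)) / pooled_sd


# Toy data for illustration only: age, treatment flag, estimated propensity score.
age = [61, 70, 58, 66, 73, 64]
treated = [1, 1, 1, 0, 0, 0]
propensity = [0.6, 0.4, 0.7, 0.5, 0.3, 0.5]

weights = iptw_weights(treated, propensity)
smd_before = standardized_mean_difference(age, treated, [1.0] * len(age))
smd_after = standardized_mean_difference(age, treated, weights)
print(f"SMD before weighting: {smd_before:.3f}, after weighting: {smd_after:.3f}")
```

Computing the same metric separately for the abstracted and ML-extracted cohorts would give the side-by-side comparison that panels (A) and (B) describe.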
