Replication of Real-World Evidence in Oncology Using Electronic Health Record Data Extracted by Machine Learning

Corey M Benedum¹, Arjun Sondhi¹, Erin Fidyk¹, Aaron B Cohen¹,², Sheila Nemeth¹, Blythe Adamson¹,³, Melissa Estévez¹, Selen Bozkurt¹

Affiliations

¹ Flatiron Health, Inc., 233 Spring Street, New York, NY 10003, USA.
² Department of Medicine, NYU Grossman School of Medicine, New York, NY 10016, USA.
³ Comparative Health Outcomes, Policy and Economics (CHOICE) Institute, University of Washington, Seattle, WA 98195, USA.

PMID: 36980739
PMCID: PMC10046618
DOI: 10.3390/cancers15061853

Corey M Benedum et al. Cancers (Basel). 2023 Mar 20;15(6):1853. doi: 10.3390/cancers15061853.

Abstract

Meaningful real-world evidence (RWE) generation requires unstructured data found in electronic health records (EHRs), which are often missing from administrative claims; however, obtaining relevant data from unstructured EHR sources is resource-intensive. In response, researchers are using natural language processing (NLP) with machine learning (ML) techniques (i.e., ML extraction) to extract real-world data (RWD) at scale. This study assessed the quality and fitness-for-use of EHR-derived oncology data curated using NLP with ML, compared with the reference standard of expert abstraction. Using a sample of 186,313 patients with lung cancer from a nationwide EHR-derived de-identified database, we performed a series of replication analyses representative of common analyses conducted in retrospective observational research that uses complex EHR-derived data to generate evidence. Eligible patients were selected into biomarker- and treatment-defined cohorts, first with expert-abstracted data and then with ML-extracted data. We used the biomarker-defined cohorts to analyze biomarker-associated survival and the treatment-defined cohorts to analyze treatment comparative effectiveness. Across all analyses, the results differed by less than 8% between the data curation methods, and similar conclusions were reached. These results show that variables extracted by high-performing ML models trained on expert-abstracted data can yield results similar to those obtained with abstracted data, unlocking the ability to perform oncology research at scale.

Keywords: artificial intelligence; cancer; electronic health records; machine learning; natural language processing; oncology; quality; real-world data; real-world evidence.
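
The biomarker-associated survival analysis described in the abstract is summarized in the paper with Kaplan–Meier curves computed separately for each data curation approach. The following is a minimal sketch of the Kaplan–Meier product-limit estimator in plain Python, not the authors' implementation. The function name `kaplan_meier` and the toy follow-up data are hypothetical and are not taken from the study.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove deaths and censored patients at t
    return curve


# Hypothetical toy data (months of follow-up), NOT from the study:
# the same estimator is applied once per data curation approach.
abstracted_curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
ml_extracted_curve = kaplan_meier([1, 2, 3, 5, 5], [1, 0, 1, 1, 0])

for label, curve in [("abstracted", abstracted_curve),
                     ("ML-extracted", ml_extracted_curve)]:
    print(label, [(t, round(s, 3)) for t, s in curve])
```

In practice a survival analysis library would also supply confidence intervals and hazard ratio estimates. The sketch only shows how a curve is built from follow-up times and event indicators, so that curves from the two curation approaches can be compared.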

Conflict of interest statement

At the time of the study, all authors report employment at Flatiron Health, Inc., an independent subsidiary of the Roche Group, and stock ownership in Roche. ME and AC report equity ownership in Flatiron Health, Inc. (initiated before acquisition by Roche in April 2018).

Figures

**Figure 1**
Conceptual diagram of EHR data curation highlighting approaches to define variables. Abbreviations: ML: machine learning; NLP: natural language processing; RWD: real-world data. (Panel (A)): Unstructured data are reviewed by trained clinical abstractors to collect relevant data from patients’ charts. (Panel (B)): Process for developing models and extracting information from unstructured data sources from the patient’s chart.

**Figure 2**
Data curation approach for replication analyses. Abbreviations: ML: machine learning.

**Figure 3**
Results from replication of natural history study. Abbreviations: ML: machine learning; NSCLC: non-small cell lung cancer. (Panel (A)): Kaplan–Meier curves for patients with *ROS1*-positive and -negative NSCLC by data curation approach. (Panel (B)): Association between *ROS1* status and survival by data curation approach.

**Figure 4**
Results from replication of comparative effectiveness study. Abbreviations: 1L: first line; *ALK*: anaplastic lymphoma kinase; *EGFR*: epidermal growth factor receptor; PD-L1: programmed death-ligand 1; ML: machine learning. (Panel (A)): Covariate balance plot, abstracted cohort. (Panel (B)): Covariate balance plot, ML-extracted cohort. (Panel (C)): Distribution of weights stratified by treatment group, abstracted cohort. (Panel (D)): Distribution of weights stratified by treatment group, ML-extracted cohort. (Panel (E)): Effect of treatment group on survival, stratified by data curation approach.
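
Covariate balance plots such as those in Figure 4 (Panels A and B) commonly display a standardized mean difference (SMD) per covariate after weighting. The caption does not state which balance metric or weighting scheme the authors used, so the sketch below is an assumption for illustration only. It computes a weighted SMD from per-group weighted means and variances. The function names and the toy values are hypothetical.

```python
import math


def weighted_mean_var(values, weights):
    """Weighted mean and (population-style) weighted variance."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def standardized_mean_difference(x_treated, w_treated, x_control, w_control):
    """Difference in weighted means divided by the pooled standard deviation.

    Assumption: the pooled SD uses the weighted variances of both groups;
    other conventions (e.g. unweighted variances) also exist.
    """
    m1, v1 = weighted_mean_var(x_treated, w_treated)
    m0, v0 = weighted_mean_var(x_control, w_control)
    pooled_sd = math.sqrt((v1 + v0) / 2)
    return (m1 - m0) / pooled_sd if pooled_sd > 0 else 0.0


# Hypothetical toy covariate (e.g. age) and weights, NOT from the study.
age_treated, w_treated = [60, 70, 65], [1.0, 1.5, 0.8]
age_control, w_control = [62, 68, 71], [1.2, 0.9, 1.1]

smd = standardized_mean_difference(age_treated, w_treated, age_control, w_control)
print(f"weighted SMD for age: {smd:.3f}")
```

Running the same computation on the abstracted cohort and on the ML-extracted cohort gives one balance value per covariate and curation approach, which is the kind of side-by-side comparison the two balance panels present.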

Grants and funding

N/A/Flatiron Health (United States)
