JCO Clin Cancer Inform. 2021 Apr;5:469-478.
doi: 10.1200/CCI.20.00165.

Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data


Yasmin H Karimi et al. JCO Clin Cancer Inform. 2021 Apr.

Abstract

Purpose: Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC).

Methods: Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We compared the ability of each NLP model to identify the presence, timing, and site of recurrence, when compared against manual chart review and International Classification of Diseases coding.

Results: A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%).

Conclusion: We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.
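The Results report discrimination as an area under the curve (AUC), and Figure 3B refers to sensitivity and specificity tradeoffs. As a minimal sketch of how such metrics can be computed from predicted probabilities and chart-review labels, the following uses a rank-based AUC and a single decision threshold. The function names and toy data are hypothetical and are not taken from the paper; the 0.2 threshold is borrowed from the Figure 1 caption purely for illustration, and the authors' actual evaluation code is not described here.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    case scores higher than a randomly chosen negative case (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when a case is called positive
    if its predicted probability is strictly above `threshold`."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 1 = distant recurrence per manual chart review.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.15, 0.30, 0.60, 0.80, 0.25, 0.90]

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.2)
print(auc, sens, spec)
```

Lowering the threshold trades specificity for sensitivity, which is the kind of tradeoff a sensitivity versus specificity plot makes visible.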


Conflict of interest statement

Douglas W. Blayney
Leadership: Artelo Biosciences
Stock and Other Ownership Interests: Artelo Biosciences, Madorra
Consulting or Advisory Role: Creare, Daiichi Sankyo, Embold Health, Lilly, Google, Ipsen
Research Funding: Amgen, BeyondSpring Pharmaceuticals
Open Payments Link: https://openpaymentsdata.cms.gov/physician/728442

Allison W. Kurian
Research Funding: Myriad Genetics
Other Relationship: Ambry Genetics, Color Genomics, GeneDx/BioReference, InVitae, Genentech

Daniel Rubin
Consulting or Advisory Role: Roche/Genentech
Research Funding: GE Healthcare, Philips Healthcare
Patents, Royalties, Other Intellectual Property: Several pending patents on AI algorithms

No other potential conflicts of interest were reported.

Figures

FIG 1.
Model development. Step 1, recurrence prediction model: no recurrence within quarter OR predicted probability of recurrence > .2. Step 2, sites of recurrence prediction model: site of recurrence within quarter.
FIG 2.
(A) Breast cohort local versus distant recurrence probability predicted by the natural language processing (NLP) models (cohort B). (B) Hepatocellular carcinomas cohort local versus distant recurrence probability predicted by the NLP models (cohort C).
FIG 3.
(A) Venn diagram showing overlap between patients who were found to have distant recurrence as predicted by NLP versus metastatic disease as predicted by ICD codes. (B) Sensitivity or specificity tradeoffs for distant recurrence detection in breast cancer cohort B. ICD, International Classification of Diseases; NLP, natural language processing.
FIG A1.
Inclusion criteria and use of each cohort in the development and validation of the NLP models. CT or MR, computed tomography or magnetic resonance; HCC, hepatocellular carcinomas; NLP, natural language processing.
FIG A2.
Area under the curves for timing and presence of distant recurrence.


