Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data
- PMID: 33929889
- PMCID: PMC8462655
- DOI: 10.1200/CCI.20.00165
Abstract
Purpose: Large-scale analysis of real-world evidence is often limited to structured data fields, which do not contain reliable information on recurrence status or disease sites. In this report, we describe a natural language processing (NLP) framework that uses free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast cancer and hepatocellular carcinoma (HCC).
Methods: Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence on the basis of clinician notes, radiology reports, and pathology reports, with manual curation as the reference. A second NLP model was trained and validated to identify sites of recurrence. We evaluated the ability of each NLP model to identify the presence, timing, and site of recurrence against manual chart review and International Classification of Diseases coding.
Results: A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detected distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) in the breast cancer cohort and 0.95 (95% CI, 0.88 to 0.98) in the HCC cohort. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) than International Classification of Diseases coding did (2.31%).
Conclusion: We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.
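As an illustration of the area under the curve and 95% CI reported in the Results, the following minimal Python sketch computes an AUC from predicted recurrence scores and attaches a percentile bootstrap interval. The abstract does not state how the authors computed their interval, so the bootstrap is an assumption, and the function names (`roc_auc`, `bootstrap_ci`) and the ten toy cases are hypothetical and are not study data.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method,
    not necessarily the one used in the study)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample needs both classes for the AUC to be defined
        stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical toy data: 1 = distant recurrence on chart review, 0 = none.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.70, 0.40, 0.60, 0.30, 0.25, 0.20, 0.10, 0.05]

auc = roc_auc(labels, scores)  # 23 of 24 positive/negative pairs ranked correctly
ci = bootstrap_ci(labels, scores)
print(f"AUC = {auc:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The pairwise formulation is quadratic in the number of cases, which is acceptable for a cohort of this size; a rank-based implementation would be preferable for much larger datasets.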
Comment in
- Regarding the Utility of Unstructured Data and Natural Language Processing for Identification of Breast Cancer Recurrence. JCO Clin Cancer Inform. 2021 Sep;5:1024-1025. doi: 10.1200/CCI.21.00091. PMID: 34637320. Free PMC article.
- Reply to Ritzwoller et al. JCO Clin Cancer Inform. 2021 Sep;5:1026-1027. doi: 10.1200/CCI.21.00145. PMID: 34637331.