Using natural language processing to improve efficiency of manual chart abstraction in research: the case of breast cancer recurrence
- PMID: 24488511
- PMCID: PMC3939853
- DOI: 10.1093/aje/kwt441
Abstract
The increasing availability of electronic health records (EHRs) creates opportunities for automated extraction of information from clinical text. We hypothesized that natural language processing (NLP) could substantially reduce the burden of manual abstraction in studies examining outcomes, like cancer recurrence, that are documented in unstructured clinical text, such as progress notes, radiology reports, and pathology reports. We developed an NLP-based system using open-source software to process electronic clinical notes from 1995 to 2012 for women with early-stage incident breast cancers to identify whether and when recurrences were diagnosed. We developed and evaluated the system using clinical notes from 1,472 patients receiving EHR-documented care in an integrated health care system in the Pacific Northwest. A separate study provided the patient-level reference standard for recurrence status and date. The NLP-based system correctly identified 92% of recurrences and estimated diagnosis dates within 30 days for 88% of these. Specificity was 96%. The NLP-based system overlooked 5 of 65 recurrences, 4 because electronic documents were unavailable. The NLP-based system identified 5 other recurrences incorrectly classified as nonrecurrent in the reference standard. If used in similar cohorts, NLP could reduce by 90% the number of EHR charts abstracted to identify confirmed breast cancer recurrence cases at a rate comparable to traditional abstraction.
Keywords: breast cancer recurrence; chart abstraction; natural language processing.
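The sensitivity figure in the abstract can be reproduced from the counts it reports. This is a minimal sketch of that arithmetic only: the function and variable names are illustrative and not from the study, and it assumes the 92% figure is computed over the 65 recurrences mentioned, 5 of which the system overlooked.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of reference-standard positives that the system identified."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no reference-standard positives")
    return true_positives / total


# Counts taken from the abstract: 65 recurrences, 5 overlooked by the NLP system.
recurrences = 65
missed = 5
detected = recurrences - missed

sens = sensitivity(detected, missed)
print(f"sensitivity = {detected}/{recurrences} = {sens:.1%}")  # 60/65 = 92.3%
```

The abstract does not give the counts behind the 96% specificity or the 88% date-accuracy figure, so those are not recomputed here.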
Comment in
- Carrell et al. respond to "Observational research and the EHR". Am J Epidemiol. 2014 Mar 15;179(6):762-3. doi: 10.1093/aje/kwt444. Epub 2014 Jan 30. PMID: 24488509
- Invited commentary: Observational research in the age of the electronic health record. Am J Epidemiol. 2014 Mar 15;179(6):759-61. doi: 10.1093/aje/kwt443. Epub 2014 Jan 30. PMID: 24488512