BMC Med Res Methodol. 2022 May 12;22(1):136.
doi: 10.1186/s12874-022-01583-z.

Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system

Yifu Chen et al. BMC Med Res Methodol.

Abstract

Background: Data points manually extracted from health records are collated at the institutional, provincial, and national levels to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated information extraction system is needed to ensure the timeliness and scalability of research data.

Methods: We used a dataset of 100 synoptic operative reports and 100 pathology reports, split evenly into training and test sets of 50 reports each for each report type. The training set guided our development of a natural language processing (NLP) extraction pipeline, which accepts scanned images of operative and pathology reports. The system uses a combination of rule-based and transfer learning methods to extract numeric encodings from text. We also developed visualization tools to compare the manual and automated extractions. The code for this paper was made available on GitHub.
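As an illustration of what a rule-based step that maps report text to a numeric encoding might look like, here is a minimal sketch. It is not the authors' implementation: the field label, the codebook values, and the function name `extract_histologic_type` are all hypothetical, and the sketch assumes the scanned image has already been converted to plain text.

```python
import re

# Hypothetical codebook: maps a normalised text value to a numeric encoding.
# These codes are illustrative only and are not taken from the paper's codebooks.
HISTOLOGIC_TYPE_CODES = {
    "invasive ductal carcinoma": 1,
    "invasive lobular carcinoma": 2,
}

# Hypothetical rule: a synoptic field is a label, a colon, then a value on the same line.
FIELD_PATTERN = re.compile(
    r"^[ \t]*histologic type[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_histologic_type(report_text):
    """Return the numeric code for the histologic type, or None if absent or unknown."""
    match = FIELD_PATTERN.search(report_text)
    if match is None:
        return None
    value = match.group(1).lower()
    return HISTOLOGIC_TYPE_CODES.get(value)


sample_report = (
    "Specimen: left breast\n"
    "Histologic Type: Invasive Ductal Carcinoma\n"
    "Tumor size: 12 mm\n"
)
print(extract_histologic_type(sample_report))
```

Returning `None` for a missing or unrecognised value keeps the rule conservative; in a combined pipeline, such cases could be handed to a learned model instead of being guessed.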

Results: A test set of 50 operative and 50 pathology reports was used to evaluate the extraction accuracies of the NLP pipeline. The gold standard, defined as manual extraction by expert reviewers, yielded accuracies of 90.5% for operative reports and 96.0% for pathology reports, while the NLP system achieved overall accuracies of 91.9% (operative) and 95.4% (pathology). The pipeline successfully extracted outcomes data pertinent to breast cancer tumor characteristics (e.g. presence of invasive carcinoma, size, histologic type), prognostic factors (e.g. number of lymph nodes with micro-metastases and macro-metastases, pathologic stage), and treatment-related variables (e.g. margins, neo-adjuvant treatment, surgical indication) with high accuracy. Out of the 48 variables across operative and pathology codebooks, NLP yielded 43 variables with F-scores of at least 0.90; in comparison, a trained human annotator yielded 44 variables with F-scores of at least 0.90.
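The per-variable comparison described above can be sketched as follows. This is an illustrative calculation under stated assumptions, not the paper's evaluation code: the abstract does not say how the F-score is averaged for multi-valued variables, so macro-averaging over the observed codes is assumed here, and the variable names and values are made up.

```python
def accuracy(gold, predicted):
    """Fraction of reports where the extracted code equals the gold-standard code."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Macro-averaged F1 over every code seen in gold or predicted (an assumption)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    pairs = list(zip(gold, predicted))
    scores = []
    for label in set(gold) | set(predicted):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # each label occurs at least once in gold or predicted.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def count_variables_at_or_above(gold_by_var, pred_by_var, threshold=0.90):
    """Count variables whose macro F1 reaches the threshold."""
    return sum(
        macro_f1(gold_by_var[name], pred_by_var[name]) >= threshold
        for name in gold_by_var
    )


# Made-up example: two variables, four reports each.
gold_by_var = {"margins": [0, 1, 1, 0], "size": [1, 2, 2, 1]}
pred_by_var = {"margins": [0, 1, 1, 0], "size": [1, 2, 1, 1]}
n_high = count_variables_at_or_above(gold_by_var, pred_by_var)
```

Applied to all 48 codebook variables, a count of this kind is what the "43 variables with F-scores of at least 0.90" figure summarises, subject to the averaging assumption noted above.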

Conclusions: The NLP system achieves near-human-level accuracy in both operative and pathology reports using a minimal curated dataset. This system uniquely provides a robust solution for transparent, adaptable, and scalable automation of data extraction from patient health records. It may serve to advance breast cancer clinical research by facilitating collection of vast amounts of valuable health data at a population level.

Keywords: Breast cancer; Health data; Natural language processing.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Overview of the NLP system design, input, and outputs
Fig. 2. (a) NLP versus the second human reviewer as compared to the ground truth (GT) in operative reports. (b) NLP versus the second human reviewer as compared to the GT in pathology reports
