Comparative Study

J Am Med Inform Assoc. 2020 Jan 1;27(1):89-98.

doi: 10.1093/jamia/ocz153.

Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks

Affiliations

¹ Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA.
² Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland, USA.
³ Louisiana Tumor Registry, Louisiana State University Health Sciences Center School of Public Health, New Orleans, Louisiana, USA.
⁴ Information Management Services Inc, Calverton, Maryland, USA.

PMID: 31710668
PMCID: PMC7489089
DOI: 10.1093/jamia/ocz153

Mohammed Alawad et al. J Am Med Inform Assoc. 2020.

Abstract

Objective: We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks jointly, leveraging shared representations across the tasks, to achieve state-of-the-art performance in classification accuracy and computational efficiency.

Materials and methods: Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC).
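To make the hard parameter sharing idea concrete, the following is a minimal sketch that counts trainable parameters for one shared encoder with 5 task-specific output layers, compared with 5 independent single-task CNNs. Only the class counts per task come from the abstract. The vocabulary size, embedding width, filter sizes, filter count, and the use of max-over-time pooling are hypothetical choices for illustration and are not the authors' reported configuration.

```python
# Class counts per task are taken from the abstract. Every other number
# below (vocabulary size, embedding width, filter sizes, filter count)
# is a hypothetical value chosen only for illustration.
TASK_CLASSES = {
    "site": 65,
    "laterality": 4,
    "behavior": 3,
    "histology": 63,
    "grade": 5,
}


def shared_encoder_params(vocab_size, embed_dim, filter_sizes, num_filters):
    """Parameters of a word-embedding layer plus parallel 1D convolutions."""
    embedding = vocab_size * embed_dim
    # Each filter spans `size` words of width `embed_dim`, plus one bias.
    convs = sum((size * embed_dim + 1) * num_filters for size in filter_sizes)
    return embedding + convs


def head_params(feature_dim, num_classes):
    """Parameters of one softmax classification layer (weights and biases)."""
    return (feature_dim + 1) * num_classes


def count_params(vocab_size=20000, embed_dim=300,
                 filter_sizes=(3, 4, 5), num_filters=100):
    # Assumed max-over-time pooling: one feature per filter.
    feature_dim = num_filters * len(filter_sizes)
    encoder = shared_encoder_params(vocab_size, embed_dim,
                                    filter_sizes, num_filters)
    heads = {task: head_params(feature_dim, n)
             for task, n in TASK_CLASSES.items()}
    # Hard parameter sharing: one encoder, one output layer per task.
    hard_sharing = encoder + sum(heads.values())
    # Baseline: a separate encoder and output layer for every task.
    five_single_task = sum(encoder + h for h in heads.values())
    # The largest individual single-task model (primary site, 65 classes).
    largest_single = encoder + max(heads.values())
    return hard_sharing, five_single_task, largest_single


hard_sharing, five_single_task, largest_single = count_params()
print(hard_sharing, five_single_task, largest_single)
```

Under these assumed sizes, the shared model is within about half a percent of the largest single-task model, because the embedding and convolution layers dominate the count and the 4 additional output layers are comparatively small. Five separate single-task models cost roughly 5 times as much.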

Results: MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports, respectively, across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN.
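The percentages above can be read as a per-document exact-match rate: a report counts as correctly classified only if all 5 predicted labels agree with the registry labels. The sketch below illustrates that reading, along with the per-document count of correct tasks. The function names and the toy labels are hypothetical, and the abstract does not spell out the exact scoring procedure, so this interpretation is an assumption.

```python
TASKS = ("site", "laterality", "behavior", "histology", "grade")


def correct_task_counts(y_true, y_pred):
    """For each document, count how many of the tasks were predicted correctly."""
    return [
        sum(truth[task] == pred[task] for task in TASKS)
        for truth, pred in zip(y_true, y_pred)
    ]


def all_tasks_correct_rate(y_true, y_pred):
    """Fraction of documents for which every task label is correct."""
    counts = correct_task_counts(y_true, y_pred)
    return sum(count == len(TASKS) for count in counts) / len(counts)


# Toy labels, invented for illustration only (not data from the study).
y_true = [
    {"site": "breast", "laterality": "left", "behavior": "malignant",
     "histology": "ductal", "grade": "2"},
    {"site": "lung", "laterality": "right", "behavior": "malignant",
     "histology": "adeno", "grade": "3"},
    {"site": "colon", "laterality": "none", "behavior": "in situ",
     "histology": "adeno", "grade": "1"},
]
y_pred = [
    {"site": "breast", "laterality": "left", "behavior": "malignant",
     "histology": "ductal", "grade": "2"},   # 5 of 5 correct
    {"site": "lung", "laterality": "right", "behavior": "malignant",
     "histology": "adeno", "grade": "2"},    # grade wrong: 4 of 5
    {"site": "rectum", "laterality": "none", "behavior": "malignant",
     "histology": "adeno", "grade": "2"},    # 2 of 5 correct
]

counts = correct_task_counts(y_true, y_pred)
rate = all_tasks_correct_rate(y_true, y_pred)
print(counts, rate)
```

This measure is much stricter than per-task accuracy, since a single wrong label among the 5 makes the whole document count as a miss. That would be consistent with the comparatively low percentages reported for all models.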

Conclusions: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks, while keeping training and inference times similar to those of a single task-specific model.

Keywords: cancer pathology reports; convolutional neural network; deep learning; information extraction; multitask learning; natural language processing.

Figures

**Figure 1.**
Louisiana Tumor Registry (LA) data preparation flow chart.

**Figure 2.**
The number of occurrences per label of all cancer characteristics.

**Figure 3.**
Architecture diagram of the hard parameter sharing multitask convolutional neural network model. Colors differentiate convolution layers, in which each set of filters uses a different filter size.

**Figure 4.**
Architecture diagram of the cross-stitch multitask convolutional neural network model.

**Figure 5.**
Prospective evaluation micro- and macro-averaged F scores comparing the multitask convolutional neural network (MTCNN) models and the baseline models. CS: cross-stitch; HS: hard parameter sharing; RFC: random forest classifier; SVM: support vector machine.

**Figure 6.**
Comparing the multitask convolutional neural network (MTCNN) models and the baseline models in terms of number of correctly classified tasks for each document: (A) retrospective evaluation and (B) prospective evaluation. CS: cross-stitch; HS: hard parameter sharing; RFC: random forest classifier; SVM: support vector machine.
