JAMIA Open. 2022 Jun 16;5(2):ooac049.
doi: 10.1093/jamiaopen/ooac049. eCollection 2022 Jul.

Automatic information extraction from childhood cancer pathology reports

Hong-Jun Yoon et al. JAMIA Open.

Abstract

Objectives: The International Classification of Childhood Cancer (ICCC) facilitates the effective classification of a heterogeneous group of cancers in the important pediatric population. However, no machine learning models have been developed for ICCC classification. We developed deep learning-based models that extract information from cancer pathology reports according to the ICD-O-3 coding standard. In this article, we describe extending those models to perform ICCC classification.

Materials and methods: We developed 2 models, ICD-O-3 classification followed by ICCC recoding (Model 1) and direct ICCC classification (Model 2), and 4 scenarios subject to the training sample size. We evaluated these models on a corpus of 29 206 reports with age at diagnosis between 0 and 19 years from 6 state cancer registries.
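
The two designs described in Materials and methods can be contrasted in a short Python sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the keyword-based `toy_icdo3_classifier` and `toy_iccc_classifier` take the place of the deep learning models, and the `RECODE_TABLE` entries are illustrative placeholders keyed on histology only, not the official ICD-O-3 to ICCC recode table.

```python
from typing import Dict, Optional, Tuple

# Placeholder recode entries (assumption): ICD-O-3 histology -> ICCC subgroup.
# A real recode also depends on primary site and uses the published table.
RECODE_TABLE: Dict[str, str] = {
    "9835": "I(a)",
    "9650": "II(a)",
}


def toy_icdo3_classifier(report: str) -> Tuple[str, str]:
    """Hypothetical stand-in for an ICD-O-3 model; returns (site, histology)."""
    text = report.lower()
    if "lymphoblastic leukemia" in text:
        return ("C42.1", "9835")
    if "hodgkin" in text:
        return ("C77.9", "9650")
    return ("C80.9", "8000")


def toy_iccc_classifier(report: str) -> str:
    """Hypothetical stand-in for a model trained directly on ICCC labels."""
    text = report.lower()
    if "lymphoblastic leukemia" in text:
        return "I(a)"
    if "hodgkin" in text:
        return "II(a)"
    return "unclassified"


def model1_classify(report: str) -> Optional[str]:
    """Model 1: predict ICD-O-3 codes, then recode them to ICCC by lookup."""
    _site, histology = toy_icdo3_classifier(report)
    return RECODE_TABLE.get(histology)  # None when no recode entry exists


def model2_classify(report: str) -> str:
    """Model 2: predict the ICCC code directly from the report text."""
    return toy_iccc_classifier(report)


if __name__ == "__main__":
    sample = "Bone marrow biopsy: precursor B lymphoblastic leukemia."
    print(model1_classify(sample), model2_classify(sample))
```

In this sketch Model 1 makes its prediction in two steps, so its ICCC output depends on both the intermediate ICD-O-3 prediction and the lookup, whereas Model 2 has a single prediction step.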

Results: Our findings suggest that the direct ICCC classification (Model 2) is substantially better than reusing the ICD-O-3 classification model (Model 1). Applying the uncertainty quantification mechanism to assess the confidence of the algorithm in assigning a code demonstrated that the model achieved a micro-F1 score of 0.987 while abstaining (not sufficiently confident to assign a code) on only 14.8% of ambiguous pathology reports.
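
To make the trade-off between abstention and micro-F1 concrete, the following is a minimal Python sketch. The abstract does not specify the uncertainty quantification mechanism, so the rule used here (abstain when the highest class probability falls below a fixed `threshold`) is an assumption, as are the function names and the example probabilities. For single-label classification, micro-F1 over the retained reports equals their accuracy, which is what the sketch computes.

```python
from typing import List, Optional, Sequence, Tuple


def predict_or_abstain(
    probs: Sequence[float], labels: Sequence[str], threshold: float
) -> Optional[str]:
    """Return the top label, or None (abstain) if its probability is too low."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else None


def micro_f1_with_abstention(
    y_true: Sequence[str], y_pred: Sequence[Optional[str]]
) -> Tuple[float, float]:
    """Return (micro-F1 on non-abstained reports, abstention rate)."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    abstention_rate = 1 - len(kept) / len(y_true)
    if not kept:
        return 0.0, abstention_rate
    # Single-label case: each error is one false positive and one false
    # negative, so micro-precision = micro-recall = micro-F1 = accuracy.
    correct = sum(1 for t, p in kept if t == p)
    return correct / len(kept), abstention_rate


if __name__ == "__main__":
    labels = ["I", "II", "III"]  # hypothetical ICCC main groups
    probabilities: List[List[float]] = [
        [0.95, 0.03, 0.02],
        [0.50, 0.40, 0.10],  # low confidence, so the model abstains
        [0.10, 0.85, 0.05],
        [0.05, 0.05, 0.90],
    ]
    y_true = ["I", "II", "II", "I"]
    y_pred = [predict_or_abstain(p, labels, threshold=0.8) for p in probabilities]
    print(y_pred, micro_f1_with_abstention(y_true, y_pred))
```

Raising `threshold` in this sketch increases the abstention rate and typically raises the score on the reports that remain, which is the kind of trade-off the reported figures describe.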

Conclusions: Our experimental results suggest that machine learning-based automatic extraction of ICCC information from childhood cancer pathology reports is a reliable means of supplementing human annotators at state cancer registries, reading and abstracting the majority of childhood cancer pathology reports accurately.

Keywords: cancer pathology reports; information extraction; machine learning; pediatric cancer.

Figures

Figure 1. Number of childhood cancer pathology reports by ICCC main and subgroup codes.
Figure 2. Number of childhood cancer pathology reports by ICCC main codes and age at diagnosis.
Figure 3. Model architecture for ICCC classification from childhood cancer pathology reports. (A) Model 1: ICD-O-3 classification then ICCC recoding. (B) Model 2: direct ICCC classification.
