A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study

Joseph Ross Mitchell et al. J Med Internet Res. 2022 Mar 23;24(3):e27210. doi: 10.2196/27210.

Abstract

Background: Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, bidirectional encoder representations from transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more.

Objective: The aim of this study is to develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text oncological pathology reports.

Methods: We pursued three specific aims: extract accurate tumor site and histology descriptions from free-text pathology reports, accommodate the diverse terminology used to indicate the same pathology, and provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients, comprising 121 million words. Next, we trained a question-and-answer (Q&A) model that connects a Q&A layer to the base pathology language model to answer pathology questions. Our Q&A system was designed to search each pathology report for the answers to two predefined questions: "What organ contains the tumor?" and "What is the kind of tumor or carcinoma?" This involved supervised training on 8197 pathology reports, each with ground truth answers to these two questions determined by certified tumor registrars. The data set included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes and another to predict histology codes. Our final system is a network of 3 BERT-based models, which we call the CancerBERT network (caBERTnet). A minimal code sketch of this three-model design follows. We evaluated caBERTnet using a sequestered test data set of 2050 pathology reports with ground truth answers determined by certified tumor registrars.
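
As a rough illustration of the three-model design described above (a sketch under assumptions, not the authors' released code), such a pipeline could be wired together with the Hugging Face Transformers library; the checkpoint paths below are hypothetical placeholders:

# Minimal sketch of a caBERTnet-style pipeline, assuming the Hugging Face
# Transformers pipeline API. All checkpoint paths are hypothetical
# placeholders, not the authors' released models.
from transformers import pipeline

# Model A: span-extraction Q&A over the raw pathology report text.
qa = pipeline("question-answering", model="path/to/cabert-qa")

report_text = open("pathology_report.txt").read()

# The two predefined questions described in the Methods.
site_phrase = qa(question="What organ contains the tumor?",
                 context=report_text)["answer"]
histology_phrase = qa(question="What is the kind of tumor or carcinoma?",
                      context=report_text)["answer"]

# Models B and C: sequence classifiers mapping the extracted phrases to
# ICD-O-3 site and histology codes, respectively.
site_classifier = pipeline("text-classification", model="path/to/cabert-site")
histology_classifier = pipeline("text-classification", model="path/to/cabert-histology")

icdo3_site = site_classifier(site_phrase)[0]["label"]
icdo3_histology = histology_classifier(histology_phrase)[0]["label"]
print(icdo3_site, icdo3_histology)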

Results: caBERTnet's accuracies for predicting group-level site and histology codes were 93.53% (1895/2026) and 97.60% (1993/2042), respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training data set were 92.95% (1794/1930) and 96.01% (1853/1930), respectively.
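
For reference, the top-5 accuracy metric quoted above can be computed as follows; this is an illustrative sketch with made-up stand-in arrays, not the study's data or evaluation code:

# Illustrative top-N accuracy computation with made-up stand-in data.
import numpy as np

def top_n_accuracy(probs: np.ndarray, labels: np.ndarray, n: int = 5) -> float:
    """Fraction of samples whose true label is among the n highest-scoring codes."""
    top_n = np.argsort(probs, axis=1)[:, -n:]       # indices of the n largest scores per row
    hits = (top_n == labels[:, None]).any(axis=1)   # is the true label anywhere in the top n?
    return float(hits.mean())

rng = np.random.default_rng(0)
probs = rng.random((2050, 214))                     # e.g., 2050 test reports x 214 site codes
labels = rng.integers(0, 214, size=2050)
print(f"top-5 accuracy: {top_n_accuracy(probs, labels):.4f}")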

Conclusions: We have developed an NLP system that outperforms existing algorithms at predicting ICD-O-3 codes across an extensive range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.

Keywords: BERT; ICD-O-3; NLP; cancer; deep learning; natural language processing; pathology; transformer.


Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Sequence of transfer-learning steps used in training the CancerBERT base language model. BERT: bidirectional encoder representations from transformers; ICU: intensive care unit; MIMIC-III: Medical Information Mart for Intensive Care, version 3.

Figure 2
Lesson plan for the caBERT network, consisting of one question-and-answer model (model A) and two classification models, one for primary site (model B) and one for histology (model C). BioASQ: Biomedical Semantic Indexing and Question Answering; Q&A: question and answer; SQuAD: Stanford Question Answering Dataset.

Figure 3
Flowchart depicting the data curation process for creating the Moffitt fine-tuning data sets (used in the site and histology question-and-answer and classification tasks). MCR: Moffitt Cancer Registry; MRN: medical record number.

Figure 4
The final caBERT network (caBERTnet) connects caBERT instances A, B, and C, which perform site and histology question answering, International Classification of Diseases for Oncology, Third Edition (ICD-O-3) primary site code classification, and ICD-O-3 histology code classification, respectively.

Figure 5
The effect of culling rare tumor sites and histologies on the top-N accuracy of predicting fine-grained International Classification of Diseases for Oncology, Third Edition codes.

Figure 6
Accuracy of predicting tumor site group codes from unstructured and previously unseen pathology reports on solid tumors, broken down to show performance within each site group. The overall accuracy across all site groups was 93.53% (1895/2026).

Figure 7
Accuracy of predicting tumor histology group codes from unstructured and previously unseen pathology reports on solid tumors, broken down to show performance within each histology group. The overall accuracy across all histology groups was 97.60% (1993/2042).
