A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study
- PMID: 35319481
- PMCID: PMC8987958
- DOI: 10.2196/27210
Abstract
Background: Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, bidirectional encoder representations from transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more.
Objective: The aim of this study is to develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text oncological pathology reports.
Methods: We pursued 3 specific aims: extract accurate tumor site and histology descriptions from free-text pathology reports, accommodate the diverse terminology used to indicate the same pathology, and provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a question-and-answer (Q&A) model that connects a Q&A layer to the base pathology language model to answer pathology questions. Our Q&A system was designed to search each pathology report for the answers to 2 predefined questions: "What organ contains the tumor?" and "What is the kind of tumor or carcinoma?" This involved supervised training on 8197 pathology reports, each with ground truth answers to these 2 questions determined by certified tumor registrars. The data set included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were then used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site and histology codes. This involved fine-tuning 2 additional BERT models: one to predict site codes and another to predict histology codes. Our final system is a network of 3 BERT-based models, which we call the CancerBERT network (caBERTnet). We evaluated caBERTnet using a sequestered test data set of 2050 pathology reports with ground truth answers determined by certified tumor registrars.
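The data flow described in the Methods can be sketched roughly as follows. This is only an illustration of how the 3 models connect, not the authors' implementation: the names `run_pipeline`, `Extraction`, `toy_answer`, and the lookup tables are hypothetical, the stand-in functions use simple keyword lookup where the real system uses fine-tuned BERT models, and the sketch assumes each code classifier receives only its own extracted phrase.

```python
from dataclasses import dataclass
from typing import Callable

# The 2 predefined questions quoted in the Methods.
SITE_QUESTION = "What organ contains the tumor?"
HISTOLOGY_QUESTION = "What is the kind of tumor or carcinoma?"

# Hypothetical interfaces: (question, report text) -> extracted phrase,
# and extracted phrase -> ICD-O-3 code.
AnswerFn = Callable[[str, str], str]
CodeFn = Callable[[str], str]


@dataclass
class Extraction:
    site_phrase: str
    histology_phrase: str
    site_code: str
    histology_code: str


def run_pipeline(report: str, answer: AnswerFn,
                 site_coder: CodeFn, histology_coder: CodeFn) -> Extraction:
    """Stage 1: ask both questions. Stage 2: map each phrase to a code."""
    site_phrase = answer(SITE_QUESTION, report)
    histology_phrase = answer(HISTOLOGY_QUESTION, report)
    return Extraction(
        site_phrase=site_phrase,
        histology_phrase=histology_phrase,
        site_code=site_coder(site_phrase),
        histology_code=histology_coder(histology_phrase),
    )


# Toy stand-ins (assumptions for illustration only, not trained models).
SITE_TERMS = {"breast": "C50.9"}
HISTOLOGY_TERMS = {"invasive ductal carcinoma": "8500/3"}


def toy_answer(question: str, report: str) -> str:
    """Return the first known term found in the report for the given question."""
    terms = SITE_TERMS if question == SITE_QUESTION else HISTOLOGY_TERMS
    text = report.lower()
    return next((term for term in terms if term in text), "")


result = run_pipeline(
    "Left breast, core biopsy: invasive ductal carcinoma.",
    answer=toy_answer,
    site_coder=lambda phrase: SITE_TERMS.get(phrase, "unknown"),
    histology_coder=lambda phrase: HISTOLOGY_TERMS.get(phrase, "unknown"),
)
print(result)
```

Passing the 3 models in as callables keeps the sketch independent of any particular library; in practice each callable would wrap a fine-tuned model.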
Results: caBERTnet's accuracies for predicting group-level site and histology codes were 93.53% (1895/2026) and 97.6% (1993/2042), respectively. For fine-grained ICD-O-3 codes with 5 or more samples each in the training data set, the top-5 accuracies for site and histology were 92.95% (1794/1930) and 96.01% (1853/1930), respectively.
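As a minimal sketch of the metric, assuming the usual definition of top-5 accuracy (a prediction counts as correct when the true code appears among the 5 highest-ranked candidates), the snippet below computes top-k accuracy on toy data and recomputes the reported percentages from the counts given above. The function names and the toy ranked lists are illustrative assumptions; only the counts come from the Results.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int = 5) -> float:
    """Fraction of samples whose true code is among the first k ranked predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one sample is required")
    hits = sum(1 for preds, truth in zip(ranked_predictions, truths)
               if truth in preds[:k])
    return hits / len(truths)


def percent(correct: int, total: int) -> float:
    """Accuracy as a percentage rounded to 2 decimal places."""
    return round(100 * correct / total, 2)


# Toy ranked predictions (illustrative only, not study data).
toy_ranked = [["C50.9", "C50.4"], ["C34.1", "C34.3"], ["C18.7", "C18.2"]]
toy_truth = ["C50.4", "C34.9", "C18.7"]
print(top_k_accuracy(toy_ranked, toy_truth, k=2))

# Counts reported in the Results: (correct, total).
reported = {
    "group-level site": (1895, 2026),
    "group-level histology": (1993, 2042),
    "top-5 fine-grained site": (1794, 1930),
    "top-5 fine-grained histology": (1853, 1930),
}
for name, (correct, total) in reported.items():
    print(f"{name}: {percent(correct, total)}%")
```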
Conclusions: We have developed an NLP system that outperforms existing algorithms at predicting ICD-O-3 codes across an extensive range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.
Keywords: BERT; ICD-O-3; NLP; cancer; deep learning; natural language processing; pathology; transformer.
©Joseph Ross Mitchell, Phillip Szepietowski, Rachel Howard, Phillip Reisman, Jennie D Jones, Patricia Lewis, Brooke L Fridley, Dana E Rollison. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 23.03.2022.
Conflict of interest statement
Conflicts of Interest: None declared.