Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity

Briton Park¹, Nicholas Altieri¹, John DeNero², Anobel Y Odisho³ ⁴ ⁵, Bin Yu¹ ² ⁶

Affiliations

¹ Department of Statistics, University of California, Berkeley, California, USA.
² Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA.
³ Department of Urology and Helen Diller Family Comprehensive Cancer Center, School of Medicine, University of California, San Francisco, California, USA.
⁴ Department of Epidemiology & Biostatistics, School of Medicine, University of California, San Francisco, California, USA.
⁵ Center for Digital Health Innovation, University of California, San Francisco, California, USA.
⁶ Chan-Zuckerberg Biohub, San Francisco, California, USA.

PMID: 34604711
PMCID: PMC8484934
DOI: 10.1093/jamiaopen/ooab085

Briton Park et al. JAMIA Open. 2021 Sep 30;4(3):ooab085. doi: 10.1093/jamiaopen/ooab085. eCollection 2021 Jul.

Abstract

Objective: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer-to-cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations, which give both location-based information and document-level labels for each pathology report.
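To make the zero-shot string similarity idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each candidate class is described by its label text and that a report is assigned the class whose label text best matches some span of the report. The function names, the sliding-window matching, the use of `difflib.SequenceMatcher`, and the example report are all illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_window_similarity(report_tokens, label):
    """Highest similarity between the label and any report window of equal token length."""
    label_tokens = tokenize(label)
    n = len(label_tokens)
    target = " ".join(label_tokens)
    best = 0.0
    for i in range(max(1, len(report_tokens) - n + 1)):
        window = " ".join(report_tokens[i:i + n])
        best = max(best, SequenceMatcher(None, window, target).ratio())
    return best


def zero_shot_label(report_text, candidate_labels):
    """Pick the candidate label whose text is most similar to a span of the report.

    No labeled training example of the chosen class is required; only the
    label string itself is used (hypothetical simplification of ZSS).
    """
    tokens = tokenize(report_text)
    scores = {label: best_window_similarity(tokens, label) for label in candidate_labels}
    return max(scores, key=scores.get), scores


# Made-up report text and label set, for illustration only.
report = "Specimen: right kidney, partial nephrectomy. Diagnosis: clear cell renal cell carcinoma."
labels = [
    "clear cell renal cell carcinoma",
    "papillary renal cell carcinoma",
    "chromophobe renal cell carcinoma",
]
predicted, scores = zero_shot_label(report, labels)
print(predicted, scores)
```

Because the score depends only on the label text, a class never seen in training can still be predicted, which is the property the zero-shot setting relies on.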

Materials and methods: Our data consist of 250 pathology reports each for kidney, colon, and lung cancer, dated from 2002 to 2019, from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare the HCTC and ZSS methods to state-of-the-art baselines, including conventional machine learning methods and deep learning methods.

Results: For our HCTC method, we see an improvement of up to 0.1 in micro-F1 and 0.04 in macro-F1, averaged across cancers and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 in micro-F1 and 0.23 in macro-F1, averaged across cancers and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations.
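Since the results are reported in both micro-F1 and macro-F1, a short sketch of how the two differ may help. It uses the standard definitions for single-label classification: micro-F1 pools true positives, false positives, and false negatives over all classes, while macro-F1 averages the per-class F1 scores, so rare classes weigh as much as common ones. The function name and the toy labels are made up for illustration and are not data from the study.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    classes = sorted(set(y_true) | set(y_pred))
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro


# Toy example: one common class ("a") and two rare ones ("b", "c").
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "c"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.833 0.63
```

In this toy case the single error on the rare class "b" lowers macro-F1 much more than micro-F1, which is why the two metrics can show different improvements for the same method.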

Conclusions: Methods based on transfer learning across cancers, and on augmenting information extraction methods with string similarity priors, can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.

Keywords: cancer; natural language processing; pathology.

Figures

**Figure 1.**
Average macro-F1 (A) and micro-F1 (B) performance for test instances where the label is not seen during training as a function of 10, 20, and 40 labeled examples on colon, kidney, and lung cancer pathology reports. The results presented include the mean performance using ZSS across 10 random splits of the data and 95% confidence intervals for the unique labels case. Note that the number of zero-shot test instances decreases as the number of training instances increases.

**Figure 2.**
Average macro-F1 (A) and micro-F1 (B) performance for test instances where the label is not seen during training as a function of 10, 20, and 40 labeled examples on colon, kidney, and lung cancer pathology reports. The results presented include the mean performance using ZSS-thresholding across 10 random splits of the data and 95% confidence intervals for the unique labels case. Note that the number of zero-shot test instances decreases as the number of training instances increases.

Cited by

TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models.
Kefeli J, Tatonetti N. Patterns (N Y). 2024 Feb 21;5(3):100933. doi: 10.1016/j.patter.2024.100933. eCollection 2024 Mar 8. PMID: 38487800. Free PMC article.
