A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification

Andrew E Blanchard, Shang Gao, Hong-Jun Yoon, J Blair Christian, Eric B Durbin, Xiao-Cheng Wu, Antoinette Stroup, Jennifer Doherty, Stephen M Schwartz, Charles Wiggins, Linda Coyle, Lynne Penberthy, Georgia D Tourassi

PMID: 35020599
PMCID: PMC9533247
DOI: 10.1109/JBHI.2022.3141976

Andrew E Blanchard et al. IEEE J Biomed Health Inform. 2022 Jun;26(6):2796-2803. doi: 10.1109/JBHI.2022.3141976. Epub 2022 Jun 3.

Abstract

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on the classification of cancer pathology reports and find a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
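The two-part training loss described in the abstract can be illustrated with a minimal sketch. It assumes the model already produces per-class probabilities for a batch of reports and for a batch of keywords. The function names, the `kw_weight` parameter, and the choice of mean cross-entropy for both terms are illustrative assumptions, not details taken from the paper.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class over a batch."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def combined_loss(doc_probs, doc_labels, kw_probs, kw_labels, kw_weight=1.0):
    """Sum of a loss on raw documents and a loss on class keywords.

    kw_weight is a hypothetical knob controlling how strongly the keyword
    term contributes; the abstract does not specify any weighting.
    """
    doc_term = cross_entropy(doc_probs, doc_labels)
    kw_term = cross_entropy(kw_probs, kw_labels)
    return doc_term + kw_weight * kw_term

# Toy batch: two reports from a common class (0), one keyword for a rare class (1).
doc_probs = [[0.9, 0.1], [0.8, 0.2]]
doc_labels = [0, 0]
kw_probs = [[0.5, 0.5]]
kw_labels = [1]

loss = combined_loss(doc_probs, doc_labels, kw_probs, kw_labels)
doc_only = combined_loss(doc_probs, doc_labels, kw_probs, kw_labels, kw_weight=0.0)
```

In this toy batch the rare class never appears among the reports, so only the keyword term penalizes the model for assigning it low probability. That is the intuition behind adding keywords to every batch.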

Figures

**Fig. 1.**
CNN model performance on the development dataset for three different tasks (site, subsite, histology). First row shows fraction of classes vs training samples per class. Second row shows test accuracy vs training samples for the baseline CNN model (blue), CNN + CUIs (green), and CNN + Class Weights (red). The x-axis for all plots is scaled by a factor of 5 (i.e. the intervals are 0–50, 50–500, 500–5000, and >5000 training samples).

**Fig. 2.**
For each class in the subsite task, the maximum model output probability for all bigrams in the development training corpus is shown vs the number of training samples. The two X’s in each figure correspond to the example bigrams and scores shown in Tables IV–V.

**Fig. 3.**
For each class in the site task, the maximum model output probability for all bigrams in the development training corpus is shown vs the number of training samples.

**Fig. 4.**
For each class in the histology task, the maximum model output probability for all bigrams in the development training corpus is shown vs the number of training samples.
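The bigram analysis summarized in the captions for Figs. 2–4 can be sketched as follows: score every distinct bigram in the training corpus with the model and record, for each class, the highest output probability and the bigram that produced it. This is a rough sketch. Whitespace tokenization, the `predict_proba` callable, and the toy stand-in model are all assumptions made for illustration and do not reflect the authors' CNN or preprocessing.

```python
def bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def max_bigram_probability(corpus, predict_proba, n_classes):
    """For each class, find the bigram with the highest model output probability.

    predict_proba is a hypothetical callable mapping a text string to a list
    of n_classes probabilities.
    """
    best = [(0.0, None)] * n_classes
    seen = set()
    for doc in corpus:
        for bg in bigrams(doc.lower().split()):
            if bg in seen:
                continue  # score each distinct bigram once
            seen.add(bg)
            probs = predict_proba(bg)
            for c in range(n_classes):
                if probs[c] > best[c][0]:
                    best[c] = (probs[c], bg)
    return best

def toy_predict_proba(text):
    """Stand-in for a trained classifier with two classes."""
    if "upper lobe" in text:
        return [0.9, 0.1]
    if "lower lobe" in text:
        return [0.2, 0.8]
    return [0.5, 0.5]

corpus = ["mass in right upper lobe", "nodule in left lower lobe"]
result = max_bigram_probability(corpus, toy_predict_proba, n_classes=2)
```

A class whose best bigram still receives a low probability is one for which no short phrase strongly activates the model. The captions plot this quantity against the number of training samples per class.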


