IEEE J Biomed Health Inform. 2022 Jun;26(6):2796-2803.
doi: 10.1109/JBHI.2022.3141976. Epub 2022 Jun 3.

A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification


Andrew E Blanchard et al. IEEE J Biomed Health Inform. 2022 Jun.

Abstract

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominates. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on the classification of cancer pathology reports and find a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties arising from limited training data.
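As a rough illustration of a loss with contributions from both raw data and keywords, the following minimal sketch adds a cross-entropy term computed on a keyword batch to the one computed on a document batch. It is not the authors' implementation: the function names, the `kw_weight` parameter, and the toy logits are all assumptions made for illustration, and the abstract does not say how the two contributions are weighted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true label over a batch."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

def combined_loss(doc_logits, doc_labels, kw_logits, kw_labels, kw_weight=1.0):
    """Document loss plus a keyword loss.

    kw_weight is a hypothetical knob; the abstract does not specify
    how the two terms are balanced.
    """
    doc_term = cross_entropy(doc_logits, doc_labels)
    kw_term = cross_entropy(kw_logits, kw_labels)
    return doc_term + kw_weight * kw_term

# Toy model outputs (assumed values) for a 3-class problem.
doc_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # two full reports
doc_labels = [0, 1]
kw_logits = [[0.0, 0.0, 0.0]]                    # one keyword for rare class 2
kw_labels = [2]

loss = combined_loss(doc_logits, doc_labels, kw_logits, kw_labels)
```

In this toy batch the model is uninformative on the rare-class keyword, so the keyword term dominates the total; in a real training loop the gradient of that term would push the model toward the class associated with the keyword.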


Figures

Fig. 1.
CNN model performance on the development dataset for three different tasks (site, subsite, histology). First row shows fraction of classes vs training samples per class. Second row shows test accuracy vs training samples for the baseline CNN model (blue), CNN + CUIs (green), and CNN + Class Weights (red). The x-axis for all plots is scaled by a factor of 5 (i.e. the intervals are 0–50, 50–500, 500–5000, and >5000 training samples).
Fig. 2.
For each class in the subsite task, the maximum model output probability for all bigrams in the development training corpus is shown vs the number of training samples. The two X’s in each figure correspond to the example bigrams and scores shown in Tables IV–V.
Fig. 3.
For each class in the site task, the maximum model output probability for all bigrams in the development training corpus is shown vs the number of training samples.
Fig. 4.
For each class in the histology task, the maximum model output probability for all bigrams in the development training corpus is shown vs the number of training samples.
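
The quantity plotted in Figs. 2–4, the maximum model output probability per class over all bigrams in a corpus, can be sketched as below. This is an assumed reconstruction from the captions only: `toy_predict_proba` is a hypothetical stand-in for a trained classifier, and the tokens and class indices are invented for illustration.

```python
def bigrams(tokens):
    """All adjacent token pairs in one tokenized document."""
    return list(zip(tokens, tokens[1:]))

def max_bigram_probability(corpus, predict_proba, num_classes):
    """For each class, return (highest probability, bigram that achieved it)."""
    best = [(0.0, None)] * num_classes
    seen = set()
    for tokens in corpus:
        for bg in bigrams(tokens):
            if bg in seen:  # score each distinct bigram once
                continue
            seen.add(bg)
            probs = predict_proba(bg)
            for c in range(num_classes):
                if probs[c] > best[c][0]:
                    best[c] = (probs[c], bg)
    return best

def toy_predict_proba(bigram):
    """Hypothetical stand-in for a trained model: counts cue words per class."""
    cues = {"lung": 0, "breast": 1}  # invented cue words and class indices
    scores = [1.0, 1.0]
    for tok in bigram:
        if tok in cues:
            scores[cues[tok]] += 2.0
    total = sum(scores)
    return [s / total for s in scores]

corpus = [
    ["upper", "lobe", "lung", "mass"],
    ["breast", "duct", "carcinoma"],
]
best = max_bigram_probability(corpus, toy_predict_proba, num_classes=2)
```

A class whose best bigram still receives a low probability is one the model cannot recognize from any short phrase in the corpus, which is the kind of difficulty the captions relate to the number of training samples.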

