2020 May 19:8:178.
doi: 10.3389/fpubh.2020.00178. eCollection 2020.

Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis

Koichi Fujiwara et al. Front Public Health.

Abstract

A considerable amount of health record (HR) data has been stored due to recent advances in the digitalization of medical systems. However, HR data are not always easy to analyze, particularly when the number of persons with a target disease is very small relative to the population. This situation is called the imbalanced data problem. Over-sampling and under-sampling are two approaches for redressing an imbalance between minority and majority examples, and they can be combined into ensemble algorithms. However, these approaches do not function when the absolute number of minority examples is small, which is called the extremely imbalanced and small minority (EISM) data problem. The present work proposes a new algorithm called boosting combined with heuristic under-sampling and distribution-based sampling (HUSDOS-Boost) to solve the EISM data problem. To make an artificially balanced dataset from the original imbalanced dataset, HUSDOS-Boost uses under-sampling to eliminate redundant majority examples based on prior boosting results, and over-sampling to generate artificial minority examples that follow the minority class distribution. The performance and characteristics of HUSDOS-Boost were evaluated through application to eight imbalanced datasets. In addition, the algorithm was applied to original clinical HR data to detect patients with stomach cancer. The results showed that HUSDOS-Boost outperformed current imbalanced data handling methods, particularly when the data are EISM. Thus, the proposed HUSDOS-Boost is a useful methodology for HR data analysis.
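The general idea of building a balanced dataset by combining under-sampling and over-sampling can be illustrated with a minimal sketch. This is not the authors' HUSDOS-Boost: it swaps in plain random under-sampling for the heuristic, boosting-guided under-sampling, and it assumes each minority feature follows an independent Gaussian, which is only one possible reading of "distribution-based" sampling. The function name `rebalance` and its parameters are hypothetical.

```python
import random
import statistics


def rebalance(majority, minority, n_under, n_over, seed=0):
    """Return (kept_majority, minority_plus_synthetic).

    majority, minority: lists of equal-length numeric feature tuples.
    n_under: number of majority examples to keep.
    n_over: number of synthetic minority examples to generate.
    """
    rng = random.Random(seed)

    # Under-sampling (assumption: purely random, not boosting-guided).
    kept = rng.sample(majority, min(n_under, len(majority)))

    # Over-sampling (assumption: independent Gaussian per feature,
    # fitted to the observed minority examples).
    columns = list(zip(*minority))
    means = [statistics.mean(col) for col in columns]
    # stdev needs at least two points; fall back to 0.0 otherwise.
    stdevs = [statistics.stdev(col) if len(col) > 1 else 0.0 for col in columns]
    synthetic = [
        tuple(rng.gauss(m, s) for m, s in zip(means, stdevs))
        for _ in range(n_over)
    ]
    return kept, list(minority) + synthetic


# Toy data: 1000 majority examples, 3 minority examples, 2 features each.
majority = [(float(i), float(i % 7)) for i in range(1000)]
minority = [(500.0, 3.0), (510.0, 4.0), (495.0, 2.0)]

kept, boosted = rebalance(majority, minority, n_under=50, n_over=47)
print(len(kept), len(boosted))  # 50 50
```

With only three minority examples, the fitted means and standard deviations are very noisy, which hints at why a small absolute minority count is harder than imbalance alone.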

Keywords: boosting; health record analysis; imbalanced data problem; over- and under-sampling; stomach cancer detection.

Figures

Figure 1. G-means of HUSDOS-Boost and RUSBoost vs. #Minority.
Figure 2. G-means of HUSDOS-Boost and SMOTE vs. No.
Figure 3. G-means of HUSDOS-Boost and RUSBoost vs. Nu.
Figure 4. ROC of HUSDOS-Boost and RUSBoost.
Figure 5. PRC of HUSDOS-Boost and RUSBoost.
Figure 6. Variable importance: HUSDOS-Boost (left) and RUSBoost (right).
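
The G-mean named in the figure captions is commonly defined as the geometric mean of sensitivity and specificity. The sketch below assumes that usual definition (the page itself does not spell it out) and uses a hypothetical helper `g_mean` to show why the metric is preferred over accuracy on imbalanced data.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Two positives among ten examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Predicting "negative" for everyone is 80% accurate but finds no patients.
always_negative = [0] * 10
print(g_mean(y_true, always_negative))  # 0.0

# Finding one of two positives with two false alarms:
# sensitivity = 0.5, specificity = 0.75.
mixed = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(round(g_mean(y_true, mixed), 4))  # 0.6124
```

Because the two rates are multiplied, a classifier that ignores the minority class scores zero regardless of how large the majority class is.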
