Review

Eur Radiol. 2024 Dec;34(12):7895-7903. doi: 10.1007/s00330-024-10834-0. Epub 2024 Jun 11.

Class imbalance on medical image classification: towards better evaluation practices for discrimination and calibration performance

Candelaria Mosquera et al. Eur Radiol. 2024 Dec.

Abstract

Purpose: This work aims to assess standard evaluation practices used by the research community for evaluating medical imaging classifiers, with a specific focus on the implications of class imbalance. The analysis is performed on chest X-rays as a case study and encompasses a comprehensive model performance definition, considering both discriminative capabilities and model calibration.

Materials and methods: We conduct a concise literature review to examine prevailing scientific practices used when evaluating X-ray classifiers. Then, we perform a systematic experiment on two major chest X-ray datasets to showcase a didactic example of the behavior of several performance metrics under different class ratios and highlight how widely adopted metrics can conceal performance in the minority class.
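To make the class-ratio effect concrete, the following minimal sketch (not the authors' experiment, and using synthetic scores rather than chest X-ray predictions) keeps a hypothetical classifier's per-class score distributions fixed while the prevalence of the positive class is varied: ROC AUC barely moves, while AUC-PR degrades as the minority class becomes rarer.

```python
# Illustrative sketch (not the authors' experiment): the per-class score
# distributions of a hypothetical classifier are kept fixed while the positive
# (minority) class prevalence is varied. ROC AUC is essentially unchanged,
# whereas AUC-PR drops as positives become rarer.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
N = 100_000  # total number of synthetic cases

def sample_scores(n_pos, n_neg):
    # Positives score higher on average; the overlap makes the problem imperfect.
    pos = rng.normal(0.70, 0.15, n_pos)
    neg = rng.normal(0.40, 0.15, n_neg)
    y_true = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
    y_prob = np.clip(np.r_[pos, neg], 0.0, 1.0)
    return y_true, y_prob

for prevalence in (0.5, 0.1, 0.01):
    n_pos = int(N * prevalence)
    y_true, y_prob = sample_scores(n_pos, N - n_pos)
    print(f"prevalence={prevalence:5.2f}  "
          f"AUC-ROC={roc_auc_score(y_true, y_prob):.3f}  "
          f"AUC-PR={average_precision_score(y_true, y_prob):.3f}")
```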

Results: Our literature study confirms that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) it is still uncommon to include calibration studies for chest X-ray classifiers, despite their importance in the context of healthcare. Moreover, our systematic experiments confirm that current evaluation practices may not reflect model performance in real clinical scenarios, and we suggest complementary metrics that better reflect system performance in such scenarios.

Conclusion: Our analysis underscores the need for enhanced evaluation practices, particularly in the context of class-imbalanced chest X-ray classifiers. We recommend the inclusion of complementary metrics such as the area under the precision-recall curve (AUC-PR), adjusted AUC-PR, and balanced Brier score, to offer a more accurate depiction of system performance in real clinical scenarios, considering metrics that reflect both discrimination and calibration performance.
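As an illustration of the recommended complementary metrics, the sketch below computes AUC-PR with scikit-learn plus two plausible variants of the other quantities. The exact definitions of "adjusted AUC-PR" and "balanced Brier score" used in the paper may differ; the formulas here (prevalence-rescaled AUC-PR and the mean of per-class Brier scores) are assumptions, not the authors' implementation.

```python
# Hedged sketch of the complementary metrics named above, for a binary
# classifier with labels y_true (0/1) and predicted probabilities y_prob.
# "adjusted AUC-PR" and "balanced Brier score" use assumed definitions
# (prevalence-rescaled AUC-PR; mean of per-class Brier scores), which may
# differ from the paper's.
import numpy as np
from sklearn.metrics import average_precision_score

def auc_pr(y_true, y_prob):
    # Area under the precision-recall curve (average precision).
    return average_precision_score(y_true, y_prob)

def adjusted_auc_pr(y_true, y_prob):
    # Assumed prevalence correction: a chance-level classifier (whose AUC-PR
    # equals the prevalence) maps to 0, a perfect classifier maps to 1.
    prevalence = float(np.mean(y_true))
    return (auc_pr(y_true, y_prob) - prevalence) / (1.0 - prevalence)

def balanced_brier_score(y_true, y_prob):
    # Assumed class-balanced calibration summary: mean squared error computed
    # separately on negatives and positives, then averaged, so the majority
    # class cannot dominate the score.
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    per_class = [np.mean((y_prob[y_true == c] - c) ** 2) for c in (0, 1)]
    return float(np.mean(per_class))

# Toy usage with a highly imbalanced synthetic test set (about 1% positives).
rng = np.random.default_rng(0)
y_true = (rng.random(20_000) < 0.01).astype(int)
y_prob = np.clip(0.6 * y_true + 0.2 * rng.random(20_000) + 0.1, 0.0, 1.0)
print(auc_pr(y_true, y_prob),
      adjusted_auc_pr(y_true, y_prob),
      balanced_brier_score(y_true, y_prob))
```

With these assumed definitions, a chance-level classifier receives an adjusted AUC-PR of roughly 0 regardless of prevalence, and the balanced Brier score weights calibration on the rare positive class equally with the abundant negative class.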

Clinical relevance statement: This study underscores the critical need for refined evaluation metrics in medical imaging classifiers, emphasizing that prevalent metrics may mask poor performance in minority classes, potentially impacting clinical diagnoses and healthcare outcomes.

Key points: Common scientific practices in papers dealing with X-ray computer-assisted diagnosis (CAD) systems may be misleading. We highlight limitations in reporting of evaluation metrics for X-ray CAD systems in highly imbalanced scenarios. We propose adopting alternative metrics based on experimental evaluation on large-scale datasets.

Keywords: Computer-assisted diagnosis; Deep learning; Machine learning; Prevalence; X-rays.

Conflict of interest statement

Compliance with ethical standards

Guarantor: The scientific guarantor of this publication is Candelaria Mosquera.

Conflict of interest: At the time of submission, Candelaria Mosquera is employed by Abi Global Health. The other authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Statistics and biometry: No complex statistical methods were necessary for this paper.

Informed consent: Written informed consent was not required for this study because we performed a retrospective analysis using publicly available datasets of chest X-ray images.

Ethical approval: Institutional Review Board approval was not required for this study because we performed a retrospective analysis using publicly available datasets of chest X-ray images.

Study subjects or cohorts overlap: The study subjects are those reported in the publicly available datasets CheXpert and ChestX-ray 14 (NIH).

Methodology: Retrospective, experimental, multicenter study
