Review

The path toward equal performance in medical machine learning

Eike Petersen et al. Patterns (N Y). 2023 Jul 14;4(7):100790. doi: 10.1016/j.patter.2023.100790

Abstract

To ensure equitable quality of care, differences in machine learning model performance between patient groups must be addressed. Here, we argue that two separate mechanisms can cause performance differences between groups. First, model performance may be worse than theoretically achievable in a given group. This can occur due to a combination of group underrepresentation, modeling choices, and the characteristics of the prediction task at hand. We examine scenarios in which underrepresentation leads to underperformance, scenarios in which it does not, and the differences between them. Second, the optimal achievable performance may also differ between groups due to differences in the intrinsic difficulty of the prediction task. We discuss several possible causes of such differences in task difficulty. In addition, challenges such as label biases and selection biases may confound both learning and performance evaluation. We highlight consequences for the path toward equal performance, and we emphasize that leveling up model performance may require gathering not only more data from underperforming groups but also better data. Throughout, we ground our discussion in real-world medical phenomena and case studies while also referencing relevant statistical theory.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. Illustrations of different cases of binary classification under group underrepresentation. Circles and crosses denote the two possible outcomes (values of y); blue (majority) and red (minority) mark the two patient groups of interest. The variables x1 and x2 denote model inputs. (A) Group underrepresentation is not problematic if the same decision boundary is optimal for all groups. (B) If the optimal decision boundaries differ between groups, and either the model or the input data are not sufficiently expressive to capture the optimal decision boundaries for all groups simultaneously, standard (empirical risk minimizing) learning approaches will optimize for performance in the majority group (here, the blue group). (C) An expressive model could learn a decision boundary (red) that is optimal for both groups. In practice, however, it is unclear whether a training procedure will indeed identify this optimal boundary, due to inductive biases, local optimization schemes, and limited dataset size for the minority groups, all combined with standard empirical risk minimization, which prioritizes performance in the majority group.
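The scenario in (B) is straightforward to reproduce in a small simulation. The sketch below is not from the paper; the group sizes, Gaussian data distributions, and choice of logistic regression are all illustrative assumptions. Two synthetic 2D groups have oppositely tilted optimal linear boundaries, the minority group is heavily underrepresented, and a single linear model fit by standard empirical risk minimization on the pooled data ends up tracking the majority group's optimum.

```python
# Minimal sketch of Figure 1B under illustrative assumptions (synthetic
# Gaussian data, hypothetical group sizes, linear model).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, shift):
    """Two classes separated along x1, plus a group-specific signal along x2
    that tilts the group's optimal decision boundary."""
    y = rng.integers(0, 2, n)
    x = rng.normal(size=(n, 2))
    x[:, 0] += (2 * y - 1) * 1.0      # shared class signal along x1
    x[:, 1] += (2 * y - 1) * shift    # group-specific signal along x2
    return x, y

# Majority ("blue") group: 950 samples; minority ("red"): 50 samples,
# with the opposite boundary tilt.
x_blue, y_blue = make_group(950, shift=+1.0)
x_red, y_red = make_group(50, shift=-1.0)

X = np.vstack([x_blue, x_red])
y = np.concatenate([y_blue, y_red])

# Standard empirical risk minimization over the pooled data:
clf = LogisticRegression().fit(X, y)

print("majority accuracy:", clf.score(x_blue, y_blue))
print("minority accuracy:", clf.score(x_red, y_red))
# The single learned linear boundary tracks the majority group's optimum,
# so minority accuracy is systematically lower, as in Figure 1B.
```

Because the majority group contributes 19 of every 20 terms in the empirical risk, the fitted boundary is close to its group optimum; the minority group pays the cost of the model's limited expressiveness.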
Figure 2. Illustrations of different causes of performance disparities in binary classification. Circles and crosses denote the two possible outcomes (values of y); blue and red mark the two patient groups of interest. The variables x1 and x2 denote model inputs. (A) Higher levels of input noise will lead to worse classification performance in the red group compared with the blue group. This might be a symptom of an unobserved cause of the outcome that is more influential in the red group than in the blue group; cf. (B). (B) Without knowledge of the additional variable v, the blue group can be correctly classified based on x alone (dotted line). This is not possible for the red group, which requires a decision boundary that takes the additional variable v into account (dashed line). (C) Completely random label noise will lead to worse performance metric estimates in the red group compared with the blue group, even though model performance with respect to the true labels is identical. The empty circle indicates a true circle mislabeled as a cross; the star indicates the inverse. (D) Systematic label errors will lead to worse model performance (with respect to the true outcome labels) in the red group compared with the blue group, because a suboptimal decision boundary (red) is learned instead of the optimal one (gray). If the same systematic label errors are present in the test set, this underperformance is undetectable.
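The effect in (C) can be demonstrated without any training at all. In the sketch below (illustrative assumptions throughout, not the paper's experiment), a fixed classifier is exactly 90% accurate against the true labels in both groups; scoring it against reference labels corrupted by symmetric random noise at rate p shifts the measured accuracy from a to a(1 - p) + (1 - a)p, i.e., from 0.90 to about 0.74 for a = 0.9 and p = 0.2, even though true performance is identical in both groups.

```python
# Minimal sketch of Figure 2C: random label noise in the evaluation labels
# biases the measured accuracy downward. Noise rates are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# True labels, and predictions that are 90% accurate in BOTH groups.
y_true = rng.integers(0, 2, n)
correct = rng.random(n) < 0.90
y_pred = np.where(correct, y_true, 1 - y_true)

def measured_accuracy(flip_rate):
    """Accuracy estimated against labels corrupted by symmetric noise."""
    flipped = rng.random(n) < flip_rate
    y_obs = np.where(flipped, 1 - y_true, y_true)
    return np.mean(y_pred == y_obs)

print("blue group (no label noise):", measured_accuracy(0.00))   # ~0.90
print("red group (20% label noise):", measured_accuracy(0.20))   # ~0.74
# True performance is identical in both groups; only the red group's
# metric estimate is degraded by the noisy reference labels.
```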


