Review

The path toward equal performance in medical machine learning

Eike Petersen et al. Patterns (N Y). 2023 Jul 14;4(7):100790. doi: 10.1016/j.patter.2023.100790

Abstract

To ensure equitable quality of care, differences in machine learning model performance between patient groups must be addressed. Here, we argue that two separate mechanisms can cause performance differences between groups. First, model performance may be worse than theoretically achievable in a given group. This can occur due to a combination of group underrepresentation, modeling choices, and the characteristics of the prediction task at hand. We examine scenarios in which underrepresentation leads to underperformance, scenarios in which it does not, and the differences between them. Second, the optimal achievable performance may also differ between groups due to differences in the intrinsic difficulty of the prediction task. We discuss several possible causes of such differences in task difficulty. In addition, challenges such as label biases and selection biases may confound both learning and performance evaluation. We highlight consequences for the path toward equal performance, and we emphasize that leveling up model performance may require gathering not only more data from underperforming groups but also better data. Throughout, we ground our discussion in real-world medical phenomena and case studies while also referencing relevant statistical theory.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. Illustrations of different cases of binary classification under group underrepresentation. Circles and crosses denote the two possible outcomes (values of y); blue (majority) and red (minority) mark the two patient groups of interest. The variables x1 and x2 denote model inputs. (A) Group underrepresentation is not problematic if the same decision boundary is optimal for all groups. (B) If the optimal decision boundaries differ between groups, and either the model or the input data are not sufficiently expressive to capture the optimal decision boundaries for all groups simultaneously, standard (empirical risk minimizing) learning approaches will optimize for performance in the majority group (here, the blue group). (C) An expressive model could learn a decision boundary (red) that is optimal for both groups. In practice, however, it is unclear whether a training procedure will indeed identify this optimal boundary, due to inductive biases, local optimization schemes, and limited dataset size for the minority groups, all combined with standard empirical risk minimization, which prioritizes performance in the majority group.
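The scenario in (B) is straightforward to reproduce in a small simulation. The sketch below is not from the paper; the group sizes, Gaussian data distributions, and choice of logistic regression are all illustrative assumptions. Two synthetic 2D groups have oppositely tilted optimal linear boundaries, the minority group is heavily underrepresented, and a single linear model fit by standard empirical risk minimization on the pooled data ends up tracking the majority group's optimum.

```python
# Minimal sketch of Figure 1B under illustrative assumptions (synthetic
# Gaussian data, hypothetical group sizes, linear model).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, shift):
    """Two classes separated along x1, plus a group-specific signal along x2
    that tilts the group's optimal decision boundary."""
    y = rng.integers(0, 2, n)
    x = rng.normal(size=(n, 2))
    x[:, 0] += (2 * y - 1) * 1.0      # shared class signal along x1
    x[:, 1] += (2 * y - 1) * shift    # group-specific signal along x2
    return x, y

# Majority ("blue") group: 950 samples; minority ("red"): 50 samples,
# with the opposite boundary tilt.
x_blue, y_blue = make_group(950, shift=+1.0)
x_red, y_red = make_group(50, shift=-1.0)

X = np.vstack([x_blue, x_red])
y = np.concatenate([y_blue, y_red])

# Standard empirical risk minimization over the pooled data:
clf = LogisticRegression().fit(X, y)

print("majority accuracy:", clf.score(x_blue, y_blue))
print("minority accuracy:", clf.score(x_red, y_red))
# The single learned linear boundary tracks the majority group's optimum,
# so minority accuracy is systematically lower, as in Figure 1B.
```

Because the majority group contributes 19 of every 20 terms in the empirical risk, the fitted boundary is close to its group optimum; the minority group pays the cost of the model's limited expressiveness.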
Figure 2. Illustrations of different causes of performance disparities in binary classification. Circles and crosses denote the two possible outcomes (values of y); blue and red mark the two patient groups of interest. The variables x1 and x2 denote model inputs. (A) Higher levels of input noise will lead to worse classification performance in the red group compared with the blue group. This might be a symptom of an unobserved cause of the outcome that is more influential in the red group than in the blue group; cf. (B). (B) Without knowledge of the additional variable v, the blue group can be correctly classified based on x alone (dotted line). This is not possible for the red group, which requires a decision boundary that takes the additional variable v into account (dashed line). (C) Completely random label noise will lead to worse performance metric estimates in the red group compared with the blue group, even though model performance with respect to the true labels is identical. The empty circle indicates a true circle mislabeled as a cross; the star indicates the inverse. (D) Systematic label errors will lead to worse model performance (with respect to the true outcome labels) in the red group compared with the blue group, because a suboptimal decision boundary (red) is learned instead of the optimal one (gray). If the same systematic label errors are present in the test set, this underperformance is undetectable.
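The effect in (C) can be demonstrated without any training at all. In the sketch below (illustrative assumptions throughout, not the paper's experiment), a fixed classifier is exactly 90% accurate against the true labels in both groups; scoring it against reference labels corrupted by symmetric random noise at rate p shifts the measured accuracy from a to a(1 - p) + (1 - a)p, i.e., from 0.90 to about 0.74 for a = 0.9 and p = 0.2, even though true performance is identical in both groups.

```python
# Minimal sketch of Figure 2C: random label noise in the evaluation labels
# biases the measured accuracy downward. Noise rates are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# True labels, and predictions that are 90% accurate in BOTH groups.
y_true = rng.integers(0, 2, n)
correct = rng.random(n) < 0.90
y_pred = np.where(correct, y_true, 1 - y_true)

def measured_accuracy(flip_rate):
    """Accuracy estimated against labels corrupted by symmetric noise."""
    flipped = rng.random(n) < flip_rate
    y_obs = np.where(flipped, 1 - y_true, y_true)
    return np.mean(y_pred == y_obs)

print("blue group (no label noise):", measured_accuracy(0.00))   # ~0.90
print("red group (20% label noise):", measured_accuracy(0.20))   # ~0.74
# True performance is identical in both groups; only the red group's
# metric estimate is degraded by the noisy reference labels.
```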


