Review

Nat Biomed Eng. 2023 Jun;7(6):719-742. doi: 10.1038/s41551-023-01056-8. Epub 2023 Jun 28.

Algorithmic fairness in artificial intelligence for medicine and healthcare

Richard J Chen et al. Nat Biomed Eng. 2023 Jun.

Abstract

In healthcare, the development and deployment of insufficiently fair systems of artificial intelligence (AI) can undermine the delivery of equitable care. Assessments of AI models stratified across subpopulations have revealed inequalities in how patients are diagnosed, treated and billed. In this Perspective, we outline fairness in machine learning through the lens of healthcare, and discuss how algorithmic biases (in data acquisition, genetic variation and intra-observer labelling variability, in particular) arise in clinical workflows and the resulting healthcare disparities. We also review emerging technology for mitigating biases via disentanglement, federated learning and model explainability, and their role in the development of AI-based software as a medical device.


Figures

Fig. 1 | Connecting healthcare disparities and dataset shifts to algorithm fairness.
a, Population shift as a result of genetic variation and of other population-specific phenotypes across subpopulations. Current AI algorithms for the diagnosis of skin cancer using dermoscopic and macroscopic photographs may be developed on datasets that underrepresent darker skin types, which may exacerbate health disparities in some geographic regions. In developing algorithms using datasets that overrepresent individuals with European ancestry, the prevalence of certain mutations may also differ between the training and test distributions. This is the case for disparities in EGFR-mutation frequencies across European and Asian populations. b, Population shifts and prevalence shifts resulting from disparities in social determinants of health. Differences in healthcare access may result in delayed referrals, later-stage disease diagnoses and worsened mortality rates. c, Concept shift as a result of the ongoing refinement of medical-classification systems, such as the recategorization of stroke, previously defined under diseases of the circulatory system in ICD-10 and now defined under neurological disorders in ICD-11. In other taxonomies, such as the Banff classification system for renal-allograft assessment, which updates its diagnostic criteria approximately every two years, applying the post-2018 Banff criteria to borderline cases of T-cell-mediated rejection (TCMR) would classify all i0,t1-score biopsies as ‘normal’. d, Acquisition shift as a result of differing data-curation protocols (associated with the use of different MRI/CT scanners, radiation dosages, sample-preparation protocols or image-acquisition parameters), which may induce batch effects in the data. e, Novel or insufficiently understood occurrences, such as interactions between the SARS-CoV-2 virus and lung cancer, may result in new types of dataset shift, such as open-set label shift. f, Global-health challenges in the deployment of AI-SaMDs in low- and middle-income countries include resource constraints, such as limited GPU resources and a lack of digitization of medical records and other health data, as well as dataset-shift barriers such as differing demographics, disease prevalence, classification systems and data-curation protocols. Group-fairness criteria may also be difficult to satisfy when AI-SaMD deployment faces constraints on access to protected health information.
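A brief formal gloss may help connect the panels to the standard dataset-shift taxonomy; the notation below is ours rather than the article's, with p_train and p_test denoting the joint distributions of inputs x and labels y during model development and deployment, respectively.

```latex
% Notation is ours; a standard taxonomy of dataset shift, matching the panels above.
\begin{align*}
\text{Population/covariate shift (a, b):} \quad & p_{\mathrm{train}}(x) \neq p_{\mathrm{test}}(x), \qquad p_{\mathrm{train}}(y \mid x) = p_{\mathrm{test}}(y \mid x) \\
\text{Prevalence (label) shift (b):} \quad & p_{\mathrm{train}}(y) \neq p_{\mathrm{test}}(y), \qquad p_{\mathrm{train}}(x \mid y) = p_{\mathrm{test}}(x \mid y) \\
\text{Concept shift (c):} \quad & p_{\mathrm{train}}(y \mid x) \neq p_{\mathrm{test}}(y \mid x) \\
\text{Acquisition shift (d):} \quad & x = a_{s}(z), \text{ where the measurement process } a_{s} \text{ differs by site } s \\
\text{Open-set label shift (e):} \quad & \mathcal{Y}_{\mathrm{test}} \not\subseteq \mathcal{Y}_{\mathrm{train}} \text{ (previously unseen classes appear at test time)}
\end{align*}
```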
Fig. 2 | Strategies for mitigating disparate impact.
a, For samples that are under-represented in the training dataset relative to the test dataset, importance weighting can be applied to reweight the infrequent samples so that the training distribution better matches the test distribution. The schematic shows that, before importance reweighting, a model that overfits to samples with a low tumour volume in the training distribution (blue) underfits a test distribution that has more cases with large tumour volumes. For the model to better fit the test distribution, importance reweighting can be used to increase the importance of the cases with large tumour volumes (denoted by larger image sizes). b, To remove protected attributes from the representation space of structured data (CT imaging data or text data such as intensive care unit (ICU) notes), deep-learning algorithms can be further supervised with the protected attribute as a target label whose prediction loss is maximized. Such strategies are also referred to as ‘debiasing’. Clinical images can include subtle biases that may leak protected-attribute information, such as age, gender and self-reported race, as has been shown for fundus photography and chest radiography. Y and A denote, respectively, the model’s outcome and a protected attribute. LSTM, long short-term memory; MLP, multilayer perceptron.
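The two strategies in this caption can be sketched in code. The sketch below is minimal and illustrative, assuming PyTorch: density-ratio importance weights rescale the per-sample task loss, and a protected-attribute head trained through a gradient-reversal layer penalizes the encoder for leaking A, which is one common way to realize "maximizing the attribute loss". The names GradReverse, DebiasedModel and training_step are ours, not the paper's.

```python
# Sketch (not from the paper): importance weighting + adversarial debiasing.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DebiasedModel(nn.Module):
    def __init__(self, d_in, d_hidden, n_classes, n_attrs):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.task_head = nn.Linear(d_hidden, n_classes)       # predicts outcome Y
        self.attr_head = nn.Linear(d_hidden, n_attrs)         # predicts protected attribute A

    def forward(self, x, lam=1.0):
        z = self.encoder(x)
        y_logits = self.task_head(z)
        a_logits = self.attr_head(GradReverse.apply(z, lam))  # adversarial branch
        return y_logits, a_logits

def training_step(model, x, y, a, importance_w, lam=1.0):
    """importance_w ~ p_test(x) / p_train(x): upweights cases rare in training."""
    y_logits, a_logits = model(x, lam)
    task_loss = (importance_w *
                 nn.functional.cross_entropy(y_logits, y, reduction="none")).mean()
    # Minimizing this term *maximizes* the attribute loss with respect to the
    # encoder, because gradients are reversed before reaching it.
    adv_loss = nn.functional.cross_entropy(a_logits, a)
    return task_loss + adv_loss
```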
Fig. 3 | Genetic drift as population shift.
Demographic characteristics and gene-mutation frequencies for EGFR in patients with lung adenocarcinoma in the TCGA-LUAD and PIONEER cohorts. Of the 566 patients with lung adenocarcinoma in the TCGA, only 1.4% (n = 8) self-reported as ‘Asian’; in the PIONEER cohort, 1,482 patients did. The PIONEER study included a more fine-grained characterization of self-reported ethnicity and nationality: Mandarin Chinese, Cantonese, Taiwanese, Vietnamese, Thai, Filipino and Indian. Because of the underrepresentation of Asian patients in the TCGA, the mutation frequency for EGFR (commonly used to guide treatment with tyrosine kinase inhibitors) among these Asian patients was only 37.5% (n = 3). For the PIONEER cohort, the overall EGFR-mutation frequency for all Asian patients was 51.4% (n = 653), and different ethnic subpopulations had different EGFR-mutation frequencies.
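The subgroup frequencies quoted above are simple count ratios; the minimal Python sketch below (illustrative only) reproduces the TCGA figure, where 3 EGFR-mutant cases among the 8 self-reported Asian patients give 37.5%. The caption does not state the PIONEER denominator behind the 51.4% figure, so it is not reproduced here.

```python
# Minimal sketch: subgroup mutation frequency = mutated carriers / subgroup size.
def mutation_frequency(n_mutated: int, n_subgroup: int) -> float:
    return 100.0 * n_mutated / n_subgroup

# TCGA-LUAD figures quoted in the caption: 8 self-reported Asian patients,
# 3 of whom carried an EGFR mutation.
print(f"{mutation_frequency(3, 8):.1f}%")  # 37.5%
```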
Fig. 4 | Dataset shifts in the deployment of AI-SaMDs as clinical-grade AI algorithms.
a, Examples of site-specific H&E stain variability across different whole-slide scanners, resulting in variable histological tissue appearance. b, Example of variations in CT scans acquired at two different centres. The histograms show the radiointensity of normal liver tissue and of liver lesions. Owing to differences in acquisition protocols, the CT values of normal liver from one centre may overlap substantially with the tumour values from another centre.
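One way to make the acquisition shift in panel b concrete is to compare per-centre intensity histograms for the same tissue class. The sketch below is illustrative only: the function name and the synthetic Hounsfield-unit values are ours, not the article's data, and it simply computes a histogram-overlap coefficient between normal liver at one centre and lesions at another.

```python
# Illustrative sketch (not from the paper): quantify acquisition shift between
# two centres via the overlap of intensity histograms for the same tissue class.
import numpy as np

def histogram_overlap(values_a, values_b, bins=64, value_range=(-200.0, 300.0)):
    """Overlap coefficient in [0, 1]; 1 means identical normalized histograms."""
    h_a, _ = np.histogram(values_a, bins=bins, range=value_range, density=True)
    h_b, _ = np.histogram(values_b, bins=bins, range=value_range, density=True)
    bin_width = (value_range[1] - value_range[0]) / bins
    return float(np.minimum(h_a, h_b).sum() * bin_width)

# Hypothetical CT values (Hounsfield units) for normal liver at centre 1 versus
# liver lesions at centre 2; a high overlap signals that a threshold or model
# fitted at one centre may not transfer to the other.
rng = np.random.default_rng(0)
liver_centre1 = rng.normal(60, 10, 5000)
lesion_centre2 = rng.normal(45, 15, 5000)
print(f"overlap = {histogram_overlap(liver_centre1, lesion_centre2):.2f}")
```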
Fig. 5 | A decentralized framework that integrates federated learning with adversarial learning and disentanglement.
In addition to aiding the development of algorithms using larger and more diverse patient populations, federated learning can be integrated with many techniques in representation learning and in unsupervised domain adaptation that can learn in the presence of unobserved protected attributes. In federated learning, global and local weights are shared between the global server and the local clients (such as different hospitals in different countries), each with different datasets of whole-slide images (WSIs) and image patches. Different domain-adaptation methods can be used with federated learning. In federated adversarial debiasing (FADE), the client IDs were used as protected attributes, and adversarial learning was used to debias the representation so that it did not vary with geographic region (red). In FedDis, shape and appearance features in brain MRI scans were disentangled, with only the shape parameters shared between clients (orange). In federated adversarial domain adaptation (FADA), disentanglement and adversarial learning were used to further mitigate domain shifts across clients (red and orange). Federated learning can also be used in combination with style transfer, synthetic-data generation and image normalization; in these cases, domain-adapted target data or features would need to be shared, or other techniques employed (green). Y and A denote, respectively, the model’s outcome and a protected attribute.
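Below is a minimal sketch, assuming PyTorch and reusing the DebiasedModel/training_step sketch from Fig. 2, of how federated averaging could be combined with client-level adversarial debiasing of the kind described above. The helpers local_update and federated_round are ours and omit communication, privacy and disentanglement details; this is not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): one round of federated
# averaging in which each client trains a locally debiased model.
import copy
import torch

def local_update(global_model, loader, epochs=1, lr=1e-3, lam=1.0):
    """Client-side training; the client ID (or region) can serve as attribute A."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y, a, w in loader:   # features, label, attribute, importance weight
            opt.zero_grad()
            training_step(model, x, y, a, w, lam).backward()
            opt.step()
    return model.state_dict(), len(loader.dataset)

def federated_round(global_model, client_loaders):
    """FedAvg: average client parameters, weighted by client dataset size."""
    states, sizes = zip(*(local_update(global_model, dl) for dl in client_loaders))
    total = float(sum(sizes))
    avg = {k: sum(s[k].float() * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```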
