Transl Vis Sci Technol. 2021 Feb 5;10(2):13. doi: 10.1167/tvst.10.2.13.

Addressing Artificial Intelligence Bias in Retinal Diagnostics

Philippe Burlina et al.
Abstract

Purpose: This study evaluated generative methods to mitigate potential artificial intelligence (AI) bias in diagnosing diabetic retinopathy (DR) that results from training data imbalance or domain generalization, which occurs when deep learning systems (DLSs) face concepts at test/inference time that they were not trained on.

Methods: The public domain Kaggle EyePACS dataset (88,692 fundus images from 44,346 individuals, originally diverse in ethnicity) was modified by adding clinician-annotated labels and by constructing an artificial scenario of data imbalance and domain generalization: training (but not testing) exemplars were disallowed for images of retinas with DR warranting referral (DR-referable) from darker-skin individuals, who presumably have, on average, a greater concentration of melanin within uveal melanocytes, contributing to retinal image pigmentation. A traditional/baseline diagnostic DLS was compared against new DLSs whose training data were augmented via generative models for debiasing.
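The data-imbalance scenario described above can be sketched in a few lines: DR-referable exemplars from darker-skin individuals are excluded from the training split only, while the test split retains all subpopulations. This is a minimal illustration, not the study's actual code; the record fields (`individual_id`, `dr_referable`, `skin`) are hypothetical names chosen for clarity.

```python
def build_imbalanced_split(records, train_ids):
    """Split records into train/test, disallowing DR-referable
    darker-skin exemplars from the training set only.

    records   -- list of dicts with hypothetical fields
                 'individual_id', 'dr_referable', 'skin'
    train_ids -- set of individual IDs assigned to training
    """
    train, test = [], []
    for rec in records:
        if rec["individual_id"] in train_ids:
            # Exclude the protected-subpopulation positives from training,
            # creating the artificial imbalance/domain-generalization condition.
            if not (rec["dr_referable"] and rec["skin"] == "darker"):
                train.append(rec)
        else:
            # The test set keeps every subpopulation, so the DLS is
            # evaluated on concepts it never saw during training.
            test.append(rec)
    return train, test
```

Splitting by individual ID (rather than by image) keeps both fundus images of the same person on the same side of the split, which avoids leakage between train and test.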

Results: Accuracy (95% confidence interval [CI]) of the baseline diagnostic DLS was 73.0% (66.9% to 79.2%) for fundus images of lighter-skin individuals versus 60.5% (53.5% to 67.3%) for darker-skin individuals, demonstrating bias/disparity (delta = 12.5%; Welch t-test t = 2.670, P = 0.008) in AI performance across protected subpopulations. Using novel generative methods to address the missing subpopulation training data (DR-referable darker-skin) instead achieved an accuracy of 72.0% (65.8% to 78.2%) for lighter-skin and 71.5% (65.2% to 77.8%) for darker-skin individuals, demonstrating closer parity (delta = 0.5%) in accuracy across subpopulations (Welch t-test t = 0.111, P = 0.912).

Conclusions: Findings illustrate how data imbalance and domain generalization can lead to disparity of accuracy across subpopulations, and show that novel generative methods of synthetic fundus images may play a role in debiasing AI.

Translational relevance: These new AI methods may be applied to address potential AI bias in DR diagnostics arising from fundus pigmentation, and potentially in other ophthalmic DLSs as well.


Conflict of interest statement

Disclosure: P. Burlina, None; N. Joshi, None; W. Paul, None; K. D. Pacheco, None; N. M. Bressler, None

Figures

Figure 1.
Each panel shows a right-versus-left pair of synthetically created images demonstrating alterations made automatically by our generative methods, which take a synthetic retinal image as input (left) and generate a new retinal image of an individual with diabetic retinopathy warranting referral to a health care provider and of a darker-skin individual, as defined in the Methods section. These generative methods are used in this study to generate images that are originally missing from the training set (i.e., of referable DR from darker-skin individuals), a condition of imbalance and bias. Pairs (a1) to (a3) illustrate how the proposed generative methods take an input retinal image (left) of a darker-skin individual and accentuate the attribute "DR-referable" in the output image (right), while leaving the coloration reflective of melanin concentration within the uveal melanocytes, and all other markers, unchanged. The first pair (a1) starts from a retina that is not referable but of a darker-skin individual (left image has DR level 0 or 1, i.e., no or mild DR) and converts it into one that is referable (right image has DR level 2, i.e., moderate DR) while minimally changing other attributes of the retina (the right image is also of a darker-skin individual, and the vasculature is unchanged). Likewise, the left image of pair (a2) shows a retina of a darker-skin individual that is not referable (DR level 0 or 1); our method then accentuates the referable attribute to make it referable (right image has DR level 2). The same explanation applies to (a3).
Pairs (b1) to (b3) instead demonstrate our complementary approach: taking as input retinal images that are already referable (left images) and altering them to accentuate the attribute "darker-skin individual," while preserving the DR lesions as well as the vasculature, to generate output images that are both referable and of darker-skin individuals (right images). The input in (b1) is already referable and already has a higher concentration of melanin within the uveal melanocytes; the method visibly accentuates melanin concentration in the output, and both input (left) and output (right) have moderate DR (DR level 2). In (b2), the left image is of a lighter-skin individual and already referable; our method generates a related image of a darker-skin individual while preserving the DR level, as both images show visibly unchanged level 2 DR, with potential retinal hemorrhages seen. Pair (b3) is a similar example: the left retinal image, of a lighter-skin individual with referable DR, is turned into the right retinal image of a darker-skin individual without altering the DR level (again, both images show apparent DR level 2 with retinal hemorrhages).
Figure 2.
This figure details the flow chart for the debiasing algorithmic and experimental pipeline.
Figure 3.
Receiver operating characteristic (ROC) curves for the lighter-skin and darker-skin populations, for both the baseline and the debiased DLSs (the retinal appearance-optimized and the DR-optimized approaches). DS, darker skin; LS, lighter skin.
