Lancet Digit Health. 2024 Jan;6(1):e12-e22.
doi: 10.1016/S2589-7500(23)00225-X.

Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study

Free article

Travis Zack et al. Lancet Digit Health. 2024 Jan.

Erratum in

Abstract

Background: Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also pose a danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care.

Methods: Using the Azure OpenAI application interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain, namely medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups.
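
As an illustration of the kind of comparison described in the Methods, the sketch below runs a chi-square goodness-of-fit test of demographic counts in model-generated vignettes against reference prevalence proportions. The abstract does not specify which tests the authors used, so this is an assumed example rather than the study's method. The function names and all numbers are hypothetical.

```python
# Illustrative sketch only: function names, counts, and reference proportions
# are hypothetical and are not taken from the study.

# Chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_gof(observed_counts, reference_props):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit test.

    observed_counts: how often each demographic group appeared in the
        generated vignettes for one condition.
    reference_props: the reference share of each group (must sum to 1).
    """
    if len(observed_counts) != len(reference_props):
        raise ValueError("need one reference proportion per category")
    if any(p <= 0 for p in reference_props):
        raise ValueError("reference proportions must be positive")
    if abs(sum(reference_props) - 1.0) > 1e-9:
        raise ValueError("reference proportions must sum to 1")

    total = sum(observed_counts)
    statistic = 0.0
    for observed, prop in zip(observed_counts, reference_props):
        expected = total * prop  # count expected under the reference
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(observed_counts) - 1


def differs_from_reference(observed_counts, reference_props):
    """True if the observed distribution differs at the 5% level."""
    statistic, dof = chi_square_gof(observed_counts, reference_props)
    return statistic > CRITICAL_VALUES_05[dof]


if __name__ == "__main__":
    # Hypothetical: 100 generated vignettes, 90 in group A and 10 in group B,
    # against a reference split of 60% / 40%.
    stat, dof = chi_square_gof([90, 10], [0.6, 0.4])
    print(f"chi-square = {stat:.2f}, df = {dof}")
    print("differs from reference:", differs_from_reference([90, 10], [0.6, 0.4]))
```

A fixed critical-value table keeps the sketch dependency-free. A statistics library would normally be used to obtain exact p values and to handle small expected counts.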

Findings: We found that GPT-4 did not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardised clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and genders. Assessment and plans created by the model showed significant association between demographic attributes and recommendations for more expensive procedures as well as differences in patient perception.

Interpretation: Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies before clinical implementation.

Funding: Priscilla Chan and Mark Zuckerberg.

Conflict of interest statement

Declaration of interests: TZ reports no external financial interests; he works in an unpaid role as a clinical consultant with Xyla. EL reports personal fees and equity from Xyla. MS reports personal fees from Xyla and serves as an intern at Microsoft Research. LAC reports travel support from Australia New Zealand College of Intensive Care Medicine, cloud credits from Oracle, Amazon, and Google, and a role as Editor-in-Chief of PLOS Digital Health. JG reports support from the US National Science Foundation (grant #1928481), Radiological Society of North America (grant #EIHD2204), National Institutes of Health (grants 75N92020C00008 and 75N920), AIM-AHEAD, DeepLook, Clarity consortium, and GE Edison; received honoraria from the National Bureau of Economic Research; and has leadership roles with SIIM, HL7, and the ACR Advisory Committee. R-EEA is an employee of Massachusetts Medical Society, which owns NEJM Healer (NEJM Healer cases were used in the study). DWB reports grants and personal fees from EarlySense; personal fees from CDI Negev; equity from ValeraHealth, Clew, MDClone, and Guided Clinical Solutions; personal fees and equity from AESOP and Feelbetter; and grants from IBM Watson Health, outside the submitted work. DWB also has a patent pending (PHC-028564US PCT) on intraoperative clinical decision support.
AJB is a cofounder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation and in the recent past, to Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson & Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. AJB also receives royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii and Personalis. AJB's research has been funded by the National Institutes of Health, Peraton (as the prime on a National Institutes of Health contract), Genentech, Johnson & Johnson, US Food and Drug Administration, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor's Office of Planning and Research, California Institute for Regenerative Medicine, L’Oreal, and Progenity. EA reports personal fees from Canopy Innovations, Fourier Health, and Xyla; and grants from Microsoft Research. 
None of these entities had any role in the design, execution, evaluation, or writing of this manuscript. All other authors declare no competing interests.

Comment in
