Lancet Digit Health. 2024 Jan;6(1):e12-e22.

doi: 10.1016/S2589-7500(23)00225-X.

Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study

Affiliations

¹ Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA.
² Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
³ Department of Computer Science, Stanford University, Stanford, CA, USA; Stanford Law School, Stanford University, Stanford, CA, USA.
⁴ Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA.
⁵ Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA; Department of Biostatistics, Harvard T H Chan School of Public Health, Boston, MA, USA.
⁶ Department of Radiology, Emory University, Atlanta, GA, USA.
⁷ Department of Computer Science, Stanford University, Stanford, CA, USA; Department of Linguistics, Stanford University, Stanford, CA, USA.
⁸ Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA; Department of Health Policy and Management, Harvard T H Chan School of Public Health, Boston, MA, USA.
⁹ Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.
¹⁰ Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA; Center for Data-Driven Insights and Innovation, University of California, Office of the President, Oakland, CA, USA.
¹¹ Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA. Electronic address: ealsentzer@bwh.harvard.edu.

PMID: 38123252
DOI: 10.1016/S2589-7500(23)00225-X

Free article

Travis Zack et al. Lancet Digit Health. 2024 Jan.

Erratum in

Correction to Lancet Digit Health 2024; 6: e12-22.
[No authors listed]. Lancet Digit Health. 2024 Jul;6(7):e445. doi: 10.1016/S2589-7500(24)00120-1. PMID: 38906610. No abstract available.

Abstract

Background: Large language models (LLMs) such as GPT-4 hold great promise as transformative tools in health care, ranging from automating administrative tasks to augmenting clinical decision making. However, these models also pose a danger of perpetuating biases and delivering incorrect medical diagnoses, which can have a direct, harmful impact on medical care. We aimed to assess whether GPT-4 encodes racial and gender biases that impact its use in health care.

Methods: Using the Azure OpenAI application interface, this model evaluation study tested whether GPT-4 encodes racial and gender biases and examined the impact of such biases on four potential applications of LLMs in the clinical domain, namely medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment. We conducted experiments with prompts designed to resemble typical use of GPT-4 within clinical and medical education applications. We used clinical vignettes from NEJM Healer and from published research on implicit bias in health care. GPT-4 estimates of the demographic distribution of medical conditions were compared with true US prevalence estimates. Differential diagnosis and treatment planning were evaluated across demographic groups using standard statistical tests for significance between groups.

Findings: We found that GPT-4 did not appropriately model the demographic diversity of medical conditions, consistently producing clinical vignettes that stereotype demographic presentations. The differential diagnoses created by GPT-4 for standardised clinical vignettes were more likely to include diagnoses that stereotype certain races, ethnicities, and genders. Assessment and plans created by the model showed significant association between demographic attributes and recommendations for more expensive procedures as well as differences in patient perception.

Interpretation: Our findings highlight the urgent need for comprehensive and transparent bias assessments of LLM tools such as GPT-4 for intended use cases before they are integrated into clinical care. We discuss the potential sources of these biases and potential mitigation strategies before clinical implementation.

Funding: Priscilla Chan and Mark Zuckerberg.

Conflict of interest statement

Declaration of interests: TZ reports no external financial interests; he works in an unpaid role as a clinical consultant with Xyla. EL reports personal fees and equity from Xyla. MS reports personal fees from Xyla and serves as an intern at Microsoft Research. LAC reports travel support from Australia New Zealand College of Intensive Care Medicine, cloud credits from Oracle, Amazon, and Google, and a role as Editor-in-Chief of PLOS Digital Health. JG reports support from the US National Science Foundation (grant #1928481), Radiological Society of North America (grant #EIHD2204), National Institutes of Health (grants 75N92020C00008 and 75N920), AIM-AHEAD, DeepLook, Clarity consortium, and GE Edison; received honoraria from the National Bureau of Economic Research; and has leadership roles with SIIM, HL7, and the ACR Advisory Committee. R-EEA is an employee of Massachusetts Medical Society, which owns NEJM Healer (NEJM Healer cases were used in the study). DWB reports grants and personal fees from EarlySense; personal fees from CDI Negev; equity from ValeraHealth, Clew, MDClone, and Guided Clinical Solutions; personal fees and equity from AESOP and Feelbetter; and grants from IBM Watson Health, outside the submitted work. DWB also has a patent pending (PHC-028564US PCT) on intraoperative clinical decision support.
AJB is a cofounder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation and in the recent past, to Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson & Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. AJB also receives royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii and Personalis. AJB's research has been funded by the National Institutes of Health, Peraton (as the prime on a National Institutes of Health contract), Genentech, Johnson & Johnson, US Food and Drug Administration, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor's Office of Planning and Research, California Institute for Regenerative Medicine, L’Oreal, and Progenity. EA reports personal fees from Canopy Innovations, Fourier Health, and Xyla; and grants from Microsoft Research. 
None of these entities had any role in the design, execution, evaluation, or writing of this manuscript. All other authors declare no competing interests.

Comment in

Preventing harm from non-conscious bias in medical generative AI.
Hastings J. Lancet Digit Health. 2024 Jan;6(1):e2-e3. doi: 10.1016/S2589-7500(23)00246-7. PMID: 38123253. No abstract available.
Migration background, skin colour, gender, and infectious disease presentation in clinical vignettes.
Lohse Y, Last K, Darici D, Becker SL, Papan C. Lancet Digit Health. 2024 Aug;6(8):e539-e540. doi: 10.1016/S2589-7500(24)00112-2. PMID: 39059884. No abstract available.
