Nat Med. 2024 Dec;30(12):3590-3600.
doi: 10.1038/s41591-024-03258-2. Epub 2024 Sep 23.

A toolbox for surfacing health equity harms and biases in large language models

Stephen R Pfohl et al. Nat Med. 2024 Dec.

Abstract

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.

Conflict of interest statement

Competing interests: This study was funded by Google LLC and/or its subsidiary (Google). S.R.P., H.C.-L., R.S., D.N., M.A., A. Dieng, N.T., Q.M.R., S.A., N.R., Y.L., M.S., A.W., A.P., C.N., P.S., A. Dewitt, P.M., S.P., K.H., A.K., C.S., J.B., G.C., Y.M., J.S.-L., I.H. and K.S. are employees of Google and may own stock as a part of a standard compensation package. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of our main contributions.
We employ an iterative, participatory approach to design human assessment rubrics for surfacing health equity harms and biases; introduce EquityMedQA, a collection of seven newly released adversarial medical question-answering datasets enriched for equity-related content that substantially expands the volume and breadth of previously studied adversarial data for medical question answering; and perform a large-scale empirical study of health equity-related biases in LLMs.
Fig. 2
Fig. 2. Results of independent evaluation of bias in Med-PaLM 2 answers.
We report the rate at which physician and health equity expert raters reported minor or severe bias in Med-PaLM 2 answers for each dataset and dimension of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. Statistics for multiply rated datasets (Mixed MMQA–OMAQ and Omiye et al.) were computed by pooling over replicates, with the level of replication indicated in parentheses. Data are reported as proportions with 95% CIs.
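As a rough illustration of this kind of reporting (the caption does not specify the interval method, so the Wilson score interval and the ratings below are assumptions for the sketch, not the authors' analysis), binary bias labels pooled over replicates can be summarized as a proportion with a 95% CI:

    # Minimal sketch, not the authors' analysis code: pool binary bias
    # ratings over replicates and report the rate with a Wilson 95% CI.
    # The `ratings` list is hypothetical; the interval method used in the
    # paper is not stated in this caption, so Wilson is an assumption.
    from math import sqrt

    def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """Wilson score interval for a binomial proportion k/n."""
        p = k / n
        denom = 1.0 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return max(0.0, centre - half), min(1.0, centre + half)

    # One 0/1 label per (answer, replicate): 1 = minor or severe bias reported.
    ratings = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    k, n = sum(ratings), len(ratings)
    lo, hi = wilson_ci(k, n)
    print(f"bias-report rate: {k / n:.2f} (95% CI {lo:.2f}-{hi:.2f})")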
Fig. 3
Fig. 3. Results of pairwise evaluation of Med-PaLM 2 answers compared to Med-PaLM and physician answers.
We report the rates at which raters reported a lesser degree of bias in Med-PaLM 2 answers versus comparator answers across datasets, rater types and dimensions of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. The comparator is Med-PaLM in all cases except for physician-written answers to HealthSearchQA questions. Data are reported as proportions with 95% CIs.
Fig. 4
Fig. 4. Results of counterfactual and independent evaluation on counterfactual datasets.
In the top four rows, we report the rates at which raters reported bias in counterfactual pairs using the proposed counterfactual rubric, as well as the rates at which they reported bias in one, in one or more, or in both of the answers using the independent evaluation rubric, for the CC-Manual (n = 102 pairs, triple replication) and CC-LLM (n = 200 pairs) datasets. For comparison, the bottom row reports independent evaluation results aggregated across all unpaired questions for the CC-Manual (n = 42) and CC-LLM (n = 100) datasets. Data are reported as proportions with 95% CIs.
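For intuition, the counterfactual datasets pair questions that differ only in identifiers such as race or sex. The following is a minimal, hypothetical sketch of how such pairs can be generated from a template; the template and identifier list are illustrative assumptions, not the actual contents of CC-Manual or CC-LLM:

    # Hypothetical sketch of counterfactual pair construction: instantiate
    # one question template with different demographic identifiers, then
    # submit each pair of instantiations for side-by-side bias rating.
    # The template and groups are illustrative, not drawn from EquityMedQA.
    from itertools import combinations

    TEMPLATE = "How should clinicians estimate kidney function for a {group} patient?"
    GROUPS = ["Black", "white", "Asian", "Hispanic"]

    questions = {g: TEMPLATE.format(group=g) for g in GROUPS}

    # Each unordered pair of instantiations forms one counterfactual pair.
    pairs = [(questions[a], questions[b]) for a, b in combinations(GROUPS, 2)]
    print(len(pairs), "counterfactual pairs from one template")
    print(pairs[0][0])
    print(pairs[0][1])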
Extended Data Fig. 1
Extended Data Fig. 1. Summary of pairwise comparisons of counterfactual answers.
Results are pooled over the CC-Manual (n = 102 pairs, triple replication) and CC-LLM (n = 200 pairs) datasets. Top, the counts of pairs of Med-PaLM 2 answers to counterfactual questions reported for each category of answer similarity, stratified by rater group and whether the ideal answers to the counterfactual questions were reported to be (left) the same or (right) different. Bottom, the rate at which counterfactual pairs are reported as containing bias for cases where the ideal answers were reported to be (left) the same and (right) different, stratified by rater group and the reported category of answer similarity. Data are reported as proportions with 95% confidence intervals.
Extended Data Fig. 2
Extended Data Fig. 2. Evaluation of answers to the questions from Omiye et al.
In (A), we show the number of raters, out of five, who reported the presence of bias and its dimensions across the nine questions under the independent rubric. In (B), we show the results of pairwise evaluation of Med-PaLM and Med-PaLM 2 answers for each question and rater.

References

    1. Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 141 (2023).
    2. Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann. Intern. Med. 177, 210–220 (2024).
    3. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
    4. Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://arxiv.org/abs/2305.09617 (2023).
    5. Zakka, C. et al. Almanac — retrieval-augmented language models for clinical medicine. NEJM AI https://doi.org/10.1056/aioa2300068 (2024).
