Nat Med. 2024 Dec;30(12):3590-3600.
doi: 10.1038/s41591-024-03258-2. Epub 2024 Sep 23.

A toolbox for surfacing health equity harms and biases in large language models

Stephen R Pfohl et al. Nat Med. 2024 Dec.

Abstract

Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.

Conflict of interest statement

Competing interests: This study was funded by Google LLC and/or its subsidiary (Google). S.R.P., H.C.-L., R.S., D.N., M.A., A. Dieng, N.T., Q.M.R., S.A., N.R., Y.L., M.S., A.W., A.P., C.N., P.S., A. Dewitt, P.M., S.P., K.H., A.K., C.S., J.B., G.C., Y.M., J.S.-L., I.H. and K.S. are employees of Google and may own stock as a part of a standard compensation package. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of our main contributions.
We employ an iterative, participatory approach to design human assessment rubrics for surfacing health equity harms and biases; introduce EquityMedQA, a collection of seven newly released adversarial medical question-answering datasets enriched for equity-related content that substantially expands the volume and breadth of previously studied adversarial data for medical question answering; and perform a large-scale empirical study of health equity-related biases in LLMs.
Fig. 2
Fig. 2. Results of independent evaluation of bias in Med-PaLM 2 answers.
We report the rate at which physician and health equity expert raters reported minor or severe bias in Med-PaLM 2 answers for each dataset and dimension of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. Statistics for multiply rated datasets (Mixed MMQA–OMAQ and Omiye et al.) were computed by pooling over replicates, with the level of replication indicated in parentheses. Data are reported as proportions with 95% CIs.
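As a rough illustration of this kind of reporting (the caption does not specify the interval method, so the Wilson score interval and the ratings below are assumptions for the sketch, not the authors' analysis), binary bias labels pooled over replicates can be summarized as a proportion with a 95% CI:

    # Minimal sketch, not the authors' analysis code: pool binary bias
    # ratings over replicates and report the rate with a Wilson 95% CI.
    # The `ratings` list is hypothetical; the interval method used in the
    # paper is not stated in this caption, so Wilson is an assumption.
    from math import sqrt

    def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """Wilson score interval for a binomial proportion k/n."""
        p = k / n
        denom = 1.0 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return max(0.0, centre - half), min(1.0, centre + half)

    # One 0/1 label per (answer, replicate): 1 = minor or severe bias reported.
    ratings = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    k, n = sum(ratings), len(ratings)
    lo, hi = wilson_ci(k, n)
    print(f"bias-report rate: {k / n:.2f} (95% CI {lo:.2f}-{hi:.2f})")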
Fig. 3
Fig. 3. Results of pairwise evaluation of Med-PaLM 2 answers compared to Med-PaLM and physician answers.
We report the rates at which raters reported a lesser degree of bias in Med-PaLM 2 answers versus comparator answers across datasets, rater types and dimensions of bias. The numbers of answers rated for each dataset are reported in Table 2 and the Methods. The comparator is Med-PaLM in all cases except for physician-written answers to HealthSearchQA questions. Data are reported as proportions with 95% CIs.
Fig. 4
Fig. 4. Results of counterfactual and independent evaluation on counterfactual datasets.
In the top four rows, we report the rates at which raters reported bias in counterfactual pairs using the proposed counterfactual rubric, as well as the rates at which they reported bias in one, in one or more, or in both of the answers using the independent evaluation rubric, for the CC-Manual (n = 102 pairs, triple replication) and CC-LLM (n = 200 pairs) datasets. For comparison, the bottom row reports independent evaluation results aggregated across all unpaired questions for the CC-Manual (n = 42) and CC-LLM (n = 100) datasets. Data are reported as proportions with 95% CIs.
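For intuition, the counterfactual datasets pair questions that differ only in identifiers such as race or sex. The following is a minimal, hypothetical sketch of how such pairs can be generated from a template; the template and identifier list are illustrative assumptions, not the actual contents of CC-Manual or CC-LLM:

    # Hypothetical sketch of counterfactual pair construction: instantiate
    # one question template with different demographic identifiers, then
    # submit each pair of instantiations for side-by-side bias rating.
    # The template and groups are illustrative, not drawn from EquityMedQA.
    from itertools import combinations

    TEMPLATE = "How should clinicians estimate kidney function for a {group} patient?"
    GROUPS = ["Black", "white", "Asian", "Hispanic"]

    questions = {g: TEMPLATE.format(group=g) for g in GROUPS}

    # Each unordered pair of instantiations forms one counterfactual pair.
    pairs = [(questions[a], questions[b]) for a, b in combinations(GROUPS, 2)]
    print(len(pairs), "counterfactual pairs from one template")
    print(pairs[0][0])
    print(pairs[0][1])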
Extended Data Fig. 1
Extended Data Fig. 1. Summary of pairwise comparisons of counterfactual answers.
Results are pooled over the CC-Manual (n = 102 pairs, triple replication) and CC-LLM (n = 200 pairs) datasets. Top, the counts of pairs of Med-PaLM 2 answers to counterfactual questions reported for each category of answer similarity, stratified by rater group and whether the ideal answers to the counterfactual questions were reported to be (left) the same or (right) different. Bottom, the rate at which counterfactual pairs are reported as containing bias for cases where the ideal answers were reported to be (left) the same and (right) different, stratified by rater group and the reported category of answer similarity. Data are reported as proportions with 95% confidence intervals.
Extended Data Fig. 2
Extended Data Fig. 2. Evaluation of answers to the questions from Omiye et al.
In (A), we show the number of raters, out of five, who reported the presence of bias and its dimensions across the nine questions under the independent rubric. In (B), we show the results of pairwise evaluation of Med-PaLM and Med-PaLM 2 answers for each question and rater.

References

    1. Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 141 (2023).
    2. Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review. Ann. Intern. Med. 177, 210–220 (2024).
    3. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
    4. Singhal, K. et al. Towards expert-level medical question answering with large language models. Preprint at https://arxiv.org/abs/2305.09617 (2023).
    5. Zakka, C. et al. Almanac — retrieval-augmented language models for clinical medicine. NEJM AI https://doi.org/10.1056/aioa2300068 (2024).
