Mitigating the risk of health inequity exacerbated by large language models

Yuelyu Ji et al. NPJ Digit Med. 2025 May 4;8(1):246. doi: 10.1038/s41746-025-01576-4.

Abstract

Recent advancements in large language models (LLMs) have demonstrated their potential in numerous medical applications, particularly in automating clinical trial matching for translational research and enhancing medical question-answering for clinical decision support. However, our study shows that incorporating non-decisive socio-demographic factors, such as race, sex, income level, LGBT+ status, homelessness, illiteracy, disability, and unemployment, into the input of LLMs can lead to incorrect and harmful outputs. These discrepancies could worsen existing health disparities if LLMs are broadly implemented in healthcare. To address this issue, we introduce EquityGuard, a novel framework designed to detect and mitigate the risk of health inequities in LLM-based medical applications. Our evaluation demonstrates its effectiveness in promoting equitable outcomes across diverse populations.


Conflict of interest statement

Competing interests: Y.W. has ownership and equity in BonafideNLP, LLC, and S.V. has ownership and equity in Kvatchii, Ltd., READE.ai, Inc., and ThetaRho, Inc. The other authors declare no competing interests.

Figures

Fig. 1. Inequities when applying LLMs to two major medical applications.
Clinical Trial Matching (left) and Medical Question Answering (right). On the left, including race and sex information (e.g., “African-American” and “woman”) in the patient note, despite being irrelevant to matching the correct clinical trials, altered the clinical trial recommendations generated by the LLMs. On the right, adding race information (e.g., “Native American”) to the question, which should not affect the response, led to incorrect answers from the LLMs. These examples show that non-decisive socio-demographic factors can lead to incorrect LLM outputs for different patient populations, which may cause harmful clinical outcomes and ultimately exacerbate healthcare inequities.
Fig. 2. Performance of various LLMs when specific SDOH factors were introduced into the dataset.
The clinical trial matching (CTM) performance is measured using NDCG@10 (higher is better), while the medical question answering (MQA) performance is measured using error rate (lower is better). SDOH factors include race, sex, low income, LGBT+ status, homelessness, illiteracy, disability, and unemployment. Each sensitive attribute was incorporated into the input data for both CTM and MQA tasks during the evaluation.
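For reference, the following is a minimal sketch of how NDCG@10 can be computed for a CTM ranking, assuming binary relevance labels for the retrieved trials; the paper's exact relevance grading and evaluation pipeline are not reproduced here.

    import math

    def dcg_at_k(relevances, k=10):
        # Discounted cumulative gain over the top-k ranked trials.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

    def ndcg_at_k(ranked_relevances, k=10):
        # Normalize by the DCG of an ideally ordered ranking; higher is better.
        ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
        return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Hypothetical relevance labels for the top trials an LLM ranked for one patient note.
    print(ndcg_at_k([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))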
Fig. 3. Fairness metrics including equal opportunity (EO) and demographic parity (DP) to assess equity in LLMs.
Higher EO and DP scores indicate better equity, with EO focusing on ensuring equal positive outcomes for qualified individuals across groups and DP evaluating overall equity across all groups.
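As a rough illustration, the per-group rates that underlie EO (true positive rate per group) and DP (selection rate per group) for binary predictions can be computed as in the sketch below; how these rates are aggregated into the single scores plotted in Fig. 3 is not specified here, so the example should be read as illustrative only.

    def group_rates(y_true, y_pred, groups):
        # Per-group selection rate (demographic parity) and true positive rate
        # (equal opportunity) for binary labels and predictions.
        dp, eo = {}, {}
        for g in set(groups):
            idx = [i for i, gi in enumerate(groups) if gi == g]
            dp[g] = sum(y_pred[i] for i in idx) / len(idx)
            pos = [i for i in idx if y_true[i] == 1]
            eo[g] = sum(y_pred[i] for i in pos) / len(pos) if pos else float("nan")
        return dp, eo

    # Toy example with two demographic groups; smaller gaps between groups
    # indicate more equitable model behavior.
    dp, eo = group_rates([1, 0, 1, 1], [1, 0, 0, 1], ["A", "A", "B", "B"])
    print(dp, eo)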
Fig. 4. Correlation heatmaps of inequity categories in CTM and MQA tasks.
The left plot shows the correlation between inequity categories in CTM tasks, illustrating how different inequity-modified queries resulted in similar trial rankings or selections by the models. The right plot shows the correlation between inequity categories in MQA tasks, displaying how often different inequity-modified queries led to the same answers or error patterns. These heatmaps help analyze how inequities across categories are interconnected, impacting model fairness.
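The analysis behind Fig. 4 can be approximated by comparing model outputs across inequity-modified variants of the same queries and visualizing their pairwise similarity as a heatmap; the sketch below uses synthetic data and a simple agreement rate in place of the paper's correlation measure, so the category list and numbers are purely illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    categories = ["race", "sex", "low income", "LGBT+", "homelessness"]
    rng = np.random.default_rng(0)
    # outputs[c][q]: model answer (or top-ranked trial) for query q when the
    # input is modified with socio-demographic category c (synthetic here).
    outputs = {c: rng.integers(0, 4, size=50) for c in categories}

    agreement = np.array([[np.mean(outputs[a] == outputs[b]) for b in categories]
                          for a in categories])

    plt.imshow(agreement, vmin=0, vmax=1, cmap="viridis")
    plt.xticks(range(len(categories)), categories, rotation=45, ha="right")
    plt.yticks(range(len(categories)), categories)
    plt.colorbar(label="agreement rate")
    plt.tight_layout()
    plt.show()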
Fig. 5. Equal opportunity (EO) and demographic parity (DP) metrics for LLaMA3 8B models.
Models trained with EquityGuard (w/ EquityGuard) show reduced EO and DP differences, indicating enhanced fairness.
Fig. 6. An overview of the EquityGuard framework for inequity detection and correction.
