Mitigating the risk of health inequity exacerbated by large language models

Yuelyu Ji et al. NPJ Digit Med. 2025 May 4;8(1):246. doi: 10.1038/s41746-025-01576-4.

Abstract

Recent advancements in large language models (LLMs) have demonstrated their potential in numerous medical applications, particularly in automating clinical trial matching for translational research and enhancing medical question-answering for clinical decision support. However, our study shows that incorporating non-decisive socio-demographic factors, such as race, sex, income level, LGBT+ status, homelessness, illiteracy, disability, and unemployment, into the input of LLMs can lead to incorrect and harmful outputs. These discrepancies could worsen existing health disparities if LLMs are broadly implemented in healthcare. To address this issue, we introduce EquityGuard, a novel framework designed to detect and mitigate the risk of health inequities in LLM-based medical applications. Our evaluation demonstrates its effectiveness in promoting equitable outcomes across diverse populations.


Conflict of interest statement

Competing interests: Y.W. has ownership and equity in BonafideNLP, LLC, and S.V. has ownership and equity in Kvatchii, Ltd., READE.ai, Inc., and ThetaRho, Inc. The other authors declare no competing interests.

Figures

Fig. 1. Inequities when applying LLMs to two major medical applications.
Clinical Trial Matching (left) and Medical Question Answering (right). On the left, including race and sex information (e.g., “African-American” and “woman”) in the patient note, despite being irrelevant to matching the correct clinical trials, altered the clinical trial recommendations generated by the LLMs. On the right, adding race information (e.g., “Native American”) to the question, which should not affect the response, led to incorrect answers from the LLMs. These examples show that non-decisive socio-demographic factors can lead to incorrect LLM outputs for different patient populations, which may cause harmful clinical outcomes and ultimately exacerbate healthcare inequities.
Fig. 2. Performance of various LLMs when specific SDOH factors were introduced into the dataset.
The clinical trial matching (CTM) performance is measured using NDCG@10 (higher is better), while the medical question answering (MQA) performance is measured using error rate (lower is better). SDOH factors include race, sex, low income, LGBT+ status, homelessness, illiteracy, disability, and unemployment. Each sensitive attribute was incorporated into the input data for both CTM and MQA tasks during the evaluation.
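For reference, the following is a minimal sketch of how NDCG@10 can be computed for a CTM ranking, assuming binary relevance labels for the retrieved trials; the paper's exact relevance grading and evaluation pipeline are not reproduced here.

    import math

    def dcg_at_k(relevances, k=10):
        # Discounted cumulative gain over the top-k ranked trials.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

    def ndcg_at_k(ranked_relevances, k=10):
        # Normalize by the DCG of an ideally ordered ranking; higher is better.
        ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
        return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Hypothetical relevance labels for the top trials an LLM ranked for one patient note.
    print(ndcg_at_k([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))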
Fig. 3. Fairness metrics including equal opportunity (EO) and demographic parity (DP) to assess equity in LLMs.
Higher EO and DP scores indicate better equity, with EO focusing on ensuring equal positive outcomes for qualified individuals across groups and DP evaluating overall equity across all groups.
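As a rough illustration, the per-group rates that underlie EO (true positive rate per group) and DP (selection rate per group) for binary predictions can be computed as in the sketch below; how these rates are aggregated into the single scores plotted in Fig. 3 is not specified here, so the example should be read as illustrative only.

    def group_rates(y_true, y_pred, groups):
        # Per-group selection rate (demographic parity) and true positive rate
        # (equal opportunity) for binary labels and predictions.
        dp, eo = {}, {}
        for g in set(groups):
            idx = [i for i, gi in enumerate(groups) if gi == g]
            dp[g] = sum(y_pred[i] for i in idx) / len(idx)
            pos = [i for i in idx if y_true[i] == 1]
            eo[g] = sum(y_pred[i] for i in pos) / len(pos) if pos else float("nan")
        return dp, eo

    # Toy example with two demographic groups; smaller gaps between groups
    # indicate more equitable model behavior.
    dp, eo = group_rates([1, 0, 1, 1], [1, 0, 0, 1], ["A", "A", "B", "B"])
    print(dp, eo)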
Fig. 4. Correlation heatmaps of inequity categories in CTM and MQA tasks.
The left plot shows the correlation between inequity categories in CTM tasks, illustrating how different inequity-modified queries resulted in similar trial rankings or selections by the models. The right plot shows the correlation between inequity categories in MQA tasks, displaying how often different inequity-modified queries led to the same answers or error patterns. These heatmaps help analyze how inequities across categories are interconnected, impacting model fairness.
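The analysis behind Fig. 4 can be approximated by comparing model outputs across inequity-modified variants of the same queries and visualizing their pairwise similarity as a heatmap; the sketch below uses synthetic data and a simple agreement rate in place of the paper's correlation measure, so the category list and numbers are purely illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    categories = ["race", "sex", "low income", "LGBT+", "homelessness"]
    rng = np.random.default_rng(0)
    # outputs[c][q]: model answer (or top-ranked trial) for query q when the
    # input is modified with socio-demographic category c (synthetic here).
    outputs = {c: rng.integers(0, 4, size=50) for c in categories}

    agreement = np.array([[np.mean(outputs[a] == outputs[b]) for b in categories]
                          for a in categories])

    plt.imshow(agreement, vmin=0, vmax=1, cmap="viridis")
    plt.xticks(range(len(categories)), categories, rotation=45, ha="right")
    plt.yticks(range(len(categories)), categories)
    plt.colorbar(label="agreement rate")
    plt.tight_layout()
    plt.show()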
Fig. 5. Equal opportunity (EO) and demographic parity (DP) metrics for LLaMA3 8B models.
Models trained with EquityGuard (w/ EquityGuard) show reduced EO and DP differences, indicating enhanced fairness.
Fig. 6. An overview of the EquityGuard framework for inequity detection and correction.
