2025 Aug 11;25(1):274.
doi: 10.1186/s12911-025-03118-0.

Evaluating gender bias in large language models in long-term care


Sam Rickman. BMC Med Inform Decis Mak.

Abstract

Background: Large language models (LLMs) are being used to reduce the administrative burden in long-term care by automatically generating and summarising case notes. However, LLMs can reproduce bias in their training data. This study evaluates gender bias in summaries of long-term care records generated with two state-of-the-art, open-source LLMs released in 2024: Meta's Llama 3 and Google's Gemma.

Methods: Gender-swapped versions were created of long-term care records for 617 older people from a London local authority. Summaries of male and female versions were generated with Llama 3 and Gemma, as well as benchmark models from Meta and Google released in 2019: T5 and BART. Counterfactual bias was quantified through sentiment analysis alongside an evaluation of word frequency and thematic patterns.
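The gender-swapping step described in the Methods can be sketched as below. This is a minimal illustration, not the paper's implementation: the word map, the handling of capitalisation, and the treatment of the ambiguous "her" (which maps to either "him" or "his") are all simplifying assumptions; the study's actual code is in its GitHub repository.

```python
import re

# Illustrative swap map (an assumption, not the paper's mapping).
# Note "her" is ambiguous: it can correspond to "him" (object) or
# "his" (possessive); this toy version always maps it to "him".
SWAP = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "her",
    "mr": "mrs", "mrs": "mr",
    "man": "woman", "woman": "man",
}

_PATTERN = re.compile(r"\b(" + "|".join(SWAP) + r")\b", re.IGNORECASE)

def gender_swap(text: str) -> str:
    """Return a counterfactual version of `text` with gendered terms swapped,
    preserving the capitalisation of each word's first letter."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, text)

# Example: build the male/female record pair for one case note.
original = "He reported that his pain worsened."
swapped = gender_swap(original)  # "She reported that her pain worsened."
```

Summaries of both versions would then be generated with each model, and a sentiment scorer applied to the paired outputs; a systematic difference between the male and female conditions is the counterfactual bias signal the Results describe.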

Results: The benchmark models exhibited some gender-based variation in output. Llama 3 showed no gender-based differences on any metric. Gemma displayed significant gender-based differences: summaries of male records focused more on physical and mental health issues, the language used for men was more direct, and women's needs were downplayed more often than men's.

Conclusion: Care services are allocated on the basis of need. If women's health issues are underemphasised, this may lead to gender-based disparities in service receipt. LLMs may offer substantial benefits in easing administrative burden, but the findings highlight the variation between state-of-the-art LLMs and the need to evaluate them for bias. The methods in this paper provide a practical framework for quantitative evaluation of gender bias in LLMs. The code is available on GitHub.

Keywords: Bias; Gender; LLMs; Long-term care.


Conflict of interest statement

Ethics approval and consent to participate: This study uses secondary data from administrative records, which were pseudonymised prior to egress to remove identifiable personal information (e.g., names, addresses, NHS numbers, and other unique identifiers). Under the UK General Data Protection Regulation (GDPR), processing of these data was conducted under the legal basis of legitimate interests, which does not require individual opt-in consent. This study was conducted in accordance with the principles of the Declaration of Helsinki. It involved the use of secondary data only, with no direct contact with participants. The data were pseudonymised prior to access and processed in line with established ethical standards for research using routinely collected health and social care records. Individual informed consent was not required, as the project involved no automated decision-making and used pseudonymised data throughout. Ethics approval for the project was granted by the LSE Personal Social Services Research Unit's ethics committee on 30th May 2019, in compliance with the LSE's Research Ethics Policy. A Data Processing Impact Assessment was completed, and the details of the project were made publicly available via a Privacy Notice on the local authority's website, with local opt-out options provided. Approval was also granted by the NHS Confidentiality Advisory Group (CAG) in June 2020 (reference number 20/CAG/0043), with annual renewal.

Consent for publication: Not applicable. This study does not include any individual-level identifying images, names, addresses, locations, or other information that could compromise participant anonymity. All data used in the study were pseudonymised prior to access, and no direct contact with participants occurred.

Competing interests: The author declares no competing interests.
