Large Language Models for Social Determinants of Health Information Extraction from Clinical Notes - A Generalizable Approach across Institutions

doi:10.1101/2024.05.21.24307726

This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 May 22:2024.05.21.24307726.

Vipina K Keloth¹, Salih Selek², Qingyu Chen¹, Christopher Gilman¹, Sunyang Fu³, Yifang Dang³, Xinghan Chen⁴, Xinyue Hu⁵, Yujia Zhou¹, Huan He¹, Jungwei W Fan⁶, Karen Wang^{1,7}, Cynthia Brandt¹, Cui Tao^{5,6}, Hongfang Liu³, Hua Xu¹

Affiliations

¹ Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA.
² Department of Psychiatry and Behavioral Sciences, UTHealth McGovern Medical School, Houston, TX, USA.
³ McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.
⁴ School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA.
⁵ Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, USA.
⁶ Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA.
⁷ Equity Research and Innovation Center, Yale School of Medicine, New Haven, CT, USA.

PMID: 38826441
PMCID: PMC11142292
DOI: 10.1101/2024.05.21.24307726

Vipina K Keloth et al. medRxiv. 2024.

Update in

Social determinants of health extraction from clinical notes across institutions using large language models.
Keloth VK, Selek S, Chen Q, Gilman C, Fu S, Dang Y, Chen X, Hu X, Zhou Y, He H, Fan JW, Wang K, Brandt C, Tao C, Liu H, Xu H. NPJ Digit Med. 2025 May 17;8(1):287. doi: 10.1038/s41746-025-01645-8. PMID: 40379919.

Abstract

Consistent and persuasive evidence of the influence of social determinants on health has prompted a growing realization throughout the health care sector that improving health and health equity will likely depend, at least to some extent, on addressing detrimental social determinants. However, detailed social determinants of health (SDoH) information is often buried within clinical narrative text in electronic health records (EHRs), necessitating natural language processing (NLP) methods to extract these details automatically. Most current NLP efforts for SDoH extraction are narrow in scope: they investigate a limited set of SDoH elements, derive data from a single institution, and focus on specific patient cohorts or note types, with little attention to generalizability. This study aims to address these issues by creating cross-institutional corpora spanning different note types and healthcare systems, and by developing and evaluating the generalizability of classification models, including novel large language models (LLMs), for detecting SDoH factors in diverse types of notes from four institutions: Harris County Psychiatric Center, University of Texas Physician Practice, Beth Israel Deaconess Medical Center, and Mayo Clinic. Four corpora of deidentified clinical notes were annotated with 21 SDoH factors at two levels: level 1 with SDoH factor types only, and level 2 with SDoH factors along with their associated values. Three traditional classification algorithms (XGBoost, TextCNN, Sentence BERT) and an instruction-tuned LLM-based approach (LLaMA) were developed to identify multiple SDoH factors. Substantial variation was noted in SDoH documentation practices and label distributions across patient cohorts, note types, and hospitals. The LLM achieved the top performance, with micro-averaged F1 scores over 0.9 on the level 1 annotated corpora and an F1 score over 0.84 on the level 2 annotated corpora. While models performed well when trained and tested on individual datasets, cross-dataset generalization highlighted remaining obstacles. To foster collaboration, a portion of the annotated corpora and the models trained on the merged annotated datasets will be made available on the PhysioNet repository.
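The micro-averaged F1 score reported above pools true positives, false positives, and false negatives across all notes and labels before computing precision and recall. The following is a minimal sketch of that metric for multi-label output, not the authors' evaluation code. The label names are hypothetical placeholders, since the abstract does not list the 21 SDoH factors.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions.

    Counts are pooled over every note and label first, so frequent
    labels contribute more to the score than rare ones.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical level 1 labels (factor types only) for three notes.
gold = [{"housing", "employment"}, {"alcohol_use"}, set()]
pred = [{"housing"}, {"alcohol_use", "tobacco_use"}, set()]
score = micro_f1(gold, pred)
```

Under the same assumptions, level 2 output could be scored with the identical function by encoding each label as a factor and value pair, for example a string such as `"housing=unstable"`.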

Keywords: Social determinants of health; electronic health records; large language models; multi-label classification.

Conflict of interest statement

Competing interests: All authors declare no financial or non-financial competing interests.

Figures

**Figure 1:**
A schematic representation of the study workflow.

**Figure 2:**
Distribution of SDoH factors for all four datasets (level 1 annotation).

**Figure 3:**
Heatmap of cross-dataset performance evaluation showing micro-averaged F1 scores for all models on level 1 annotated corpora.

**Figure 4:**
Heatmap of cross-dataset performance evaluation showing micro-averaged F1 scores for all models on level 2 annotated corpora.
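A cross-dataset evaluation of the kind summarized in these heatmaps trains a model on each corpus and scores it on every corpus, producing one cell per (training set, test set) pair. The sketch below illustrates only that loop. The keyword-lookup model, the site names, and the example notes are toy assumptions standing in for the study's actual models and data.

```python
from typing import Callable, Dict, List, Set, Tuple

Example = Tuple[str, Set[str]]      # (note text, gold label set)
Model = Callable[[str], Set[str]]   # maps note text to predicted labels


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 in its pooled-count form: 2TP / (2TP + FP + FN)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_keyword_model(train: List[Example]) -> Model:
    """Toy stand-in for a classifier: remembers which labels co-occur with each word."""
    vocab: Dict[str, Set[str]] = {}
    for text, labels in train:
        for word in text.lower().split():
            vocab.setdefault(word, set()).update(labels)

    def predict(text: str) -> Set[str]:
        found: Set[str] = set()
        for word in text.lower().split():
            found |= vocab.get(word, set())
        return found

    return predict


def cross_dataset_grid(
    train_sets: Dict[str, List[Example]],
    test_sets: Dict[str, List[Example]],
    fit: Callable[[List[Example]], Model],
) -> Dict[Tuple[str, str], float]:
    """Train on each source corpus and score on every target corpus."""
    grid: Dict[Tuple[str, str], float] = {}
    for source, train in train_sets.items():
        model = fit(train)
        for target, test in test_sets.items():
            gold = [labels for _, labels in test]
            pred = [model(text) for text, _ in test]
            grid[(source, target)] = micro_f1(gold, pred)
    return grid


# Hypothetical sites that document the same factor with different wording.
train_sets = {
    "site_a": [("patient is homeless", {"housing"})],
    "site_b": [("lives in shelter", {"housing"})],
}
test_sets = {
    "site_a": [("homeless since may", {"housing"})],
    "site_b": [("staying at shelter", {"housing"})],
}
grid = cross_dataset_grid(train_sets, test_sets, fit_keyword_model)
```

In this toy setup the diagonal cells score well and the off-diagonal cells do not, because each site phrases the same factor differently. That is one simple way documentation differences between institutions can limit cross-dataset generalization.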
