DEVELOPMENT AND PERFORMANCE OF TEXT-MINING ALGORITHMS TO EXTRACT SOCIOECONOMIC STATUS FROM DE-IDENTIFIED ELECTRONIC HEALTH RECORDS

Brittany M Hollister¹, Nicole A Restrepo, Eric Farber-Eger, Dana C Crawford, Melinda C Aldrich, Amy Non

Affiliation

¹ Vanderbilt Genetics Institute, Vanderbilt University, 519 Light Hall, 2215 Garland Ave. South, Nashville, TN, 37232, USA, Brittany.M.Hollister@Vanderbilt.edu.

PMID: 27896978
PMCID: PMC5147499
DOI: 10.1142/9789813207813_0023

Brittany M Hollister et al. Pac Symp Biocomput. 2017;22:230-241. doi: 10.1142/9789813207813_0023.

Abstract

Socioeconomic status (SES) is a fundamental contributor to health, and a key factor underlying racial disparities in disease. However, SES data are rarely included in genetic studies, due in part to the difficulty of collecting these data when studies were not originally designed for that purpose. The emergence of large clinic-based biobanks linked to electronic health records (EHRs) provides research access to large patient populations with longitudinal phenotype data captured in structured fields as billing codes, procedure codes, and prescriptions. SES data, however, are often not explicitly recorded in structured fields, but rather recorded in the free text of clinical notes and communications. The content and completeness of these data vary widely by practitioner. To enable gene-environment studies that consider SES as an exposure, we sought to extract SES variables from racial/ethnic minority adult patients (n=9,977) in BioVU, the Vanderbilt University Medical Center biorepository linked to de-identified EHRs. We developed several measures of SES using information available within the de-identified EHR, including broad categories of occupation, education, insurance status, and homelessness. Two hundred patients were randomly selected for manual review to develop a set of seven algorithms for extracting SES information from de-identified EHRs. The algorithms consist of 15 categories of information, with 830 unique search terms. SES data extracted from manual review of 50 randomly selected records were compared to data produced by the algorithm, resulting in positive predictive values of 80.0% (education), 85.4% (occupation), 87.5% (unemployment), 63.6% (retirement), 23.1% (uninsured), 81.8% (Medicaid), and 33.3% (homelessness), suggesting that some categories of SES data are easier to extract in this EHR than others. The SES data extraction approach developed here will enable future EHR-based genetic studies to integrate SES information into statistical analyses. Ultimately, incorporation of measures of SES into genetic studies will help elucidate the impact of the social environment on disease risk and outcomes.
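
The search-term extraction and the validation against manual review described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the search terms, category names, record IDs, note text, and function names below are all hypothetical, and matching is assumed to be a plain case-insensitive whole-phrase search. Positive predictive value (PPV) is computed in the usual way, as the fraction of algorithm-flagged records that manual review confirms.

```python
import re

# Hypothetical search terms for illustration only; the published
# algorithms use 830 unique terms that are not listed in the abstract.
SEARCH_TERMS = {
    "unemployment": ["unemployed", "out of work"],
    "medicaid": ["medicaid"],
    "homelessness": ["homeless", "lives in a shelter"],
}


def compile_patterns(terms_by_category):
    """Build one case-insensitive whole-phrase regex per category."""
    return {
        category: re.compile(
            r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
            re.IGNORECASE,
        )
        for category, terms in terms_by_category.items()
    }


def flag_categories(note_text, patterns):
    """Return the set of categories whose terms appear in the note."""
    return {c for c, p in patterns.items() if p.search(note_text)}


def positive_predictive_value(flagged_ids, confirmed_ids):
    """PPV = true positives / all flagged records.

    Returns None when nothing was flagged, since PPV is undefined.
    """
    flagged = set(flagged_ids)
    if not flagged:
        return None
    true_positives = len(flagged & set(confirmed_ids))
    return true_positives / len(flagged)


# Made-up notes keyed by made-up record IDs.
notes = {
    "r1": "Patient is currently unemployed and lives with family.",
    "r2": "Denies being homeless; has stable housing.",
    "r3": "Insurance: Medicaid.",
}
patterns = compile_patterns(SEARCH_TERMS)
flags = {rid: flag_categories(text, patterns) for rid, text in notes.items()}

# Records flagged for homelessness versus those a reviewer confirmed.
flagged_homeless = [rid for rid, cats in flags.items() if "homelessness" in cats]
homeless_ppv = positive_predictive_value(flagged_homeless, confirmed_ids=[])
```

In this toy data the negated mention in `r2` is still flagged, which shows one plausible way a plain term search can produce false positives; whether this explains the lower PPVs reported for some categories is not stated in the abstract.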

Figures

**Figure 1**
Overview of the development process for the SES algorithms
