Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals

Pedro L Teixeira¹, Wei-Qi Wei¹, Robert M Cronin¹, Huan Mo¹, Jacob P VanHouten^{1,2}, Robert J Carroll¹, Eric LaRose³, Lisa A Bastarache¹, S Trent Rosenbloom^{1,4}, Todd L Edwards¹, Dan M Roden^{4,5}, Thomas A Lasko¹, Richard A Dart⁶, Anne M Nikolai³, Peggy L Peissig³, Joshua C Denny^{7,4}

Affiliations

¹ Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA.
² Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA.
³ Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, 1000 N Oak Ave - ML8, Marshfield, WI 54449, USA.
⁴ Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA.
⁵ Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, TN, USA.
⁶ Center for Human Genetics, Marshfield Clinic Research Foundation, 1000 N Oak Ave-MLR, Marshfield, WI 54449, USA.
⁷ Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA josh.denny@vanderbilt.edu.

PMID: 27497800
PMCID: PMC5201185
DOI: 10.1093/jamia/ocw071

Pedro L Teixeira et al. J Am Med Inform Assoc. 2017 Jan;24(1):162-171. doi: 10.1093/jamia/ocw071. Epub 2016 Aug 7.

Abstract

Objective: Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls and provide a portable module for implementation at other sites.

Materials and methods: We reviewed the EHRs of 631 individuals followed at Vanderbilt for hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10 of the best-performing algorithms at the Marshfield Clinic.
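As a rough illustration of one input category, the narrative-text search can be thought of as counting pattern matches across a patient's notes and dividing by the number of documents. The sketch below is a minimal example under assumptions: the pattern `HTN_PATTERN`, the function name `regex_feature`, and the sample notes are hypothetical and are not the search terms or code used in the study.

```python
import re

# Hypothetical search pattern; the study's actual terms are not listed here.
HTN_PATTERN = re.compile(r"\b(hypertension|hypertensive|htn)\b", re.IGNORECASE)


def regex_feature(notes):
    """Return the number of pattern matches per document (0.0 if there are no notes)."""
    if not notes:
        return 0.0
    matches = sum(len(HTN_PATTERN.findall(note)) for note in notes)
    return matches / len(notes)


# Made-up notes for a single individual.
notes = [
    "PMH: HTN, on lisinopril.",
    "Assessment and plan: hypertension well controlled.",
    "Follow-up for ankle sprain.",
]
print(regex_feature(notes))
```

A plain match count like this ignores negation ("no history of hypertension") and section context, which is part of why NLP-derived concepts are treated as a separate input category.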

Results: Random forests using billing codes, medications, vitals, and concepts had the best performance with a median area under the receiver operating characteristic curve (AUC) of 0.976. Normalized sums of all 4 categories also performed well (0.959 AUC). The best non-NLP algorithm combined normalized ICD9 codes, medications, and blood pressure readings with a median AUC of 0.948. Blood pressure cutoffs or ICD9 code counts alone had AUCs of 0.854 and 0.908, respectively. Marshfield Clinic results were similar.
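To make the "normalized sum" idea concrete, the following sketch scales each input category across a cohort, adds the scaled values per individual, and scores the result with AUC computed as the fraction of case-control pairs ranked correctly. The abstract does not state the normalization used, so min-max scaling is an assumption here, and all function names and data values are invented for illustration.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalized_sum(feature_columns):
    """Sum the min-max scaled features for each individual (assumed normalization)."""
    scaled = [min_max(col) for col in feature_columns]
    return [sum(vals) for vals in zip(*scaled)]


def auc(scores, labels):
    """AUC as the share of (case, control) pairs where the case scores higher; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up per-individual counts for three input categories.
icd9_counts = [5, 0, 2, 0, 1]
med_counts = [3, 0, 1, 1, 0]
high_bp_counts = [4, 1, 0, 0, 4]
labels = [1, 0, 1, 0, 0]  # 1 = hypertensive, 0 = control

scores = normalized_sum([icd9_counts, med_counts, high_bp_counts])
print(auc(scores, labels))
```

In this toy cohort, the fifth individual is a control with several elevated readings, so the summed score ranks that control above one of the cases and the AUC falls below 1.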

Conclusion: This work shows that billing codes or blood pressure readings alone yield good hypertension classification performance. However, even simple combinations of input categories improve performance. The most complex algorithms classified hypertension with excellent recall and precision.

Keywords: electronic health records; hypertension; machine learning; natural language processing; phenotyping algorithms; random forests.

Figures

**Figure 1.**
Algorithm dataset generation flowchart. We randomly sampled 631 adults for the initial population. We limited sampling to concepts that were in high-yield sections, which included “history of present illness,” “past medical history,” and “assessment and plan.” Billing codes were available as structured data, and hypertension-related codes were physician-curated. We also separated inpatient and outpatient vitals using Current Procedural Terminology (CPT) codes.

**Figure 2.**
Random forests trained on combinations of categories perform best. We did 1000-iteration bootstrap runs for each category of features as well as increasingly comprehensive combinations of categories for successively larger training set sizes from 25 to 600. Labels indicate the set of categories used for each learning curve. Other combinations were tested but were similar to the included examples. The graph below includes the median AUC for each learning curve in addition to the upper and lower bounds of the 95% confidence interval. For reference, lines representing the median AUC for 2 simple methods are included—hypertension (HTN) ICD9 counts and the sum of unique normalized ICD9 codes, medications, blood pressure (BP) readings, and regular expression (RegEx) matches normalized by document counts.
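The median AUC and 95% interval described in this caption can be approximated with a percentile bootstrap. The sketch below is a simplification under stated assumptions: it resamples a fixed set of scores with replacement rather than retraining a model on each draw as the study's learning curves do, and the names `bootstrap_auc`, `scores`, and `labels` plus the data are hypothetical.

```python
import random
import statistics


def auc(scores, labels):
    """Share of (case, control) pairs where the case scores higher; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(scores, labels, iterations=1000, seed=0):
    """Return (median, lower, upper) AUC from a percentile bootstrap."""
    if len(set(labels)) < 2:
        raise ValueError("need both cases and controls")
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < iterations:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lack a case or a control
        aucs.append(auc([scores[i] for i in idx], ys))
    aucs.sort()
    # Index-based approximation of the 2.5th and 97.5th percentiles.
    lower = aucs[int(0.025 * (iterations - 1))]
    upper = aucs[int(0.975 * (iterations - 1))]
    return statistics.median(aucs), lower, upper


# Made-up scores for ten individuals (1 = hypertensive, 0 = control).
scores = [0.95, 0.80, 0.70, 0.65, 0.40, 0.60, 0.30, 0.20, 0.15, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

median_auc, ci_low, ci_high = bootstrap_auc(scores, labels)
print(median_auc, ci_low, ci_high)
```

Fixing the seed keeps the run reproducible; with a real model, each iteration would also refit on the sampled training set and score the held-out individuals.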

**Figure 3.**
Algorithm performance. Median AUC and 95% confidence intervals (CI) for the 1000-iteration bootstrap are depicted across all random forests, representative simple algorithms, and representative individual features. Diamonds indicate the median AUC, and dashes indicate the upper and lower bounds of the 95% CI. Based on a comparison of 95% CIs, the top 6 algorithms by median AUC are statistically significantly better than the lowest 41 of the 56 included.

**Figure 4.**
Combination methods achieve the highest AUC. We include the ROC curve from the 50th-percentile run of the 1000-iteration bootstrap below. Numbers in parentheses represent the median AUCs from the bootstrap model. The random forest model represented here is the best-performing RF model from Figure 2. The best simple algorithm is the sum of unique normalized hypertension ICD9 codes, medications, blood pressures, and regular expression matches normalized by the number of documents.

**Figure 5.**
Histogram showing prediction separation between cases and controls. The top column segments, biased toward the right (1.0), are the counts of hypertensive individuals with a mean random forest prediction (each taken from a test set not used for training) within the bin range listed along the x-axis. The bottom column segments represent the counts of controls in each bin range. Individuals with an unexpected score (< 0.5 for cases, > 0.5 for controls) were reviewed.
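The review rule in this caption (cases scoring below 0.5, controls scoring above 0.5) can be sketched as a simple filter over mean held-out predictions. The function name `flag_for_review`, the patient identifiers, and the prediction values below are hypothetical and serve only to illustrate the rule.

```python
from statistics import mean


def flag_for_review(predictions, labels, threshold=0.5):
    """Return (patient_id, mean_score) pairs whose mean score contradicts the label."""
    flagged = []
    for pid, preds in predictions.items():
        score = mean(preds)
        is_case = labels[pid]
        if (is_case and score < threshold) or (not is_case and score > threshold):
            flagged.append((pid, score))
    return flagged


# Made-up held-out predictions per individual; True = case, False = control.
predictions = {
    "p1": [0.9, 0.8],
    "p2": [0.3, 0.4],
    "p3": [0.7, 0.9],
    "p4": [0.1, 0.2],
}
labels = {"p1": True, "p2": True, "p3": False, "p4": False}

for pid, score in flag_for_review(predictions, labels):
    print(pid, round(score, 2))
```

A mean score exactly at the threshold is not flagged, which matches the strict inequalities given in the caption.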
