Int J Med Inform. 2021 Dec:156:104588.
doi: 10.1016/j.ijmedinf.2021.104588. Epub 2021 Sep 21.

Subcategorizing EHR diagnosis codes to improve clinical application of machine learning models

Andrew P Reimer et al. Int J Med Inform. 2021 Dec.

Abstract

Background: Electronic health record (EHR) data is commonly used for secondary purposes such as research and clinical decision support. However, reuse of EHR data presents several challenges including but not limited to identifying all diagnoses associated with a patient's clinical encounter. The purpose of this study was to assess the feasibility of developing a schema to identify and subclassify all structured diagnosis codes for a patient encounter.

Methods: To develop a subclassification schema, we used EHR data from an interhospital transport data repository that contained complete hospital encounter-level data. Eight discrete data sources containing structured diagnosis codes were identified. Diagnosis codes were normalized using the Unified Medical Language System, and additional EHR data were combined with standardized terminologies to create and validate the subcategories. We then used random forest models to assess how well the new subcategorized diagnoses predict post-interhospital transfer mortality, building two models: one using standard diagnosis codes and one using the new subcategorized diagnosis codes.
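The difference between the two feature sets can be illustrated with a minimal sketch. It is not the authors' pipeline: the source names, the `SOURCE_TO_SUBCATEGORY` mapping, the helper functions, and the example encounter are all hypothetical, and the actual rules are those summarized in the paper's Figure 1. The sketch only shows how tagging each code with the subcategory implied by its data source can turn one code into several distinct model features.

```python
# Hypothetical mapping from EHR data source to diagnosis subcategory.
# Source names are illustrative assumptions, not taken from the paper.
SOURCE_TO_SUBCATEGORY = {
    "admission_dx": "primary_or_admitting",
    "medical_history": "history",
    "problem_list": "problem_list",
    "billing_secondary": "comorbidity",
    "discharge_dx": "discharge",
}
UNMAPPED = "unmapped"  # codes whose source matches no rule


def standard_features(records):
    """One feature per diagnosis code, ignoring where it was recorded."""
    return sorted({code for _source, code in records})


def subcategorized_features(records):
    """One feature per (subcategory, code) pair, keeping the code's role."""
    return sorted(
        {
            f"{SOURCE_TO_SUBCATEGORY.get(source, UNMAPPED)}:{code}"
            for source, code in records
        }
    )


# A made-up encounter as (data source, diagnosis code) pairs.
encounter = [
    ("admission_dx", "I21.4"),
    ("problem_list", "I10"),
    ("medical_history", "I10"),
    ("free_text_import", "E11.9"),  # unknown source, so it stays unmapped
]

print(standard_features(encounter))
print(subcategorized_features(encounter))
```

In this toy encounter the code I10 appears once in the standard feature set but twice in the subcategorized set, once as a problem-list entry and once as history, which is the kind of extra signal the second model can draw on.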

Results: Six subcategories of diagnoses were identified and validated. The subcategories included: primary or admitting diagnoses (10%), past medical, surgical or social history (9%), problem list (20%), comorbidity (24%), discharge diagnoses (6%), and unmapped diagnoses (31%). The subcategorized model outperformed the standard model, achieving a training AUROC of 0.97 versus 0.95 and a testing AUROC of 0.81 versus 0.46.
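For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. A value of 0.5 corresponds to chance-level ranking, so a testing AUROC of 0.46 is slightly below chance. The sketch below computes the metric by pairwise comparison; the function name and the toy labels and scores are illustrative assumptions and are not data from the study.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 outcomes (1 = event, e.g. death).
    scores: predicted risk for each case, same order as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 2 positives and 2 negatives give 4 pairs, 3 ranked correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(toy_labels, toy_scores))  # 0.75
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; rank-based implementations are the usual choice for large datasets.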

Discussion: Our work demonstrates that merging structured diagnosis codes with additional EHR data and secondary data sources provides additional information to understand the role of diagnosis throughout a clinical encounter and improves predictive model performance. Further work is necessary to assess if subcategorizing produces benefits in interpreting the results of prognostic models and/or operationalizing the results in clinical decision support applications.

Keywords: Data management; Electronic data processing; Electronic health records; Machine learning.

Conflict of interest statement

The authors have no competing interests to declare.

Figures

Figure 1. Diagnosis subcategories, data sources and rules
Figure 2A. Random Forest variable importance plot by diagnosis subcategory: all variables
Figure 2B. Random Forest variable importance plot: uncategorized diagnosis data
Figure 3A. Random Forest variable importance plot by diagnosis subcategory: top 50 variables
Figure 3B. Random Forest variable importance plot of uncategorized diagnosis data: top 50 variables
