Subcategorizing EHR diagnosis codes to improve clinical application of machine learning models
- PMID: 34607290
- PMCID: PMC8571032
- DOI: 10.1016/j.ijmedinf.2021.104588
Abstract
Background: Electronic health record (EHR) data is commonly used for secondary purposes such as research and clinical decision support. However, reuse of EHR data presents several challenges including but not limited to identifying all diagnoses associated with a patient's clinical encounter. The purpose of this study was to assess the feasibility of developing a schema to identify and subclassify all structured diagnosis codes for a patient encounter.
Methods: To develop a subclassification schema, we used EHR data from an interhospital transport data repository that contained complete hospital encounter-level data. Eight discrete data sources containing structured diagnosis codes were identified. Diagnosis codes were normalized using the Unified Medical Language System, and additional EHR data were combined with standardized terminologies to create and validate the subcategories. We then used a random forest to assess how useful the new subcategorized diagnoses were for predicting post-interhospital transfer mortality by building two models: one using standard diagnosis codes and one using the new subcategorized diagnosis codes.
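The difference between the two model inputs can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the source names, the source-to-subcategory mapping, and the ICD-10-CM-style codes are all hypothetical placeholders, since the abstract does not list the eight data sources or the normalized code format.

```python
from collections import Counter

# Hypothetical mapping from an EHR data source to a diagnosis subcategory.
# The abstract names the subcategories but not the sources, so these keys
# are assumptions made only for illustration.
SOURCE_TO_SUBCATEGORY = {
    "admission_record": "primary_or_admitting",
    "history_form": "history",
    "problem_list": "problem_list",
    "billing_secondary": "comorbidity",
    "discharge_summary": "discharge",
}


def standard_features(diagnoses):
    """Count each normalized code, ignoring where it was recorded."""
    return Counter(code for code, _source in diagnoses)


def subcategorized_features(diagnoses):
    """Count each (subcategory, code) pair; unknown sources become 'unmapped'."""
    return Counter(
        f"{SOURCE_TO_SUBCATEGORY.get(source, 'unmapped')}:{code}"
        for code, source in diagnoses
    )


# One encounter as (code, source) pairs; the values are made up.
encounter = [
    ("I50.9", "admission_record"),
    ("I50.9", "problem_list"),
    ("E11.9", "history_form"),
    ("N17.9", "external_feed"),  # source not in the mapping
]

standard = standard_features(encounter)
subcategorized = subcategorized_features(encounter)
```

In this sketch the standard encoding collapses both occurrences of I50.9 into a single feature, while the subcategorized encoding keeps the admitting diagnosis and the problem-list entry as separate features that a classifier such as a random forest could weight differently.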
Results: Six subcategories of diagnoses were identified and validated. The subcategories were: primary or admitting diagnoses (10%), past medical, surgical or social history (9%), problem list (20%), comorbidity (24%), discharge diagnoses (6%), and unmapped diagnoses (31%). The subcategorized model outperformed the standard model, achieving a training AUROC of 0.97 versus 0.95 and a testing AUROC of 0.81 versus 0.46.
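For readers less familiar with the metric, AUROC is the probability that a randomly chosen positive case (here, a death after transfer) receives a higher predicted score than a randomly chosen negative case, so 0.5 corresponds to chance. The sketch below computes it by direct pairwise comparison; the labels and scores are invented for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented example: 2 positives, 3 negatives, so 6 pairs in total.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
example_auroc = auroc(labels, scores)  # 5 of 6 pairs ranked correctly
```

The pairwise form is quadratic in the number of cases and is meant only to make the definition concrete; rank-based implementations are used in practice.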
Discussion: Our work demonstrates that merging structured diagnosis codes with additional EHR data and secondary data sources provides additional information for understanding the role of diagnoses throughout a clinical encounter and improves predictive model performance. Further work is necessary to assess whether subcategorizing produces benefits in interpreting the results of prognostic models and/or operationalizing the results in clinical decision support applications.
Keywords: Data management; Electronic data processing; Electronic health records; Machine learning.
Copyright © 2021 Elsevier B.V. All rights reserved.
Conflict of interest statement
The authors have no competing interests to declare.