The absence of longitudinal data limits the accuracy of high-throughput clinical phenotyping for identifying type 2 diabetes mellitus subjects
- PMID: 22762862
- PMCID: PMC3478423
- DOI: 10.1016/j.ijmedinf.2012.05.015
Abstract
Purpose: To evaluate the impact of insufficient longitudinal data on the accuracy of a high-throughput clinical phenotyping (HTCP) algorithm for identifying (1) patients with type 2 diabetes mellitus (T2DM) and (2) patients with no diabetes.
Methods: Retrospective study conducted at Mayo Clinic in Rochester, Minnesota. Eligible subjects were Olmsted County residents with ≥1 Mayo Clinic encounter in each of three time periods: (1) 2007, (2) 1997 through 2006, and (3) before 1997 (N = 54,283). Diabetes-relevant electronic medical record (EMR) data on diagnoses, laboratory results, and medications were used. We employed the HTCP algorithm to categorize individuals as T2DM cases or non-diabetes controls. Taking the categorizations based on the full 11 years (1997-2007) as the gold standard, we compared them with categorizations based on 10 progressively shorter intervals, ranging from 1998-2007 (10-year data) to 2007 alone (1-year data). Positive predictive values (PPVs) and false-negative rates (FNRs) were calculated. McNemar tests were used to determine whether categorizations using shorter time periods differed from the gold standard. Statistical significance was defined as P < 0.05.
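The PPV and FNR comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the study's code: the function name `ppv_and_fnr`, the set-based representation of subjects, and the toy subject IDs are all hypothetical, and the HTCP algorithm itself is assumed to have already produced the two sets of categorized subjects.

```python
from typing import Dict, Hashable, Set


def ppv_and_fnr(gold: Set[Hashable], short: Set[Hashable]) -> Dict[str, float]:
    """Compare a shorter-window categorization against the gold standard.

    gold  : subject IDs placed in the category using the full 1997-2007 data
    short : subject IDs placed in the same category using a shorter window
    """
    tp = len(gold & short)   # in the category under both windows
    fp = len(short - gold)   # in the category only under the shorter window
    fn = len(gold - short)   # missed by the shorter window
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    fnr = fn / (tp + fn) if (tp + fn) else float("nan")
    return {"ppv": ppv, "fnr": fnr}


# The 10 shorter observation windows, from 1998-2007 down to 2007 alone.
windows = [(start, 2007) for start in range(1998, 2008)]

# Toy data (hypothetical subject IDs, not study data).
gold_cases = {1, 2, 3, 4}
short_cases = {1, 2, 3, 5, 6}
result = ppv_and_fnr(gold_cases, short_cases)
```

In this toy example three of the five subjects flagged by the shorter window are gold-standard cases, and one of the four gold-standard cases is missed. The same function would be applied once per window, separately for cases and for controls.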
Results: We identified 2770 T2DM cases and 21,005 controls when the algorithm was applied using 11-year data. Using 2007 data alone, PPVs and FNRs, respectively, were 70% and 25% for case identification and 59% and 67% for control identification. All time frames differed significantly from the gold standard, except for the 10-year period.
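The McNemar comparison between a shorter-window categorization and the gold standard depends only on the discordant pairs, that is, subjects categorized differently under the two windows. The sketch below is one possible formulation, assuming a two-sided exact (binomial) McNemar test; the abstract does not state whether the exact or the chi-squared form was used, and the function name `mcnemar_exact` and the counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b : subjects in the category under the gold standard only
    c : subjects in the category under the shorter window only
    Under the null hypothesis each discordant pair is equally likely
    to fall in either direction, so min(b, c) ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the categorizations agree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy counts (hypothetical, not study data).
p_lopsided = mcnemar_exact(0, 10)   # all disagreements in one direction
p_balanced = mcnemar_exact(5, 5)    # disagreements split evenly
significant = p_lopsided < 0.05     # threshold stated in the Methods
```

Concordant subjects do not enter the statistic, which is why a shorter window can agree with the gold standard for most subjects and still differ significantly when its disagreements run mostly in one direction.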
Conclusions: The accuracy of the algorithm decreased markedly as data were limited to shorter observation periods. This impact should be considered carefully when designing and executing HTCP algorithms.
Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Conflict of interest statement
There are no conflicts of interest.
Similar articles
- A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017 Jan;97:120-127. doi: 10.1016/j.ijmedinf.2016.09.014. Epub 2016 Oct 1. PMID: 27919371. Free PMC article.
- Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus. J Am Med Inform Assoc. 2012 Mar-Apr;19(2):219-24. doi: 10.1136/amiajnl-2011-000597. Epub 2012 Jan 16. PMID: 22249968. Free PMC article.
- Development of Type 2 Diabetes Mellitus Phenotyping Framework Using Expert Knowledge and Machine Learning Approach. J Diabetes Sci Technol. 2017 Jul;11(4):791-799. doi: 10.1177/1932296816681584. Epub 2016 Dec 7. PMID: 27932531. Free PMC article.
- Validating an ontology-based algorithm to identify patients with type 2 diabetes mellitus in electronic health records. Int J Med Inform. 2014 Oct;83(10):768-78. doi: 10.1016/j.ijmedinf.2014.06.002. Epub 2014 Jun 20. PMID: 25011429.
- A Systematic Review of Case-Identification Algorithms Based on Italian Healthcare Administrative Databases for Two Relevant Diseases of the Endocrine System: Diabetes Mellitus and Thyroid Disorders. Epidemiol Prev. 2019 Jul-Aug;43(4 Suppl 2):17-36. doi: 10.19191/EP19.4.S2.P008.089. PMID: 31650804.
Cited by
- Constructing Epidemiologic Cohorts from Electronic Health Record Data. Int J Environ Res Public Health. 2021 Dec 14;18(24):13193. doi: 10.3390/ijerph182413193. PMID: 34948800. Free PMC article.
- Impact of Diverse Data Sources on Computational Phenotyping. Front Genet. 2020 Jun 3;11:556. doi: 10.3389/fgene.2020.00556. eCollection 2020. PMID: 32582289. Free PMC article.
- Comparative effectiveness of explainable machine learning approaches for extrauterine growth restriction classification in preterm infants using longitudinal data. Front Med (Lausanne). 2023 Nov 29;10:1166743. doi: 10.3389/fmed.2023.1166743. eCollection 2023. PMID: 38093981. Free PMC article.
- A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017 Jan;97:120-127. doi: 10.1016/j.ijmedinf.2016.09.014. Epub 2016 Oct 1. PMID: 27919371. Free PMC article.
- Beyond Phecodes: leveraging PheMAP to identify patients lacking diagnosis codes in electronic health records. J Am Med Inform Assoc. 2025 Jun 1;32(6):1007-1014. doi: 10.1093/jamia/ocaf055. PMID: 40156924.