Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus
- PMID: 22249968
- PMCID: PMC3277630
- DOI: 10.1136/amiajnl-2011-000597
Abstract
Objective: To evaluate the effect of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping (HTCP) algorithm developed to differentiate (1) patients with type 2 diabetes mellitus (T2DM) and (2) patients with no diabetes.
Materials and methods: This population-based study identified all Olmsted County, Minnesota residents in 2007. We used provider-linked electronic medical record data from the two healthcare centers that provide >95% of all care to County residents (ie, Olmsted Medical Center and Mayo Clinic in Rochester, Minnesota, USA). Subjects were limited to residents with one or more encounters at both healthcare centers from January 1, 2006 through December 31, 2007. DM-relevant data on diagnoses, laboratory results, and medications from both centers were obtained for this period. The algorithm was first executed using data from both centers (ie, the gold standard) and then using data from Mayo Clinic alone. Positive predictive values and false-negative rates were calculated, and the McNemar test was used to compare the categorization obtained from Mayo Clinic data alone with the gold standard. Age and sex were compared between true-positive and false-negative subjects with T2DM. Statistical significance was accepted as p<0.05.
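As a rough illustration of the McNemar comparison described above, the sketch below computes the test statistic and p value from the two discordant cell counts of a paired 2x2 table. The function name `mcnemar_test`, the choice of the uncorrected chi-square form, and the example counts are assumptions made for illustration; the abstract does not state which variant of the test or which software was used.

```python
import math


def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """Uncorrected McNemar test on the discordant pairs of a paired 2x2 table.

    b: subjects classified positive by the gold standard but negative by the
       single-center run (hypothetical cell label).
    c: subjects classified negative by the gold standard but positive by the
       single-center run (hypothetical cell label).
    Returns (chi-square statistic with 1 degree of freedom, p value).
    """
    if b + c == 0:
        raise ValueError("McNemar test is undefined with no discordant pairs")
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Illustrative counts only; these are NOT taken from the study.
stat, p = mcnemar_test(b=40, c=10)
print(f"chi-square = {stat:.1f}, significant at 0.05: {p < 0.05}")
```

With few discordant pairs, an exact binomial version of the test is usually preferred over this chi-square approximation.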
Results: With data from both medical centers, 765 subjects with T2DM and 4256 non-DM subjects were identified. When single-center data were used, 252 T2DM subjects (1573 non-DM subjects) were missed, and an additional 27 T2DM subjects (215 non-DM subjects) were identified as false positives. The positive predictive values and false-negative rates were 95.0% (513/540) and 32.9% (252/765), respectively, for T2DM subjects and 92.6% (2683/2898) and 37.0% (1573/4256), respectively, for non-DM subjects. Age and sex distribution differed between true-positive (mean age 62.1; 45% female) and false-negative (mean age 65.0; 56.0% female) T2DM subjects.
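The arithmetic behind these percentages can be retraced with a short sketch. It uses only the counts reported above; the class and field names (`Comparison`, `gold_positive`, `missed`, `false_positive`) are hypothetical labels chosen for this example, not terms from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """Single-center result compared with the two-center gold standard."""

    gold_positive: int   # subjects identified using data from both centers
    missed: int          # gold-standard subjects not found with single-center data
    false_positive: int  # found with single-center data but absent from the gold standard

    @property
    def true_positive(self) -> int:
        return self.gold_positive - self.missed

    def ppv(self) -> float:
        """Positive predictive value: TP / (TP + FP)."""
        return self.true_positive / (self.true_positive + self.false_positive)

    def fnr(self) -> float:
        """False-negative rate: missed / gold-standard positives."""
        return self.missed / self.gold_positive


# Counts as reported in the Results paragraph.
t2dm = Comparison(gold_positive=765, missed=252, false_positive=27)
non_dm = Comparison(gold_positive=4256, missed=1573, false_positive=215)

for label, group in (("T2DM", t2dm), ("non-DM", non_dm)):
    print(f"{label}: PPV {group.ppv():.1%}, FNR {group.fnr():.1%}")
# T2DM: PPV 95.0%, FNR 32.9%
# non-DM: PPV 92.6%, FNR 37.0%
```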
Conclusion: The findings show that application of an HTCP algorithm using data from a single medical center contributes to misclassification. These findings should be considered carefully by researchers when developing and executing HTCP algorithms.