eLife. 2019 Dec 10;8:e44941.

doi: 10.7554/eLife.44941.

Linking glycemic dysregulation in diabetes to symptoms, comorbidities, and genetics through EHR data mining

Isa Kristina Kirk^#¹, Christian Simon^#¹, Karina Banasik¹, Peter Christoffer Holm¹, Amalie Dahl Haue¹, Peter Bjødstrup Jensen^{1,2}, Lars Juhl Jensen¹, Cristina Leal Rodríguez¹, Mette Krogh Pedersen¹, Robert Eriksson¹, Henrik Ullits Andersen³, Thomas Almdal^{3,4}, Jette Bork-Jensen⁵, Niels Grarup⁵, Knut Borch-Johnsen⁶, Oluf Pedersen^{3,5}, Flemming Pociot^{3,7}, Torben Hansen^{3,5}, Regine Bergholdt³, Peter Rossing^{3,8}, Søren Brunak^{1,9}

Affiliations

¹ Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark.
² Odense Patient Data Explorative Network (OPEN), Odense University Hospital, Odense, Denmark.
³ Steno Diabetes Center Copenhagen, Gentofte, Denmark.
⁴ Department of Endocrinology, Rigshospitalet, Copenhagen, Denmark.
⁵ Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark.
⁶ Holbæk Hospital, Holbæk, Denmark.
⁷ Department of Clinical Medicine, Herlev-Gentofte Hospital, Herlev, Denmark.
⁸ Department of Clinical Medicine, University of Copenhagen, Copenhagen, Denmark.
⁹ Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark.

^# Contributed equally.

PMID: 31818369
PMCID: PMC6904221
DOI: 10.7554/eLife.44941

Isa Kristina Kirk et al. Elife. 2019.

Abstract

Diabetes is a diverse and complex disease, with considerable variation in phenotypic manifestation and severity. This variation hampers the study of etiological differences and reduces the statistical power of analyses of associations with genetics, treatment outcomes, and complications. We address these issues through deep, fine-grained phenotypic stratification of a diabetes cohort. By text mining the electronic health records of 14,017 patients, we matched two controlled vocabularies (ICD-10 and a custom vocabulary developed at the clinical center Steno Diabetes Center Copenhagen) to clinical narratives spanning a 19-year period. The two matched vocabularies comprise over 20,000 medical terms describing symptoms, other diagnoses, and lifestyle factors. The cohort is genetically homogeneous (Caucasian diabetes patients from Denmark), so the resulting stratification is not driven by ethnic differences but rather by inherently dissimilar progression patterns and lifestyle-related risk factors. Using unsupervised Markov clustering, we defined 71 clusters of at least 50 individuals within the diabetes spectrum. The clusters display both distinct and shared longitudinal glycemic dysregulation patterns, temporal co-occurrences of comorbidities, and associations with single nucleotide polymorphisms in or near genes relevant for diabetes comorbidities.

Keywords: EHR; comorbidities; computational biology; diabetes; diabetes subtypes; epidemiology; genotyping; global health; human; systems biology; text mining.
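
The abstract names unsupervised Markov clustering (MCL) as the stratification method but gives no parameters. As a rough illustration of the general technique only, the following minimal sketch runs MCL (alternating expansion and inflation of a column-stochastic matrix) on a toy graph. The function names, the inflation value of 2.0, the self-loops, and the toy adjacency matrix are all assumptions made for illustration; this is not the authors' pipeline or their patient similarity network.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square matrix product, sufficient for a toy example."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy MCL: returns clusters as sorted lists of node indices.

    `inflation`, `max_iter`, and `tol` are illustrative defaults (assumed).
    """
    n = len(adjacency)
    # Add self-loops, a common MCL preprocessing step (assumed here).
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)                      # expansion: random-walk step
        inflated = normalize_columns(                # inflation: sharpen strong flows
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row with non-negligible mass lists the nodes drawn to that attractor.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph (hypothetical): two disconnected triangles.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(two_triangles))  # → [[0, 1, 2], [3, 4, 5]]
```

A higher inflation value generally yields more, smaller clusters, which is why the choice of this parameter matters for the number of clusters reported.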

Conflict of interest statement

IK, CS, KB, PH, AH, PJ, LJ, CR, MP, RE, HA, TA, JB, NG, KB, OP, FP, TH, RB, PR, SB: No competing interests declared.

Figures

**Figure 1. Comparison of distributions of ICD-10 diagnosis codes with and without text mining.**
(A) Percentage of diagnosis codes belonging to the different ICD-10 chapters and the relative increase in diagnosis codes from the different chapters when combining the text-mined and assigned codes. (B) Age distributions of text-mined and assigned ICD-10 diagnosis codes from the SDCC corpus divided into the 21 ICD-10 chapters.

**Figure 1—figure supplement 2. Physiological and biochemical tests in the SDCC corpus.**
Bar plots of the number of unique individuals who have had each test taken (grey bars) and the number of times each individual has had the test taken (red outline bars).

**Figure 1—figure supplement 3. Linear Discriminant Analysis 1 (LDA).**
Linear Discriminant Analysis (LDA) was performed on the biochemical tests for the 71 clusters with at least 50 individuals. Linear discriminants (LD) 1 and 2 (A) and 1 and 3 (B) are shown using the biochemical test identifiers. Identifiers in blue contribute most to the variance among clusters for LD1, purple identifiers contribute most to the variance from LD2 or LD3, and green identifiers are common across LD1 and LD2 or LD1 and LD3.

**Figure 1—figure supplement 4. Linear Discriminant Analysis 2 (LDA).**
The three identifiers contributing most to the variance in the LDA (NPU04998, NPU18004, and SDCNOTAT_BTSys) were removed, and a new LDA was performed. (A) and (B) display the relationship between LD 1, 2, and 3. Blue identifiers contribute the most to the variance between clusters for LD1, purple identifiers contribute the most to the variance for LD2 and 3, and green identifiers are the ones common among LD1 and 2, or LD2 and LD3.

**Figure 1—figure supplement 5. Distribution of HbA1c measurements for T1D and T2D patients.**
The vertical line corresponds to the HbA1c threshold used when defining dysregulation.

**Figure 1—figure supplement 6. Biochemical patterns for the level of glycemic dysregulation.**
The groups are based on the number of glycemic dysregulation parameters. For each biochemical test, a MANOVA test was performed to detect whether there were any differences in means among the groups (Bonferroni adj. p-value<=0.01). These groups are marked with an asterisk (*). Subsequently, a Kolmogorov-Smirnov test was applied to determine whether the distribution of mean biochemical values for each group was significantly higher or lower than that of the other groups (Bonferroni adj. p-value<=0.01). Blue indicates mean distributions that are significantly higher than the other groups, and red indicates significantly lower distributions. The grey, fainter color indicates that the distribution within this group was not significantly different from the other groups. B = blood, P = plasma, S = serum, U = urine.

**Figure 2. Phenotypic clusters found in the SDCC cohort.**
The clustering was created from the diagnosis vectors of 13,928 patients (those with text in the record), comprising both text-mined and assigned ICD-10 codes. A total of 172 clusters were created, capturing 11,208 patients (80.47%); clusters with five or fewer patients were discarded for statistical reasons. (A) Each node represents a patient within the corpus, colored by association with one of the 172 unique clusters. (B) The 71 clusters with at least 50 patients, colored with the same palette as in (A).

**Figure 2—figure supplement 1. Density of days in contact with SDCC for each cluster.**
Density diagram of the number of days an individual has been connected to Steno Diabetes Center Copenhagen (SDCC), shown per cluster. The black line indicates the mean SDCC connection time for the entire cohort. Some clusters, for example Cluster 1, show two peaks, indicating that there are at least two groups of individuals in the cluster: one connected to SDCC for longer than the cohort average, and one connected for a shorter time.

**Figure 2—figure supplement 2. Distribution of assigned primary diabetes type for each cluster.**

**Figure 2—figure supplement 3. Distribution of age for each cluster.**

**Figure 2—figure supplement 4. Distribution of duration of diabetes for each cluster.**
The diabetes duration distribution for individuals in each cluster. Each bar corresponds to a bin covering a given interval of diabetes duration. The height of each bar is the percentage of individuals in the cluster who fall into that duration bin. Diabetes duration is calculated as the difference in years between diabetes onset and the date of the latest SDCC data entry.

**Figure 2—figure supplement 5. Clustering robustness analysis.**
To assess the robustness of the clustering, various diluted (points in blue) and shuffled (points in red) realizations of the similarity network were used as input for the MCL algorithm, and the resulting clusterings were compared to the reference clustering using the Variation of Information (VI) measure. The two horizontal lines show the value that the VI would take if 10% and 20% of the vertices, respectively, were randomly assigned to different random clusters.
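
The Variation of Information used in this robustness analysis can be sketched as follows. This is a minimal illustration of the standard definition, VI = H(A) + H(B) - 2 I(A; B), computed from two label lists over the same items. The function name, the use of the natural logarithm, and the toy labels are assumptions; the caption does not state which logarithm base or implementation the authors used.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of Information between two clusterings of the same items.

    Each argument is a sequence of cluster labels, one per item, in the
    same item order. Uses the natural logarithm (an assumption).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)                 # cluster sizes in clustering A
    count_b = Counter(labels_b)                 # cluster sizes in clustering B
    joint = Counter(zip(labels_a, labels_b))    # sizes of pairwise overlaps
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Equivalent to H(A) + H(B) - 2 * I(A; B), summed over overlaps.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


# Hypothetical toy labels: the same partition under different label names.
reference = [0, 0, 1, 1]
relabeled = ["x", "x", "y", "y"]
print(variation_of_information(reference, relabeled))
```

A VI of zero means the two clusterings are identical up to relabeling, and larger values mean more disagreement, which is how the perturbed clusterings are compared with the reference.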

**Figure 3. Hierarchical clustering based on enriched comorbid ICD-10 diagnoses.**
The comorbidities present in a minimum of 10 patients and significantly enriched (adj. p-value<=0.05) in each cluster are shown in the pie charts. The number of significant codes ranges from 1 to 10. Each color corresponds to an ICD-10 code chapter as listed in the legend of Figure 1. Six main groups and an outlier (cluster 70) resulted, and the colors of the dendrogram branches indicate to which hierarchical groups the clusters belong. The size of the pie charts represents the average diabetes duration (years with diabetes) divided into six bins. The 21 clusters where at least 50% of the patients have three or more HbA1c severity parameters are marked with a red line surrounding the pie chart.

**Figure 4. Comorbidity patterns within the six symptom groups.**
(A) Comorbidity correlations between the combined symptom groups. (B) Asymmetric comorbidity matrix for observing row diagnosis codes before column diagnoses. First, we calculated Bonferroni-corrected p-values for diagnosis pair directionality; second, we extracted the top 100 unique diagnosis code pairs with the lowest adjusted p-values; and lastly, we calculated a comorbidity score (CS) as the log2 of observing the pair more or less than expected. The heat-map colors reflect the CS quantification. (C) Comorbidity pairs unique to each of the symptom groups. All interactions are observed significantly more (blue) or less (red) than expected (adj. p-value<=0.01). Arrows indicate that the diagnoses are observed in the particular order (Fisher’s exact test with Bonferroni correction, p-value<=0.01). Node size indicates in how many symptom groups the diagnosis code is observed, ranging from one group (the diagnosis is unique to the group, largest nodes) to six groups (all groups have the code, smallest nodes).
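
The quantities named in this caption can be illustrated with a small sketch: a comorbidity score as log2 of observed over expected co-occurrence, a two-sided Fisher's exact test on a 2x2 table, and a Bonferroni correction. The caption does not say how the expected count or the 2x2 tables were built, so the independence-based expectation, the function names, and all numbers below are assumptions for illustration, not the authors' procedure.

```python
from math import comb, log2


def comorbidity_score(n_both, n_a, n_b, n_total):
    """log2(observed / expected) co-occurrence of diagnoses A and B.

    Expected count assumes A and B are independent (an assumption).
    """
    expected = n_a * n_b / n_total
    if n_both == 0:
        return float("-inf")   # pair never observed
    return log2(n_both / expected)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: 40 patients have both codes, 100 have each code,
# in a cohort of 1000, so 10 co-occurrences are expected by chance.
print(comorbidity_score(40, 100, 100, 1000))  # → 2.0
print(fisher_exact_two_sided(3, 1, 1, 3))
print(bonferroni([0.01, 0.2, 0.6]))
```

A positive score means the pair is seen more often than expected and a negative score less often, matching the blue and red coloring described for panel (C).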
