BMC Med Inform Decis Mak. 2021 Dec 8;21(1):343.
doi: 10.1186/s12911-021-01693-6.

Identifying and evaluating clinical subtypes of Alzheimer's disease in care electronic health records using unsupervised machine learning

Nonie Alexander et al. BMC Med Inform Decis Mak. 2021.

Abstract

Background: Alzheimer's disease (AD) is a highly heterogeneous disease with diverse trajectories and outcomes observed in clinical populations. Understanding this heterogeneity can enable better treatment, prognosis and disease management. Studies to date have mainly used imaging or cognition data and have been limited in terms of data breadth and sample size. Here we examine the clinical heterogeneity of Alzheimer's disease patients using electronic health records (EHR) to identify and characterise disease subgroups using multiple clustering methods, identifying clusters which are clinically actionable.

Methods: We identified AD patients in primary care EHR from the Clinical Practice Research Datalink (CPRD) using a previously validated rule-based phenotyping algorithm. We extracted and included a range of comorbidities, symptoms and demographic features as patient features. We evaluated four different clustering methods (k-means, kernel k-means, affinity propagation and latent class analysis) to cluster Alzheimer's disease patients. We compared clusters on clinically relevant outcomes and evaluated each method using measures of cluster structure, stability, efficiency of outcome prediction and replicability in external data sets.

Results: We identified 7,913 AD patients with a mean age of 82 years, 66.2% of whom were female. We included 21 features in our analysis. K-means, kernel k-means, affinity propagation and latent class analysis produced 5, 2, 5 and 6 clusters, respectively. K-means produced the most consistent results across the four evaluative measures. One cluster appeared consistently in three of the four methods: predominantly female patients with younger disease onset (43% between ages 42-73) and diagnoses of depression and anxiety, whose disease progressed more quickly than the average across the other clusters.

Conclusion: Each clustering approach produced substantially different clusters, and k-means performed best of the four methods on the four evaluative criteria. However, the consistent appearance of one particular cluster across three of the four methods suggests that a distinct disease subtype may be present and merits further exploration. Our study underlines the variability of the results obtained from different clustering approaches and the importance of systematically evaluating different approaches for identifying disease subtypes in complex EHR.

Keywords: Alzheimer's disease; Clustering; EHR; K-means; Subtyping.
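
Cluster structure is one of the evaluation measures named in the Methods, and Fig. 3 reports silhouette plots for each method. The following is a minimal sketch of how a per-sample silhouette coefficient can be computed, not the authors' implementation. The Hamming distance on binary features, the function names and the toy patient vectors are all assumptions made for illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def silhouette_scores(points, labels, dist=hamming):
    """Return the silhouette coefficient s(i) for every sample.

    s(i) = (b - a) / max(a, b), where a is the mean distance to the other
    members of the sample's own cluster and b is the smallest mean distance
    to the members of any other cluster. Singleton clusters get s(i) = 0.
    """
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical toy data: rows are patients, columns are binary features
# (for example, presence or absence of a comorbidity).
patients = [
    (1, 1, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
    (0, 0, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
labels = [0, 0, 0, 1, 1, 1]

scores = silhouette_scores(patients, labels)
mean_score = sum(scores) / len(scores)
print(f"mean silhouette: {mean_score:.3f}")
```

Values near 1 indicate samples that sit well inside their cluster, values near 0 indicate samples on a boundary, and negative values indicate samples that are closer on average to another cluster.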


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Reproducibility validation flow diagram showing how the AD cohort and UD cohort are used to validate the original AD clustering in different datasets: (1) split the AD cohort into training and test sets, (2) cluster patients in the training set using a clustering method, (3) split the training set into decision tree training and cross-validation subsets, then train a decision tree, (4) label the test sets with the trained decision tree (gold standard labels), (5) repeat the clustering method, (6) find the % discordance between decision tree labels and cluster labels to quantify reproducibility. AD Alzheimer's disease, UD unspecified dementia
Fig. 2
Outcomes of k-means clustering by cluster: A number of appointments per year post diagnosis with 95% confidence intervals, B number of missed appointments per year post diagnosis with 95% confidence intervals, C progression rate based on decline in MMSE score per year with 95% confidence intervals, D time from onset of AD until AChIs are no longer prescribed, with 95% confidence intervals, E Kaplan–Meier curve from diagnosis to death with log rank error bars, F Kaplan–Meier curve for time until the patient moves into assisted living with log rank error bars. AD Alzheimer's disease
Fig. 3
Silhouette plots of all samples resulting from: A k-means, B kernel k-means, C affinity propagation, D LCA. The dotted line represents the average silhouette score across all methods
Fig. 4
Alluvial plots showing how patients transition between clusters across clustering methods. A The colour represents cluster membership from k-means. B Highlights the anxiety and depression cluster for k-means, affinity propagation and LCA. HT hypertension, HL hearing loss, AP affinity propagation
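
Step (6) of the reproducibility flow in Fig. 1 compares decision tree labels with re-clustered labels. Because cluster IDs are arbitrary, the two label sets have to be aligned before disagreement can be counted. The source does not say how this alignment was done; the sketch below assumes an exhaustive search over one-to-one relabellings, which is only practical for a handful of clusters. The function name and example labels are hypothetical.

```python
from itertools import permutations


def percent_discordance(reference, candidate):
    """Percentage of samples whose labels disagree after the best relabelling.

    Every one-to-one mapping from candidate cluster IDs to reference cluster
    IDs is tried, and the mapping with the most agreements is kept.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("label lists must be non-empty and equally long")

    ref_ids = sorted(set(reference))
    cand_ids = sorted(set(candidate))
    # Pad with placeholders so a one-to-one mapping exists even when the
    # candidate clustering has more clusters than the reference.
    size = max(len(ref_ids), len(cand_ids))
    targets = ref_ids + [None] * (size - len(ref_ids))

    best_agreement = 0
    for perm in permutations(targets, len(cand_ids)):
        mapping = dict(zip(cand_ids, perm))
        agreement = sum(mapping[c] == r for r, c in zip(reference, candidate))
        best_agreement = max(best_agreement, agreement)

    return 100.0 * (len(reference) - best_agreement) / len(reference)


# Hypothetical labels for eight test-set patients.
tree_labels = [0, 0, 1, 1, 2, 2, 2, 0]       # from the trained decision tree
recluster_labels = [2, 2, 0, 0, 1, 1, 0, 2]  # from repeating the clustering

print(percent_discordance(tree_labels, recluster_labels))  # 12.5
```

A lower percentage means the clustering was reproduced more closely on the held-out data. For larger numbers of clusters, an assignment solver would be a more scalable replacement for the exhaustive search.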
