J Clin Epidemiol. 2022 Dec;152:164-175.
doi: 10.1016/j.jclinepi.2022.10.011. Epub 2022 Oct 11.

In simulated data and health records, latent class analysis was the optimum multimorbidity clustering algorithm

Linda Nichols et al. J Clin Epidemiol. 2022 Dec.

Abstract

Background and objectives: To investigate the reproducibility and validity of four algorithms for multimorbidity clustering: latent class analysis (LCA), hierarchical cluster analysis (HCA), multiple correspondence analysis followed by k-means (MCA-kmeans), and k-means (kmeans).

Methods: We first investigated clustering algorithms in simulated datasets with 26 diseases of varying prevalence in predetermined clusters, comparing the derived clusters to known clusters using the adjusted Rand Index (aRI). We then investigated the medical records of male patients, aged 65 to 84 years, from 50 UK general practices, with 49 long-term health conditions. We compared within-cluster morbidity profiles using the Pearson correlation coefficient and assessed cluster stability in 400 bootstrap samples.
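The comparison of derived clusters to known clusters can be illustrated with a minimal sketch of the adjusted Rand Index in its standard pair-counting form. The function name `adjusted_rand_index` and the toy label lists are assumptions made for illustration; this is not the authors' code, and the abstract does not state which software they used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Hypothetical helper for illustration: it counts pairs of items that
    are grouped together in each partition and corrects for chance.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions trivially agree.
        return 1.0

    # Contingency counts: items falling in each (known, derived) cluster pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2         # agreement if partitions matched
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example: cluster labels are arbitrary, so a pure relabelling scores 1.0.
known = [0, 0, 0, 1, 1, 1]
derived = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_index(known, derived))
```

An aRI of 1 means the two partitions are identical up to relabelling, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance.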

Results: In the simulated datasets, the closest agreement (largest aRI) to known clusters was with LCA and then MCA-kmeans algorithms. In the medical records dataset, all four algorithms identified one cluster of 20-25% of the dataset with about 82% of the same patients across all four algorithms. LCA and MCA-kmeans both found a second cluster of 7% of the dataset. Other clusters were found by only one algorithm. LCA and MCA-kmeans clustering gave the most similar partitioning (aRI 0.54).

Conclusion: LCA achieved higher aRI than other clustering algorithms.

Keywords: Clustering methods; Electronic medical records; Hierarchical cluster analysis; K-means; Latent class analysis; Multimorbidity; Multiple correspondence analysis.

Conflict of interest statement

Declaration of interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Tom Marshall reports financial support was provided by UKRI - research grant from NIHR-MRC for BIRMCAM study. Jessica Barrett reports financial support was provided by Medical Research Council (Biostatistics Unit). Sylvia Richardson reports financial support was provided by Medical Research Council (Biostatistics Unit). Paul Kirk reports financial support was provided by Medical Research Council (Biostatistics Unit). Linda Nichols reports financial support was provided by UKRI - research grant from NIHR-MRC for BIRMCAM study. Tom Taverner reports financial support was provided by UKRI - research grant from NIHR-MRC for BIRMCAM study. Krishnarajah Nirantharakumar reports financial support was provided by Health Data Research UK - fellowship. Paul Kirk reports a relationship with Director, Health Data Science, AstraZeneca that includes: employment.

Figures

Fig. 1
Simulated dataset of patients with two or more conditions in 3 clusters, within cluster disease prevalence approximately 15%, noise approximately 0.5%, overlap of diseases between clusters: examining the effect of varying correlation of diseases within a cluster. Error bars show interquartile range (IQR).
Fig. 2
Simulated dataset of patients with two or more conditions in 3 clusters, within cluster disease prevalence approximately 15%, correlation = 0.5, overlap of diseases between clusters: examining the effect of varying the amount of noise.
Fig. 3
Simulated dataset of patients with two or more conditions in 3 clusters, noise approximately 4%, correlation = 0.5, overlap of diseases between clusters: examining the effect of varying within cluster prevalence of disease.
Fig. 4
Simulated dataset of patients with two or more conditions in 4 clusters, within cluster disease prevalence approximately 24%, noise approximately 0.5%, correlation = 0.5, overlap of diseases between clusters: examining the effect of varying the number of clusters the algorithm is asked to find.
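
The simulated datasets described in the figure captions can be pictured with a deliberately simplified sketch. It assumes diseases are split into non-overlapping blocks, one per cluster, and that each disease is drawn independently, so it does not reproduce the within-cluster correlation or the overlap of diseases between clusters that the captions mention. The function `simulate_patients` and all of its parameter names are hypothetical, and the authors' actual simulation procedure is not given in the abstract.

```python
import random


def simulate_patients(n_patients, n_diseases=26, n_clusters=3,
                      within_prevalence=0.15, noise=0.005, seed=0):
    """Draw binary disease records with a known cluster label per patient.

    Simplifying assumptions (not from the source): diseases are assigned
    to clusters in contiguous, non-overlapping blocks, and each disease
    is sampled independently of the others.
    """
    rng = random.Random(seed)
    # Map each disease index to the cluster whose patients tend to have it.
    disease_cluster = [d * n_clusters // n_diseases for d in range(n_diseases)]

    patients, labels = [], []
    while len(patients) < n_patients:
        k = rng.randrange(n_clusters)  # the patient's true cluster
        record = [
            int(rng.random() < (within_prevalence
                                if disease_cluster[d] == k else noise))
            for d in range(n_diseases)
        ]
        # Keep only patients with two or more conditions, as in the captions.
        if sum(record) >= 2:
            patients.append(record)
            labels.append(k)
    return patients, labels


patients, labels = simulate_patients(1000)
print(len(patients), len(patients[0]), sorted(set(labels)))
```

The returned `labels` play the role of the known clusters against which a clustering algorithm's output would be scored.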
