Sci Rep. 2024 Jan 10;14(1):1045. doi: 10.1038/s41598-024-51699-z.

Deep embedded clustering generalisability and adaptation for integrating mixed datatypes: two critical care cohorts

Jip W T M de Kok et al.

Abstract

We validated a Deep Embedded Clustering (DEC) model and its adaptation for integrating mixed datatypes (in this study, numerical and categorical variables). DEC is a promising technique capable of managing extensive sets of variables and non-linear relationships. Nevertheless, DEC cannot adequately handle mixed datatypes. Therefore, we adapted DEC by replacing the autoencoder with an X-shaped variational autoencoder (XVAE) and optimising hyperparameters for cluster stability. We call this model "X-DEC". We compared DEC and X-DEC by reproducing a previous study that used DEC to identify clusters in a population of intensive care patients. We assessed internal validity based on cluster stability on the development dataset. Since the generalisability of clustering models has been insufficiently validated on external populations, we assessed external validity by investigating how well the clusters generalised to an external validation dataset. We concluded that both DEC and X-DEC resulted in clinically recognisable and generalisable clusters, but X-DEC produced much more stable clusters.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Diagram of the autoencoder used in the original DEC model. Each column of circles is a layer, and each circle is a neuron: a mathematical combination of the neurons in the previous layer, except in the input layer, where the circles represent the original input variables. The solid lines indicate that every neuron can be used to compute every neuron in the subsequent layer. The dotted vertical lines indicate that some neurons are omitted to simplify the illustration. The numbers index the neurons within each layer. The autoencoder consists of two parts: the encoder (green box), which maps the input data onto a smaller latent feature space used for clustering, and the decoder (red box), which reconstructs the input variables from the latent features. The encoder contains one hidden layer of 64 neurons, each a different combination of the 80 input variables, followed by an encoding layer that condenses the 64 neurons into eight neurons (i.e., the latent features). The decoder contains one hidden layer and an output layer that attempts to reconstruct the input variables. Initially, all connections between the neurons are random, so the output variables will not resemble the input variables well. The autoencoder is trained by adjusting the weights of the connections between neurons (i.e., neuron weights) so that the output variables become as similar as possible to the input variables, as quantified by the mean squared error. It should be noted that the number of layers and the number of neurons in each layer can result in different mappings and thus influence the clustering results.
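The architecture described in this caption is straightforward to express in code. Below is a minimal sketch assuming Keras (the paper's exact framework, activations, and training settings are not given here), using only the layer sizes stated in the caption: 80 inputs, a 64-neuron hidden layer, an 8-neuron encoding layer, and a mirrored decoder trained on the mean squared error.

```python
# Hedged sketch of the caption's autoencoder (80 -> 64 -> 8 -> 64 -> 80);
# activations and optimiser are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_inputs, n_hidden, n_latent = 80, 64, 8      # sizes from the figure caption

inputs = keras.Input(shape=(n_inputs,))
h = layers.Dense(n_hidden, activation="relu")(inputs)   # hidden layer of 64 neurons
latent = layers.Dense(n_latent, name="latent")(h)       # encoding layer (latent features)
h_dec = layers.Dense(n_hidden, activation="relu")(latent)
outputs = layers.Dense(n_inputs)(h_dec)                 # reconstruction of the 80 inputs

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, latent)                   # used later for clustering

# Training adjusts the neuron weights so the reconstruction is as similar as
# possible to the input, quantified by the mean squared error.
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, n_inputs).astype("float32")    # placeholder data
autoencoder.fit(X, X, epochs=10, batch_size=64, verbose=0)
Z = encoder.predict(X)                                  # 8 latent features per sample
```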
Figure 2
Architecture of the deep embedded clustering algorithm. The autoencoder is initialised on the input dataset and maps the original variables into the latent features (step 1). K-means clustering is performed on the latent features (step 2). Then, six soft labels are computed for each patient sample, and the target distribution is calculated, maximising the separation between high and low soft labels (step 3). Subsequently, the encoder of the autoencoder is optimised to minimise the Kullback–Leibler divergence loss between the soft labels and the target distribution over 140 iterations (step 4). If at least 1% of all patient samples change cluster membership, the soft labels and target distribution are recomputed and the optimisation of the encoder continues (step 5); otherwise, clustering is finalised (step 6).
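Steps 2–4 can be summarised in a few lines. This is a hedged sketch following the standard DEC formulation (a Student's t kernel for the soft labels and a squared-and-renormalised target distribution); it assumes Python with numpy and scikit-learn, and a latent matrix `Z` produced by the encoder, as in the sketch after Figure 1.

```python
# Sketch of DEC's soft labels and target distribution (steps 2-4).
import numpy as np
from sklearn.cluster import KMeans

def soft_labels(Z, centroids, alpha=1.0):
    """Student's t kernel: similarity of each latent sample to each centroid."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpen q, separating high and low soft labels, as in step 3."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

Z = np.random.rand(500, 8)                    # latent features from the encoder
km = KMeans(n_clusters=6, n_init=10).fit(Z)   # step 2: k-means, six clusters
q = soft_labels(Z, km.cluster_centers_)       # step 3: six soft labels per sample
p = target_distribution(q)
kl = (p * np.log(p / q)).sum()                # step 4: KL divergence to minimise
```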
Figure 3
The X-shaped variational autoencoder architecture. Each column of nodes is a layer, and the circles it contains are neurons: non-linear mathematical combinations of the neurons in the previous layer. The solid lines show that every neuron can be used to compute every neuron in the next layer. The dotted vertical lines indicate that some neurons are omitted to simplify the illustration. The numbers index the neurons within each layer. The X-shaped variational autoencoder (XVAE) consists of two main parts: the encoder (green box), which maps the input data into the smaller latent feature space, and the decoder (red box), which reconstructs the original variables from the latent features. The input data consist of two separate input sets: input S1, containing all numerical variables (blue box), and input S2, containing all categorical variables (orange box). Each input set is first fed into its own hidden layer. The resulting two hidden layers are then combined into another hidden layer that feeds into the encoding layer (green), which generates the latent features on which the clustering is performed. The encoding layer uses stochastic inference to approximate the latent features as probability distributions, in this case Gaussian; it is therefore separated into the mean and standard deviation of those distributions (not visualised). Next, the decoder starts where the encoding layer feeds into a hidden layer, which then splits into two separate hidden layers, each feeding into its own output layer to reconstruct the original variables. Finally, the reconstruction loss is determined by computing the mean squared error of the numerical variables and the cross-entropy of the categorical variables, both scaled by the number of variables in the input data.
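A minimal sketch of this X-shaped structure, again assuming Keras, may help. The branch sizes and activations below are illustrative assumptions; only the two-branch layout, the Gaussian encoding layer, and the two reconstruction losses scaled per variable follow the caption.

```python
# Hedged XVAE sketch: two input branches, a variational encoding layer, and
# two reconstruction heads; sizes are assumptions, not the paper's values.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

n_num, n_cat, n_latent = 60, 20, 8  # illustrative variable counts

class Sampling(layers.Layer):
    """Reparameterisation trick: draw z from N(mean, sigma^2)."""
    def call(self, inputs):
        mean, log_var = inputs
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps

class KLDivergence(layers.Layer):
    """Adds the VAE's KL regularisation term to the model loss."""
    def call(self, inputs):
        mean, log_var = inputs
        self.add_loss(-0.5 * tf.reduce_mean(
            1 + log_var - tf.square(mean) - tf.exp(log_var)))
        return inputs

x_num = keras.Input(shape=(n_num,), name="S1_numerical")
x_cat = keras.Input(shape=(n_cat,), name="S2_categorical")

# Each input set is first fed into its own hidden layer, then combined.
h = layers.Concatenate()([
    layers.Dense(32, activation="relu")(x_num),
    layers.Dense(16, activation="relu")(x_cat),
])
h = layers.Dense(32, activation="relu")(h)

# Encoding layer: mean and log-variance of the Gaussian latent features.
z_mean = layers.Dense(n_latent)(h)
z_log_var = layers.Dense(n_latent)(h)
z_mean, z_log_var = KLDivergence()([z_mean, z_log_var])
z = Sampling()([z_mean, z_log_var])

# Decoder: shared hidden layer, then one reconstruction head per input set.
hd = layers.Dense(32, activation="relu")(z)
out_num = layers.Dense(n_num, name="recon_num")(
    layers.Dense(32, activation="relu")(hd))
out_cat = layers.Dense(n_cat, activation="sigmoid", name="recon_cat")(
    layers.Dense(16, activation="relu")(hd))

xvae = keras.Model([x_num, x_cat], [out_num, out_cat])

# MSE for the numerical and cross-entropy for the categorical reconstruction;
# Keras averages each over its variables, i.e. scales by the variable count.
xvae.compile(optimizer="adam",
             loss={"recon_num": "mse", "recon_cat": "binary_crossentropy"})
```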
Figure 4
Heatmap of the outcomes and admission diagnoses per cluster from the recreated DEC model. The colour scale, depicted on the right-hand side of the figure, indicates what fraction of a given cluster belongs to the class specified on the y-axis. The in-cell numbers for ICU length of stay correspond to the mean per cluster in days. The values for ICU mortality and required vasoactive medication depict the fractions of non-survivors and of patients who required vasoactive medication, respectively. Values for diagnoses indicate total counts per cluster. The bar at the bottom shows the total number of patient samples per cluster. The bar at the right indicates the mean (for outcomes) and total number of patient samples (for admission diagnoses) per class across all clusters.
Figure 5
Stability plots of the recreated DEC model on the SICS dataset. (A) A box-and-whisker plot of the Jaccard similarity coefficients per cluster. (B) A bar plot of the sample-wise stability; the y-axis indicates the number of samples in each bar, and the x-axis indicates stability in terms of how often the samples were clustered into their reference cluster.
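To make the two panels concrete: panel A summarises, per cluster, the Jaccard similarity between a reference cluster and its best-matching cluster across repeated model runs, while panel B counts how often each sample returned to its reference cluster. Below is a hedged sketch of both computations; it assumes the cluster labels from repeated runs are already available as numpy arrays (in the paper they would come from retraining the model), and the cluster-matching rule is an assumption.

```python
# Sketch of per-cluster Jaccard stability (panel A) and sample-wise
# stability (panel B) given a reference clustering and repeated runs.
import numpy as np

def jaccard(a, b):
    """Jaccard similarity of two sample-index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def stability(reference, runs):
    """reference: reference labels; runs: list of label arrays, one per run."""
    n_clusters = reference.max() + 1
    per_cluster = np.zeros((len(runs), n_clusters))
    matched = np.zeros((len(runs), len(reference)), dtype=bool)
    for r, labels in enumerate(runs):
        for c in range(n_clusters):
            ref_idx = np.where(reference == c)[0]
            # Match each reference cluster to its most similar cluster this run.
            best = max(range(labels.max() + 1),
                       key=lambda k: jaccard(ref_idx, np.where(labels == k)[0]))
            best_idx = np.where(labels == best)[0]
            per_cluster[r, c] = jaccard(ref_idx, best_idx)
            matched[r, np.intersect1d(ref_idx, best_idx)] = True
    # per_cluster feeds panel A; matched.mean(axis=0) feeds panel B.
    return per_cluster, matched.mean(axis=0)

# Example: a reference clustering and three repeated runs on 10 samples.
ref = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
runs = [np.random.permutation(ref) for _ in range(3)]
per_cluster, sample_stability = stability(ref, runs)
```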
Figure 6
Colour map of generalisability cluster mappings between MUMC+ and SICS clusters from the recreated DEC model, based on the input variables, outcome variables, and the latent feature space. Each row corresponds to an MUMC+ cluster, each column to the variables used for mapping, and each cell's colour and number to the SICS cluster it was mapped to, as indicated by the legend on the right.
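One simple way to realise such a mapping is to assign each MUMC+ cluster to the SICS cluster with the nearest centroid, computed separately in each representation (input variables, outcome variables, latent features). This is a sketch under assumptions, not necessarily the paper's exact procedure; the Euclidean distance in particular is an assumption.

```python
# Hedged sketch: map each external (MUMC+) cluster to the nearest
# development (SICS) cluster by centroid distance in a given representation.
import numpy as np

def map_clusters(X_ext, labels_ext, X_dev, labels_dev):
    """Return {external cluster: nearest development cluster}."""
    dev_centroids = {c: X_dev[labels_dev == c].mean(axis=0)
                     for c in np.unique(labels_dev)}
    mapping = {}
    for c in np.unique(labels_ext):
        centroid = X_ext[labels_ext == c].mean(axis=0)
        mapping[c] = min(dev_centroids,
                         key=lambda k: np.linalg.norm(centroid - dev_centroids[k]))
    return mapping

# Example with random data in a shared 8-dimensional representation.
X_dev, X_ext = np.random.rand(200, 8), np.random.rand(100, 8)
l_dev, l_ext = np.random.randint(0, 6, 200), np.random.randint(0, 6, 100)
print(map_clusters(X_ext, l_ext, X_dev, l_dev))
```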
Figure 7
Heatmap of the outcomes and admission diagnoses per cluster from the adjusted X-DEC model. The colour scale, depicted on the right-hand side of the figure, indicates what fraction of a given cluster belongs to the class specified on the y-axis. The in-cell numbers for ICU length of stay correspond to the mean per cluster in days. The values for ICU mortality and required vasoactive medication depict the fractions of non-survivors and of patients who required vasoactive medication, respectively. Values for diagnoses indicate total counts per cluster. The bar at the bottom shows the total number of patient samples per cluster. The bar at the right indicates the mean (for outcomes) and the total number of patient samples (for admission diagnoses) per class across all clusters.
Figure 8
Stability plots of the adjusted X-DEC model on the SICS dataset. (A) A box-and-whisker plot of the Jaccard similarity coefficients per cluster. (B) A bar plot of the sample-wise stability; the y-axis indicates the number of samples in each bar, and the x-axis indicates stability in terms of how often the samples were clustered into their reference cluster.
Figure 9
Colour map of cluster mappings between MUMC+ and SICS clusters from the X-DEC model, based on the input variables, outcome variables, and the latent feature space. Each row corresponds to an MUMC+ cluster, each column to the variables used for mapping, and each cell's colour and number to the SICS cluster it was mapped to, as indicated by the legend on the right.
