Nat Commun. 2025 Mar 14;16(1):2534.
doi: 10.1038/s41467-025-56625-z.

Deep representation learning for clustering longitudinal survival data from electronic health records

Jiajun Qiu et al. Nat Commun. 2025.

Abstract

Precision medicine requires accurate identification of clinically relevant patient subgroups. Electronic health records provide major opportunities for leveraging machine learning approaches to uncover novel patient subgroups. However, many existing approaches fail to adequately capture complex interactions between diagnosis trajectories and disease-relevant risk events, leading to subgroups that can still display great heterogeneity in event risk and underlying molecular mechanisms. To address this challenge, we implemented VaDeSC-EHR, a transformer-based variational autoencoder for clustering longitudinal survival data as extracted from electronic health records. We show that VaDeSC-EHR outperforms baseline methods on both synthetic and real-world benchmark datasets with known ground-truth cluster labels. In an application to Crohn's disease, VaDeSC-EHR successfully identifies four distinct subgroups with divergent diagnosis trajectories and risk profiles, revealing clinically and genetically relevant factors in Crohn's disease. Our results show that VaDeSC-EHR can be a powerful tool for discovering novel patient subgroups in the development of precision medicine approaches.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Architecture of VaDeSC-EHR.
First, an embedding is computed for a patient’s diagnosis sequence. This embedding serves as input for multiple transformer blocks, where the final pooling layer generates the latent representation Z for the patient. Z is regularized toward a Gaussian mixture distribution by including a variational term in the loss function. Z is then used to predict the time-to-event, as well as passed to the transformer decoder for reconstructing the patient’s diagnosis sequence. Weights are shared between the encoder and decoder. For more details, please refer to the “Methods” section.
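The regularization of Z toward a Gaussian mixture implies that each latent vector can be softly assigned to a cluster. The following is a minimal sketch of that assignment step only, assuming diagonal covariances and hand-picked toy parameters. The function names and values are hypothetical and are not taken from the authors' implementation.

```python
import math

def log_gaussian_diag(z, mean, var):
    """Log-density of z under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )

def cluster_responsibilities(z, weights, means, variances):
    """Posterior probability of each mixture component given latent z."""
    log_joint = [
        math.log(w) + log_gaussian_diag(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the normalization numerically stable.
    top = max(log_joint)
    log_norm = top + math.log(sum(math.exp(l - top) for l in log_joint))
    return [math.exp(l - log_norm) for l in log_joint]

# Toy 2-D latent space with two hypothetical clusters (assumed values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
probs = cluster_responsibilities([0.2, -0.1], weights, means, variances)
```

In this toy setup the latent point lies close to the first component, so nearly all of the probability mass falls on it. A point midway between the two means would be split evenly.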
Fig. 2. The embedding structure.
Simulated example of a diagnosis sequence for a single individual with 8 diagnoses spread across 5 primary or hospital care visits (0–4) embedded at three levels of the ICD-10 ontology. Diag.: Diagnosis.
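To make the idea of a multi-level embedding input concrete, here is a rough sketch that tags each diagnosis with its visit index and three coarse-to-fine views of its ICD-10 code. The caption does not state which three ontology levels are used, so the choice here (leading letter, three-character category, full code) is an assumption, as are the function names and the example codes.

```python
def icd10_levels(code):
    """Split an ICD-10 code into three coarse-to-fine levels (illustrative only)."""
    code = code.strip().upper()
    leading_letter = code[0]   # e.g. "K" (assumed coarsest level)
    category = code[:3]        # e.g. "K50"
    return (leading_letter, category, code)

def encode_visits(visits):
    """Attach a visit index and the three levels to every diagnosis."""
    tokens = []
    for visit_idx, codes in enumerate(visits):
        for code in codes:
            tokens.append({"visit": visit_idx, "levels": icd10_levels(code)})
    return tokens

# Hypothetical patient: 8 diagnoses spread across 5 visits (0-4).
visits = [
    ["K50.0"],
    ["K50.1", "F17.2"],
    ["K50.1"],
    ["K56.6", "R10.4"],
    ["K50.8", "K56.6"],
]
tokens = encode_visits(visits)
```

In a real model each level and the visit index would be mapped to learned embedding vectors. This sketch stops at the token structure.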
Fig. 3. Generalization performance on synthetic benchmark data.
Comparison between VaDeSC-EHR and the other methods used for clustering longitudinal survival data. VaDeSC-EHR_nosurv denotes VaDeSC-EHR trained without the risk loss. VaDeSC-EHR_relage denotes VaDeSC-EHR trained with age taken relative to the age at the start of the diagnosis sequence. The analyses are based on nested 5-fold cross-validation (n = 5). a Performance on retrieving the ground-truth clustering, in terms of the area under the receiver operating characteristic (ROC) curve, with p-values for the significance of the difference between VaDeSC-EHR and the other methods. Significance is assessed using a permutation test combined with 10,000 bootstrap iterations. b Performance on retrieving the ground-truth clustering in terms of balanced accuracy (ACC), with 0.5 for random performance. c Performance on time-to-event prediction, in terms of concordance index (CI), with 0.5 for random performance. Data are presented as mean values ± standard deviation.
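The two metrics in panels b and c can be illustrated with a short sketch. It assumes that predicted cluster labels have already been aligned with the ground-truth labels and uses Harrell's pairwise definition of the concordance index. The function names are hypothetical, and the paper's own evaluation code may differ.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 is chance level for two classes."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if i had the event strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable

acc = balanced_accuracy([0, 0, 1, 1], [0, 0, 1, 0])
ci_good = concordance_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0])
ci_bad = concordance_index([1, 2, 3], [1, 1, 0], [1.0, 2.0, 3.0])
```

Here a higher risk score is taken to mean an earlier expected event, so the first ranking is perfectly concordant and the reversed one is perfectly discordant.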
Fig. 4. Clustering CD patients from UK Biobank in their progression toward intestinal obstruction.
a UMAP (Uniform Manifold Approximation and Projection) projection of the latent representations of the CD patients, coloring patients by cluster (silhouette coefficient: 0.783). b Cluster-specific Kaplan–Meier curves with 95% confidence intervals. Lines denote mean values and shaded regions are 95% confidence intervals.
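For readers unfamiliar with Kaplan–Meier curves, the product-limit estimate behind each cluster-specific curve can be sketched as follows. This is a minimal illustration on made-up data; it omits the confidence intervals shown in the figure, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all subjects sharing the same time point.
        while i < len(data) and data[i][0] == t:
            n_events += int(data[i][1])
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set.
        at_risk -= n_removed
    return curve

# Toy data: event indicator 1 = event observed, 0 = censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Running the same function separately on the patients of each cluster would give one curve per cluster, which is what panel b compares.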
Fig. 5. Association of CD clusters with smoking behavior.
The analyses are based on 1908 CD patients (n = 1908). a Association of ever having smoked (UK Biobank data field: 20160) with risk of intestinal obstruction (p-values: Overall: 0.45, Cluster 1: 1.06 × 10⁻⁷, Cluster 2: 1.23 × 10⁻²⁹, Cluster 3: 0.109, Cluster 4: 0.200). b Association of nicotine dependence (ICD-10 code: F17.2) with risk of intestinal obstruction (p-values: Overall: 0.26, Cluster 1: 0.001, Cluster 2: 4.23 × 10⁻⁹, Cluster 3: 0.09, Cluster 4: 0.99). Data are presented as mean values ± 95% confidence intervals, estimated using two-sided Cox proportional hazards regression models. An asterisk marks a p-value < 0.05 (multivariate Cox regression). c Percentage of patients who ever smoked (p-value: 1.86 × 10⁻⁵). d Percentage of patients with nicotine dependence (ICD-10 code: F17.2) (p-value: 0.001). Significance is assessed using multinomial logistic regression with a two-sided log-likelihood ratio test.
Fig. 6. Association of CD clusters with genetics.
The analyses are based on 1908 CD patients (n = 1908). a Pathway polygenic risk scores (pathway PRS) of the adaptive immune response pathway, comparing clusters 2 and 3 with clusters 1 and 4 (left), and cluster 3 with the other three clusters (right). The bounds of the box are the lower quartile (25th percentile) and the upper quartile (75th percentile). The whiskers extend from the box to the data points that fall within 1.5 times the interquartile range (IQR) of the lower and upper quartiles. Any data point outside this range is considered an outlier and plotted separately. Significance is assessed using logistic regression with a two-sided log-likelihood ratio test, and multiple testing is corrected using the Benjamini–Hochberg procedure. b Enrichment of individual genetic variants used in the pathway PRS in clusters 1 and 4 relative to clusters 2 and 3, with the horizontal line indicating the FDR-adjusted significance level (p-value: 0.05). The variant highlighted in red is rs2523608. c Enrichment of individual genetic variants used in the pathway PRS in clusters 1, 2, and 4 compared to cluster 3, with the horizontal line indicating the FDR-adjusted significance level (p-value: 0.05). The variant highlighted in red is rs2523608. Significance is assessed using a two-sided Chi-square test and multiple testing is corrected using the Benjamini–Hochberg procedure.
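The Benjamini–Hochberg correction mentioned in this caption can be sketched in a few lines. This is a generic illustration of the standard procedure on made-up p-values, not the authors' analysis code, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy per-variant p-values (assumed for illustration).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adj]
```

A variant would then be called significant when its adjusted p-value falls at or below the 0.05 threshold drawn as the horizontal line in panels b and c.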

