Nat Commun. 2025 Apr 6;16(1):3278.

doi: 10.1038/s41467-025-58314-3.

LEOPARD: missing view completion for multi-timepoint omics data via representation disentanglement and temporal knowledge transfer

Siyu Han¹,²,³, Shixiang Yu¹,²,³, Mengya Shi¹,²,³, Makoto Harada¹,³, Jianhong Ge¹,²,³, Jiesheng Lin⁴,⁵, Cornelia Prehn⁶, Agnese Petrera⁶, Ying Li⁷, Flora Sam⁸,⁹, Giuseppe Matullo¹⁰, Jerzy Adamski¹¹,¹²,¹³, Karsten Suhre¹⁴,¹⁵, Christian Gieger⁴,¹⁶, Stefanie M Hauck⁶, Christian Herder¹⁷,¹⁸,¹⁹, Michael Roden¹⁷,¹⁸,¹⁹, Francesco Paolo Casale²⁰,²¹,²², Na Cai²,²¹, Annette Peters³,⁴,⁵,²³, Rui Wang-Sattler²⁴,²⁵,²⁶

Affiliations

¹ Institute of Translational Genomics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
² TUM School of Medicine and Health, Technical University of Munich, Munich, Germany.
³ German Center for Diabetes Research (DZD), Partner Neuherberg, Neuherberg, Germany.
⁴ Institute of Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
⁵ Institute for Medical Information Processing, Biometry, and Epidemiology (IBE), Faculty of Medicine, Ludwig-Maximilians-Universität München, Pettenkofer School of Public Health, Munich, Germany.
⁶ Metabolomics and Proteomics Core, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
⁷ College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China.
⁸ Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN, USA.
⁹ Whitaker Cardiovascular Institute, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
¹⁰ Genomics Variation, Population Medicine and Complex Diseases Unit, Turin University, Turin, Italy.
¹¹ Institute of Experimental Genetics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
¹² Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
¹³ Institute of Biochemistry, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia.
¹⁴ Bioinformatics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar.
¹⁵ Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY, USA.
¹⁶ Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
¹⁷ Institute for Clinical Diabetology, German Diabetes Center, Leibniz Center for Diabetes Research at Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany.
¹⁸ German Center for Diabetes Research (DZD), Partner Düsseldorf, Neuherberg, Germany.
¹⁹ Department of Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany.
²⁰ Institute of AI for Health, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
²¹ Helmholtz Pioneer Campus, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
²² School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
²³ Munich Heart Alliance, German Center for Cardiovascular Health (DZHK E.V., Partner-Site Munich), Munich, Germany.
²⁴ Institute of Translational Genomics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany. rui.wang-sattler@helmholtz-munich.de.
²⁵ German Center for Diabetes Research (DZD), Partner Neuherberg, Neuherberg, Germany. rui.wang-sattler@helmholtz-munich.de.
²⁶ Institute for Medical Information Processing, Biometry, and Epidemiology (IBE), Faculty of Medicine, Ludwig-Maximilians-Universität München, Pettenkofer School of Public Health, Munich, Germany. rui.wang-sattler@helmholtz-munich.de.

PMID: 40188173
PMCID: PMC11972361
DOI: 10.1038/s41467-025-58314-3

Siyu Han et al. Nat Commun. 2025.

Abstract

Longitudinal multi-view omics data offer unique insights into the temporal dynamics of individual-level physiology, which provides opportunities to advance personalized healthcare. However, the common occurrence of incomplete views makes extrapolation tasks difficult, and there is a lack of tailored methods for this critical issue. Here, we introduce LEOPARD, an approach specifically designed to complete missing views in multi-timepoint omics data. By disentangling longitudinal omics data into content and temporal representations, LEOPARD transfers the temporal knowledge to the omics-specific content, thereby completing missing views. The effectiveness of LEOPARD is validated on four real-world omics datasets constructed with data from the MGH COVID study and the KORA cohort, spanning periods from 3 days to 14 years. Compared to conventional imputation methods, such as missForest, PMM, GLMM, and cGAN, LEOPARD yields the most robust results across the benchmark datasets. LEOPARD-imputed data also achieve the highest agreement with observed data in our analyses for detecting age-associated metabolites, identifying proteins associated with estimated glomerular filtration rate, and predicting chronic kidney disease. Our work takes the first step toward a generalized treatment of missing views in longitudinal omics data, enabling comprehensive exploration of temporal dynamics and providing valuable insights into personalized healthcare.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Problem description and overview of LEOPARD architecture.**
a An example of a missing view in a longitudinal multi-omics dataset. Here, some views at Timepoint T are absent. The observed views may contain additional missing data points. b An example of data density calculated from a variable in observed data (Timepoint 1 and Timepoint T) and imputed data. The data density indicates a distribution shift across the two timepoints. Imputation methods developed for cross-sectional data cannot account for temporal changes within the data, and imputation models built with data from one timepoint, such as Timepoint 1, might not be appropriate for inferring data from another timepoint, such as Timepoint T. c Compared to the raw data, Imputation 1 may exhibit a lower MSE than Imputation 2, but Imputation 1 potentially loses biological variation present in the data. d The architecture of LEOPARD. Omics data from multiple timepoints are disentangled into omics-specific content representation and timepoint-specific temporal knowledge by the content and temporal encoders. The generator learns mappings between two views, while temporal knowledge is injected into content representation via the AdaIN operation. The multi-task discriminator encourages the distributions of reconstructed data to align more closely with the actual distribution. Contrastive loss enhances the representation learning process. Reconstruction loss measures the MSE between the input and reconstructed data. Representation loss stabilizes the training process by minimizing the MSE between the representations factorized from the reconstructed and actual data. Adversarial loss is incorporated to alleviate the element-wise averaging issue of the MSE loss. e The performance of LEOPARD is evaluated with percent bias and UMAP. The central line in the box plot represents the median. The box spans the interquartile range (IQR), and whiskers extend to values within 1.5 times the IQR. Data points outside this range are plotted as outliers. The two-sided paired Wilcoxon test is used to compare percent bias across methods. P-values are Bonferroni-adjusted, with significance denoted as: ns (not significant), * (P < 0.05), ** (P < 0.01), *** (P < 0.001). f Several case studies, including both regression and classification analyses, are performed to evaluate whether biological information is preserved in the imputed data.
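
The AdaIN step named in panel d can be illustrated with a minimal sketch in plain Python. It assumes the standard form of adaptive instance normalization, in which a content vector is standardized and then rescaled and shifted. The names `adain`, `scale`, and `shift` are hypothetical, and the way LEOPARD derives the scale and shift from the temporal representation is not described in this caption.

```python
import statistics


def adain(content, scale, shift, eps=1e-5):
    """Adaptive instance normalization of a single content vector.

    content: list of floats, the content representation of one sample.
    scale, shift: floats standing in for the temporal knowledge
    (assumption: in a real model these would be predicted from the
    temporal representation, which is not shown here).
    """
    mean = statistics.fmean(content)
    var = statistics.pvariance(content, mu=mean)
    std = (var + eps) ** 0.5
    # Standardize the content, then apply the timepoint-specific statistics.
    return [scale * (c - mean) / std + shift for c in content]


# Toy content vector restyled with made-up temporal statistics.
styled = adain([1.0, 2.0, 3.0, 4.0], scale=2.0, shift=10.0)
```

After the call, `styled` has a mean of about 10 and a standard deviation of about 2, while the relative ordering of the content values is unchanged. This is the sense in which temporal statistics are imposed on an otherwise timepoint-invariant content vector.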

**Fig. 2. The representation disentanglement process of LEOPARD on the KORA multi-omics dataset.**
a The normalized temperature-scaled cross-entropy (NT-Xent)-based contrastive loss is computed for content and temporal representations. b–c Uniform manifold approximation and projection (UMAP) embeddings of content (b) and temporal (c) representations at various training epochs are visualized for the KORA multi-omics dataset’s validation set. Representations encoded from data of $v1$ and $v2$ (metabolomics and proteomics, depicted by blue and red dots) at timepoints $t1$ and $t2$ (S4 and F4, depicted by dark- and light-colored dots) are plotted. The data of $v2$ at $t2$ are imputed data produced after each training epoch, while the other data are from the observed samples in the validation set. LEOPARD’s content and temporal encoders capture signals unique to omics-specific content and temporal variations. In (b), as the training progresses, one cluster is formed by the data of $v2$ at $t1$ and $t2$ (dark and light red dots), while the other cluster is formed by the data of $v1$ at $t1$ and $t2$ (dark and light blue dots), indicating that the content encoder is able to encode timepoint-invariant content representations. Similarly, in (c), embeddings from the same timepoint cluster together. One cluster is formed by the data of $v1$ and $v2$ at $t1$ (dark blue and red dots), and the other is formed by the data of $v1$ and $v2$ at $t2$ (light blue and red dots). This demonstrates that LEOPARD can effectively factorize omics data into content and temporal representations. Source data are provided as a Source Data file.
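
As a rough illustration of the NT-Xent loss mentioned in panel a, the sketch below computes it in plain Python for a handful of toy embeddings. The function names, the temperature of 0.5, and the way positive pairs are assigned are assumptions made for the example; the caption does not specify how LEOPARD forms its pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(embeddings, positive_of, temperature=0.5):
    """Mean NT-Xent loss over all anchors.

    embeddings: list of vectors.
    positive_of: positive_of[i] is the index of the positive partner of i
    (hypothetical pairing chosen by the caller).
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Similarities of anchor i to every other embedding (self excluded).
        logits = [cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(n) if k != i]
        positive = cosine(embeddings[i], embeddings[positive_of[i]]) / temperature
        # -log( exp(positive) / sum(exp(logits)) )
        total += math.log(sum(math.exp(x) for x in logits)) - positive
    return total / n


# Toy embeddings: indices 0 and 1 are close, as are 2 and 3.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = nt_xent(toy, positive_of=[1, 0, 3, 2])
loss_mismatched = nt_xent(toy, positive_of=[2, 3, 0, 1])
```

With these toy vectors, `loss_aligned` is lower than `loss_mismatched`, because the loss rewards embeddings whose designated positives are also their nearest neighbours. That is the pressure that would drive the clustering seen in panels b and c.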

**Fig. 3. Percent bias of imputed results for the test sets of three benchmark datasets.**
Percent bias is evaluated on $D_{v=v2, t=t2}^{test}$ of the three benchmark datasets: MGH COVID proteomics dataset (upper row), KORA metabolomics dataset (middle row), and KORA multi-omics dataset (lower row), under various numbers of training observations (obsNum) from the data block to be completed. Please note that LM is used for imputation instead of GLMM when obsNum = 0. Each dot in the plots represents a percent bias value for a variable. The value below each box indicates the median, which is also represented by the central line in each box plot. The box extends from the first quartile to the third quartile, capturing the interquartile range (IQR). Whiskers extend to the smallest and largest values within 1.5 times the IQR from the quartile boundaries. Data points outside this range are plotted as outliers. The two-sided paired Wilcoxon test is used to compare percent bias across methods, with LEOPARD as the reference group. P-values are adjusted for multiple comparisons using the Bonferroni method, and significance is annotated based on cutpoints: not significant (ns), P < 0.05 (*), P < 0.01 (**), and P < 0.001 (***). Source data are provided as a Source Data file.
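
The caption reports one percent bias value per variable but does not define the metric. The sketch below assumes one plausible definition, the median absolute relative deviation of imputed from observed values across samples; both the function name `percent_bias` and this definition are assumptions, not taken from the text above.

```python
import statistics


def percent_bias(observed, imputed):
    """Median absolute relative deviation for one variable.

    observed, imputed: equal-length lists of floats, one entry per sample.
    Assumes no observed value is zero (the ratio is undefined otherwise).
    """
    deviations = [abs(i - o) / abs(o) for o, i in zip(observed, imputed)]
    return statistics.median(deviations)


# Toy values for a single variable measured in three samples.
observed_values = [10.0, 20.0, 40.0]
imputed_values = [11.0, 18.0, 40.0]
pb = percent_bias(observed_values, imputed_values)
```

Under this assumed definition, each dot in the box plots would correspond to one such value computed over all test samples for a single variable.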

**Fig. 4. UMAP representations of the imputed values and corresponding observed data from the benchmark datasets.**
Uniform manifold approximation and projection (UMAP) models are initially fitted with the training data from the MGH COVID proteomics dataset (upper row, $t1$: D0, $t2$: D3), KORA metabolomics dataset (middle row, $t1$: F4, $t2$: FF4), and KORA multi-omics dataset (lower row, $t1$: S4, $t2$: F4). Subsequently, the trained models are applied to the corresponding observed data (represented by red and green dots for $t1$ and $t2$) and the data imputed by different methods (represented by blue dots) under the setting of obsNum = 100 for the MGH COVID dataset and obsNum = 200 for the two KORA-derived datasets. The distributions of red and green dots illustrate the variation across the two timepoints, while the similarity between the distributions of blue and green dots indicates the quality of the imputed data. A high degree of similarity suggests a strong resemblance between the imputed and observed data. Source data are provided as a Source Data file.

**Fig. 5. Proteins with low abundance tend to exhibit high percent bias in the imputed values.**
Proteins with low abundance (median concentration < 4.0) tend to exhibit extremely high percent bias (> 0.8) in the values imputed when the number of training observations (obsNum) is zero. The extremely high percent bias values of LEOPARD can be lowered by increasing obsNum. Please note that LM is used for imputation instead of GLMM when obsNum = 0. Source data are provided as a Source Data file.

**Fig. 6. Regression analyses with the data imputed by different methods.**
a Volcano plots display age-associated metabolites detected in the $D_{v=v2, t=t2}^{test}$ and $\hat{D}_{v=v2, t=t2}^{test}$ (obsNum = 0) of the KORA metabolomics dataset (N = 417). Associations are assessed using linear regression, with P-values adjusted for multiple comparisons via the Bonferroni method. 18 significant metabolites (P < 0.05/36) identified in the observed data are shown in blue. Replicated metabolites from the data imputed by different methods are marked with labels. Solid dots represent variables where the observed and imputed data have matching signs for the estimate, while hollow dots represent mismatched signs. b Volcano plots display eGFR-associated proteins detected in the $D_{v=v2, t=t2}^{test}$ and $\hat{D}_{v=v2, t=t2}^{test}$ (obsNum = 0) of the KORA multi-omics dataset (N = 212). Associations are also tested using linear regression with Bonferroni-adjusted P-values. 28 significant proteins (P < 0.05/66) identified in the observed data are shown in blue. Replicated proteins from the data imputed by different methods are marked with labels. Solid dots indicate sign matches between the observed and imputed data, while hollow dots indicate mismatches. Source data are provided as a Source Data file.
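
The significance cutoffs quoted in this caption (0.05/36 and 0.05/66) are Bonferroni thresholds, that is, the nominal level divided by the number of variables tested. A small sketch of that arithmetic follows; the helper names and the toy P-values are made up for illustration and are not results from the study.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def significant(p_values, n_tests, alpha=0.05):
    """Names of variables whose raw P-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(n_tests, alpha)
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical raw P-values for three variables (not from the paper).
toy_p = {"var_a": 1e-6, "var_b": 1e-3, "var_c": 2e-3}

hits_36 = significant(toy_p, n_tests=36)  # threshold is about 0.00139
hits_66 = significant(toy_p, n_tests=66)  # threshold is about 0.00076
```

The same raw P-value of 0.001 passes with 36 tests but not with 66, which shows why the two panels use different cutoffs.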

**Fig. 7. Classification analyses with the data imputed by different methods.**
Chronic kidney disease (CKD) classification evaluated using $D_{v=v2, t=t2}^{test}$ and $\hat{D}_{v=v2, t=t2}^{test}$ (obsNum = 0) from (a) the KORA metabolomics dataset (N = 416, N_positive = 56, N_negative = 360) and (b) the KORA multi-omics dataset (N = 212, N_positive = 36, N_negative = 176). Models are trained using the balanced random forest (BRF) algorithm with identical hyperparameters and evaluated using leave-one-out cross-validation (LOOCV). Evaluation metrics in the bar plot include accuracy (ACC), F1 score, true positive rate (TPR, also known as sensitivity), true negative rate (TNR, also known as specificity), and positive predictive value (PPV, also known as precision). The dashed lines in the ROC and PR curves represent the performance of a hypothetical model with no predictive capability. Source data are provided as a Source Data file.
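
The five metrics listed in this caption all follow from the four cells of a confusion matrix. The sketch below spells out the formulas. The class sizes (36 positive, 176 negative) are taken from panel b, but the split into true and false predictions is invented purely for illustration and does not reflect any result reported by the authors.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, F1, TPR, TNR, and PPV from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # sensitivity
    tnr = tn / (tn + fp)                    # specificity
    ppv = tp / (tp + fp)                    # precision
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"ACC": acc, "F1": f1, "TPR": tpr, "TNR": tnr, "PPV": ppv}


# Hypothetical counts: 36 positives and 176 negatives, each classified
# correctly 75% of the time (made-up numbers, not from the paper).
metrics = classification_metrics(tp=27, fp=44, tn=132, fn=9)
```

In this made-up case ACC, TPR, and TNR are all 0.75, yet PPV is only about 0.38 because negatives outnumber positives roughly five to one. This kind of imbalance is a common reason for reporting PPV and PR curves alongside accuracy and ROC curves.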

**Fig. 8. Evaluation of minimum number of training samples required for LEOPARD.**
For each benchmark dataset, the average percent bias is evaluated on $D_{v=v2, t=t2}^{test}$ across 10 repeated completions for each combination of training sample sizes and numbers of training observations (obsNum). The bar indicates the median and the interquartile range (IQR) of the average percent bias values for different variables. In each repetition, the samples are selected randomly. Please note that the maximum obsNum cannot exceed the number of training samples, and the full training set of the MGH COVID proteomics dataset contains only 140 samples. Source data are provided as a Source Data file.

**Fig. 9. Performance of arbitrary temporal knowledge transfer.**
LEOPARD is evaluated on $D_{v=v2, t=t1}^{test}$ and $D_{v=v1, t=t3}^{test}$ from the Extended KORA metabolomics dataset. Timepoints $t1$, $t2$, and $t3$ correspond to the KORA S4, F4, and FF4 studies, respectively. For each completion, LEOPARD is trained with the data from the other view at the same timepoint ($D_{v=v1, t=t1}^{train}$ for $D_{v=v2, t=t1}^{test}$ and $D_{v=v2, t=t3}^{train}$ for $D_{v=v1, t=t3}^{test}$) along with varying obsNum and the data from one or two additional timepoints. For the same completion task, the evaluation shows that percent bias can be lowered by increasing obsNum or including additional timepoints in training. Each dot represents a percent bias value for a variable. Source data are provided as a Source Data file.
