Nat Commun. 2025 Apr 6;16(1):3278.
doi: 10.1038/s41467-025-58314-3.

LEOPARD: missing view completion for multi-timepoint omics data via representation disentanglement and temporal knowledge transfer

Siyu Han et al. Nat Commun.

Abstract

Longitudinal multi-view omics data offer unique insights into the temporal dynamics of individual-level physiology, which provides opportunities to advance personalized healthcare. However, the common occurrence of incomplete views makes extrapolation tasks difficult, and there is a lack of tailored methods for this critical issue. Here, we introduce LEOPARD, an innovative approach specifically designed to complete missing views in multi-timepoint omics data. By disentangling longitudinal omics data into content and temporal representations, LEOPARD transfers the temporal knowledge to the omics-specific content, thereby completing missing views. The effectiveness of LEOPARD is validated on four real-world omics datasets constructed with data from the MGH COVID study and the KORA cohort, spanning periods from 3 days to 14 years. Compared to conventional imputation methods, such as missForest, PMM, GLMM, and cGAN, LEOPARD yields the most robust results across the benchmark datasets. LEOPARD-imputed data also achieve the highest agreement with observed data in our analyses for age-associated metabolites detection, estimated glomerular filtration rate-associated proteins identification, and chronic kidney disease prediction. Our work takes the first step toward a generalized treatment of missing views in longitudinal omics data, enabling comprehensive exploration of temporal dynamics and providing valuable insights into personalized healthcare.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Problem description and overview of LEOPARD architecture.
a An example of a missing view in a longitudinal multi-omics dataset. Here, some views at Timepoint T are absent. The observed views may contain additional missing data points. b An example of data density calculated from a variable in observed data (Timepoint 1 and Timepoint T) and imputed data. The data density indicates a distribution shift across the two timepoints. Imputation methods developed for cross-sectional data cannot account for temporal changes within the data, and imputation models built with data from one timepoint, such as Timepoint 1, might not be appropriate for inferring data from another timepoint, such as Timepoint T. c Relative to the raw data, Imputation 1 may yield a lower MSE than Imputation 2, yet Imputation 1 may lose biological variation present in the data. d The architecture of LEOPARD. Omics data from multiple timepoints are disentangled into omics-specific content representation and timepoint-specific temporal knowledge by the content and temporal encoders. The generator learns mappings between two views, while temporal knowledge is injected into content representation via the AdaIN operation. The multi-task discriminator encourages the distributions of reconstructed data to align more closely with the actual distribution. Contrastive loss enhances the representation learning process. Reconstruction loss measures the MSE between the input and reconstructed data. Representation loss stabilizes the training process by minimizing the MSE between the representations factorized from the reconstructed and actual data. Adversarial loss is incorporated to alleviate the element-wise averaging issue of the MSE loss. e The performance of LEOPARD is evaluated with percent bias and UMAP. The central line in the box plot represents the median. The box spans the interquartile range (IQR), and whiskers extend to values within 1.5 times the IQR. Data points outside this range are plotted as outliers. The two-sided paired Wilcoxon test is used to compare percent bias across methods. P-values are Bonferroni-adjusted, with significance denoted as: ns (not significant), * (P < 0.05), ** (P < 0.01), *** (P < 0.001). f Several case studies, including both regression and classification analyses, are performed to evaluate whether biological information is preserved in the imputed data.
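The AdaIN step named in panel d can be illustrated with a minimal sketch. It assumes the standard adaptive instance normalization form, in which a content vector is normalized to zero mean and unit variance and then rescaled and shifted. The `scale` and `shift` arguments stand in for values derived from the temporal representation; how LEOPARD derives them is not stated in this caption, and the function name and toy numbers are hypothetical.

```python
import math


def adain(content, scale, shift, eps=1e-5):
    """Normalize a content vector, then apply a scale and a shift.

    `scale` and `shift` are assumed to come from the temporal
    representation (hypothetical inputs for this sketch).
    """
    n = len(content)
    mean = sum(content) / n
    variance = sum((c - mean) ** 2 for c in content) / n
    std = math.sqrt(variance + eps)  # eps avoids division by zero
    return [scale * (c - mean) / std + shift for c in content]


# Toy content representation for one sample (made-up numbers).
content = [0.2, 1.4, -0.6, 3.0]
stylized = adain(content, scale=2.0, shift=0.5)

# The output statistics now follow the injected scale and shift.
out_mean = sum(stylized) / len(stylized)
out_std = math.sqrt(sum((s - out_mean) ** 2 for s in stylized) / len(stylized))
```

After the call, the mean of the output matches `shift` and its standard deviation matches `scale` up to the small `eps` term, which is the sense in which timepoint-specific statistics are imposed on timepoint-invariant content.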
Fig. 2. The representation disentanglement process of LEOPARD on the KORA multi-omics dataset.
a The normalized temperature-scaled cross-entropy (NT-Xent)-based contrastive loss is computed for content and temporal representations. b, c Uniform manifold approximation and projection (UMAP) embeddings of content (b) and temporal (c) representations at various training epochs are visualized for the validation set of the KORA multi-omics dataset. Representations encoded from data of v1 and v2 (metabolomics and proteomics, depicted by blue and red dots) at timepoints t1 and t2 (S4 and F4, depicted by dark- and light-colored dots) are plotted. The data of v2 at t2 are imputed data produced after each training epoch, while the other data are from the observed samples in the validation set. LEOPARD's content and temporal encoders capture signals unique to omics-specific content and temporal variations. In (b), as training progresses, one cluster is formed by the data of v2 at t1 and t2 (dark and light red dots), while the other cluster is formed by the data of v1 at t1 and t2 (dark and light blue dots), indicating that the content encoder is able to encode timepoint-invariant content representations. Similarly, in (c), embeddings from the same timepoint cluster together. One cluster is formed by the data of v1 and v2 at t1 (dark blue and red dots), and the other is formed by the data of v1 and v2 at t2 (light blue and red dots). This demonstrates that LEOPARD can effectively factorize omics data into content and temporal representations. Source data are provided as a Source Data file.
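The NT-Xent loss in panel a can be sketched in plain Python. This is a generic formulation with cosine similarity, not the authors' implementation. The pairing rule is an assumption read off panels b and c: for content representations the positive partner is the same view at the other timepoint, and for temporal representations it is the other view at the same timepoint. The function names, temperature, and embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(embeddings, positive_of, temperature=0.5):
    """Mean NT-Xent loss over all anchors.

    positive_of[i] is the index of the positive partner of anchor i;
    every other embedding in the batch acts as a negative.
    """
    total = 0.0
    for i, z_i in enumerate(embeddings):
        logits = {
            k: cosine(z_i, z_k) / temperature
            for k, z_k in enumerate(embeddings)
            if k != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[positive_of[i]]
    return total / len(embeddings)


# Order: (v1, t1), (v1, t2), (v2, t1), (v2, t2). Made-up embeddings that
# cluster by view, as a trained content encoder is expected to produce.
embeddings = [[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0], [0.1, 0.9]]

content_pairs = [1, 0, 3, 2]   # same view, other timepoint
temporal_pairs = [2, 3, 0, 1]  # same timepoint, other view

loss_content = nt_xent(embeddings, content_pairs)
loss_temporal = nt_xent(embeddings, temporal_pairs)
```

Because these toy embeddings group by view, the loss under the content pairing is lower than under the temporal pairing, mirroring the clustering described for panel b.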
Fig. 3. Percent bias of imputed results for the test sets of three benchmark datasets.
Percent bias is evaluated on $D^{\text{test}}_{v=v2,\,t=t2}$ of the three benchmark datasets: MGH COVID proteomics dataset (upper row), KORA metabolomics dataset (middle row), and KORA multi-omics dataset (lower row), under various numbers of training observations (obsNum) from the data block to be completed. Please note that LM is used for imputation instead of GLMM when obsNum = 0. Each dot in the plots represents a percent bias value for a variable. The value below each box indicates the median, which is also represented by the central line in each box plot. The box extends from the first quartile to the third quartile, capturing the interquartile range (IQR). Whiskers extend to the smallest and largest values within 1.5 times the IQR from the quartile boundaries. Data points outside this range are plotted as outliers. The two-sided paired Wilcoxon test is used to compare percent bias across methods, with LEOPARD as the reference group. P-values are adjusted for multiple comparisons using the Bonferroni method, and significance is annotated based on cutpoints: not significant (ns), P < 0.05 (*), P < 0.01 (**), and P < 0.001 (***). Source data are provided as a Source Data file.
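A per-variable percent bias of the kind plotted here could be computed as in the sketch below. The caption does not give the formula, so the definition used (absolute relative deviation between imputed and observed values, aggregated over samples) is an assumption, as is the choice of the mean as the default aggregate. The function name, variable names, and numbers are hypothetical.

```python
import statistics


def percent_bias(observed, imputed, aggregate=statistics.mean):
    """Aggregate |imputed - observed| / |observed| over the samples of one variable.

    Samples with an observed value of zero are skipped because the
    ratio is undefined for them. Pass aggregate=statistics.median to
    use a median instead of the assumed mean.
    """
    ratios = [abs(i - o) / abs(o) for o, i in zip(observed, imputed) if o != 0]
    if not ratios:
        raise ValueError("no nonzero observed values to compare against")
    return aggregate(ratios)


# Made-up observed and imputed values for two variables.
observed = {"protein_a": [10.0, 20.0, 40.0], "protein_b": [2.0, 4.0, 8.0]}
imputed = {"protein_a": [11.0, 18.0, 40.0], "protein_b": [3.0, 2.0, 8.0]}

# One value per variable, corresponding to one dot in a box plot.
per_variable = {
    name: percent_bias(observed[name], imputed[name]) for name in observed
}

# The number printed below each box is the median across variables.
median_bias = statistics.median(per_variable.values())
```

Values are fractions rather than percentages, which is consistent with thresholds such as 0.8 used elsewhere on this page.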
Fig. 4. UMAP representations of the imputed values and corresponding observed data from the benchmark datasets.
Uniform manifold approximation and projection (UMAP) models are initially fitted with the training data from the MGH COVID proteomics dataset (upper row, t1: D0, t2: D3), KORA metabolomics dataset (middle row, t1: F4, t2: FF4), and KORA multi-omics dataset (lower row, t1: S4, t2: F4). Subsequently, the trained models are applied to the corresponding observed data (represented by red and green dots for t1 and t2) and the data imputed by different methods (represented by blue dots) under the setting of obsNum = 100 for the MGH COVID dataset and obsNum = 200 for the two KORA-derived datasets. The distributions of red and green dots illustrate the variation across the two timepoints, while the similarity between the distributions of blue and green dots indicates the quality of the imputed data. A high degree of similarity suggests a strong resemblance between the imputed and observed data. Source data are provided as a Source Data file.
Fig. 5. Proteins with low abundance tend to exhibit high percent bias in the imputed values.
Proteins with low abundance (median concentration < 4.0) tend to exhibit extremely high percent bias (> 0.8) in the values imputed when the number of training observations (obsNum) is zero. The extremely high percent bias values of LEOPARD can be lowered by increasing obsNum. Please note that LM is used for imputation instead of GLMM when obsNum = 0. Source data are provided as a Source Data file.
Fig. 6. Regression analyses with the data imputed by different methods.
a Volcano plots display age-associated metabolites detected in the $D^{\text{test}}_{v=v2,\,t=t2}$ and $\hat{D}^{\text{test}}_{v=v2,\,t=t2}$ (obsNum = 0) of the KORA metabolomics dataset (N = 417). Associations are assessed using linear regression, with P-values adjusted for multiple comparisons via the Bonferroni method. 18 significant metabolites (P < 0.05/36) identified in the observed data are shown in blue. Replicated metabolites from the data imputed by different methods are marked with labels. Solid dots represent variables where the observed and imputed data have matching signs for the estimate, while hollow dots represent mismatched signs. b Volcano plots display eGFR-associated proteins detected in the $D^{\text{test}}_{v=v2,\,t=t2}$ and $\hat{D}^{\text{test}}_{v=v2,\,t=t2}$ (obsNum = 0) of the KORA multi-omics dataset (N = 212). Associations are also tested using linear regression with Bonferroni-adjusted P-values. 28 significant proteins (P < 0.05/66) identified in the observed data are shown in blue. Replicated proteins from the data imputed by different methods are marked with labels. Solid dots indicate sign matches between the observed and imputed data, while hollow dots indicate mismatches. Source data are provided as a Source Data file.
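The significance and replication logic in this caption can be sketched as follows. The Bonferroni threshold is the nominal level divided by the number of tested variables (for example 0.05/36), a variable counts as replicated when it passes the threshold in both observed and imputed data, and the solid or hollow marker reflects whether the regression estimates share a sign. The helper names and P-values are hypothetical, and the regression fitting itself is left out.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the names whose P-value is below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return {name for name, p in p_values.items() if p < threshold}


def sign_match(estimate_observed, estimate_imputed):
    """True when both regression estimates point in the same direction."""
    return (estimate_observed > 0) == (estimate_imputed > 0)


# Made-up P-values for 36 variables tested in the observed data.
observed_p = {f"metabolite_{i}": 0.5 for i in range(36)}
observed_p["metabolite_0"] = 1e-4  # below 0.05 / 36
observed_p["metabolite_1"] = 2e-3  # nominally small, but above 0.05 / 36

# Made-up P-values for the same variables tested in imputed data.
imputed_p = dict(observed_p)
imputed_p["metabolite_0"] = 5e-4

significant_observed = bonferroni_significant(observed_p)
replicated = significant_observed & bonferroni_significant(imputed_p)
```

With 36 tests the threshold is about 0.00139, so only `metabolite_0` is significant in the observed data, and it is also replicated in the imputed data.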
Fig. 7. Classification analyses with the data imputed by different methods.
Chronic kidney disease (CKD) classification evaluated using $D^{\text{test}}_{v=v2,\,t=t2}$ and $\hat{D}^{\text{test}}_{v=v2,\,t=t2}$ (obsNum = 0) from (a) the KORA metabolomics dataset (N = 416, $N_{\text{positive}}$ = 56, $N_{\text{negative}}$ = 360) and (b) the KORA multi-omics dataset (N = 212, $N_{\text{positive}}$ = 36, $N_{\text{negative}}$ = 176). Models are trained using the balanced random forest (BRF) algorithm with identical hyperparameters and evaluated using leave-one-out cross-validation (LOOCV). Evaluation metrics in the bar plot include accuracy (ACC), F1 score, true positive rate (TPR, also known as sensitivity), true negative rate (TNR, also known as specificity), and positive predictive value (PPV, also known as precision). The dashed lines in the ROC and PR curves represent the performance of a hypothetical model with no predictive capability. Source data are provided as a Source Data file.
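The metrics listed in this caption follow directly from the confusion-matrix counts, and LOOCV holds out one sample per fold. The sketch below shows both with the standard library only. It does not include the balanced random forest itself, and the function names, labels, and predictions are hypothetical.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, held_out_index) for each LOOCV fold."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


def classification_metrics(y_true, y_pred):
    """Compute ACC, F1, TPR, TNR, and PPV for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Report 0.0 when a metric is undefined for the given counts.
        return numerator / denominator if denominator else 0.0

    tpr = ratio(tp, tp + fn)  # sensitivity
    tnr = ratio(tn, tn + fp)  # specificity
    ppv = ratio(tp, tp + fp)  # precision
    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "F1": ratio(2 * ppv * tpr, ppv + tpr),
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
    }


# Made-up labels and the prediction made for each sample when held out.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]

folds = list(leave_one_out(len(y_true)))
metrics = classification_metrics(y_true, y_pred)
```

In a full LOOCV run, each entry of `y_pred` would come from a model trained on the `train` indices of the corresponding fold; here the predictions are supplied directly to keep the example short.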
Fig. 8. Evaluation of minimum number of training samples required for LEOPARD.
For each benchmark dataset, the average percent bias is evaluated on $D^{\text{test}}_{v=v2,\,t=t2}$ across 10 repeated completions for each combination of training sample sizes and numbers of training observations (obsNum). The bar indicates the median and the interquartile range (IQR) of the average percent bias values for different variables. In each repetition, the samples are selected randomly. Please note that the maximum obsNum cannot exceed the number of training samples, and the full training set of the MGH COVID proteomics dataset contains only 140 samples. Source data are provided as a Source Data file.
Fig. 9. Performance of arbitrary temporal knowledge transfer.
LEOPARD is evaluated on $D^{\text{test}}_{v=v2,\,t=t1}$ and $D^{\text{test}}_{v=v1,\,t=t3}$ from the Extended KORA metabolomics dataset. Timepoints t1, t2, and t3 correspond to the KORA S4, F4, and FF4 studies, respectively. For each completion, LEOPARD is trained with the data from the other view at the same timepoint ($D^{\text{train}}_{v=v1,\,t=t1}$ for $D^{\text{test}}_{v=v2,\,t=t1}$ and $D^{\text{train}}_{v=v2,\,t=t3}$ for $D^{\text{test}}_{v=v1,\,t=t3}$) along with varying obsNum and the data from one or two additional timepoints. For the same completion task, the evaluation shows that percent bias can be lowered by increasing obsNum or by including additional timepoints in training. Each dot represents a percent bias value for a variable. Source data are provided as a Source Data file.
