NPJ Aging. 2024 Aug 8;10(1):37.
doi: 10.1038/s41514-024-00163-3.

Precious2GPT: the combination of multiomics pretrained transformer and conditional diffusion for artificial multi-omics multi-species multi-tissue sample generation

Denis Sidorenko et al. NPJ Aging.

Abstract

Synthetic data generation in omics mimics real-world biological data, providing alternatives for training and evaluation of genomic analysis tools, controlling differential expression, and exploring data architecture. We previously developed Precious1GPT, a multimodal transformer trained on transcriptomic and methylation data, along with metadata, for predicting biological age and identifying dual-purpose therapeutic targets potentially implicated in aging and age-associated diseases. In this study, we introduce Precious2GPT, a multimodal architecture that integrates Conditional Diffusion (CDiffusion) and decoder-only Multi-omics Pretrained Transformer (MoPT) models trained on gene expression and DNA methylation data. Precious2GPT excels in synthetic data generation, outperforming Conditional Generative Adversarial Networks (CGANs), CDiffusion, and MoPT. We demonstrate that Precious2GPT is capable of generating representative synthetic data that captures tissue- and age-specific information from real transcriptomics and methylomics data. Notably, Precious2GPT surpasses other models in age prediction accuracy using the generated data, and it can generate data beyond 120 years of age. Furthermore, we showcase the potential of using this model in identifying gene signatures and potential therapeutic targets in a colorectal cancer case study.

Conflict of interest statement

The authors are affiliated with Insilico Medicine, a commercial company developing and using generative artificial intelligence and other next-generation AI technologies and robotics for drug discovery, drug development, and aging research. Utilizing its generative AI platform and a range of deep aging clocks, Insilico Medicine has developed a portfolio of multiple therapeutic programs targeting fibrotic diseases, cancer, immunological diseases, and a range of age-related diseases.

Figures

Fig. 1. Schematic representation of the P2GPT model.
The top left section of the diagram delineates the diverse omics datasets (e.g., methylation and gene expression) collected under various conditions such as age, tissue type, species, and omics types. From this initial data representation, lines branch out to indicate two separate data processing streams feeding into the CDiffusion. One stream enters the Categorical Embedding, processing discrete data features and the other enters the Continuous Embedding, handling the age data. Adjacent to these embedding blocks, the PyDeepInsight Transformation highlights another preparatory step for the input data, which is processed in parallel to the embeddings and also fed into the CDiffusion. On the left side, CDiffusion is presented in detail to reflect its centrality in the data analysis pipeline. Beneath this architecture, an Inverse PyDeepInsight block reverts the transformed data back to its omics representation after processing through the CDiffusion model. The transformed outcomes are combined with results from the CDiffusion in the FWLS block. The top right section of the figure introduces the Omics Tokenizer, serving as the preliminary stage for the LLM generation. Below the tokenizer, a larger visual represents the architecture of the LLM model. Its output is directed back into the omics space to broaden interpretability and also channeled into the FWLS, where it is integrated with the CDiffusion generations. The bottom right of the illustration showcases the Model Capabilities block. This block emphasizes various practical applications of the developed framework, including omics data generation, assembly of large open datasets, facilitation of control mechanisms for PandaOmics, the model’s capacity for target discovery, out-of-domain extrapolations, and conditional age prediction.
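The combination step in the FWLS block can be illustrated with a short sketch. It assumes that FWLS here stands for feature-weighted linear stacking, where each model's blending weight is a linear function of meta-features, so that the blend is the sum over models i and meta-features j of v[i][j] * f[j] * g[i]. The function name, the meta-features, and all numeric values below are hypothetical; the caption does not state how the block is configured or fitted.

```python
def fwls_blend(model_preds, meta_features, weights):
    """Blend per-model predictions for one output feature.

    model_preds:   g[i], one prediction per model (e.g. CDiffusion, MoPT)
    meta_features: f[j], conditioning values (a constant 1.0 plus extras)
    weights:       v[i][j], hypothetical coefficients; in feature-weighted
                   linear stacking they are usually fit by linear regression
                   on the products f[j] * g[i]
    """
    return sum(
        weights[i][j] * f * g
        for i, g in enumerate(model_preds)
        for j, f in enumerate(meta_features)
    )


# Hypothetical values: two model outputs for a single gene,
# a constant meta-feature and a scaled age.
model_preds = [0.8, 0.4]
meta_features = [1.0, 0.5]
weights = [[0.6, -0.2], [0.4, 0.2]]
blended = fwls_blend(model_preds, meta_features, weights)
```

With a single constant meta-feature this reduces to an ordinary weighted average of the two models, which is a convenient sanity check.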
Fig. 2. UMAP of real data and data generated by Precious2GPT.
Each point represents an individual sample. A Human expression data colored by data type (orange, real; blue, generated). B Human expression data colored by tissue type. C Human methylation data colored by data type (real or generated). D Human methylation data colored by tissue type. E Mouse expression data colored by data type (real or generated). F Mouse expression data colored by tissue type.
Fig. 3. Analysis of underrepresented data synthesis using P2GPT.
Each point represents a specific tissue. The x-axis shows the number of samples of that tissue present in the dataset, and the y-axis shows the number of correctly generated samples out of the expected 300. Left: Human expression. Middle: Human methylation. Right: Mouse expression.
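The y-axis quantity can be sketched as a simple per-tissue count. The caption does not say how a generated sample is judged correct; the sketch below assumes that some tissue classifier assigns a label to each generated sample and that a sample counts as correct when that label matches the requested tissue. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def correct_counts(requested, predicted):
    """Count, per requested tissue, the generated samples whose
    classifier-assigned tissue matches the requested one."""
    counts = Counter()
    for want, got in zip(requested, predicted):
        counts[want] += int(want == got)  # adds 0 on a mismatch, keeps the key
    return dict(counts)


# Toy example with 3 + 2 generated samples instead of 300 per tissue.
requested = ["blood"] * 3 + ["liver"] * 2
predicted = ["blood", "blood", "skin", "liver", "liver"]
counts = correct_counts(requested, predicted)
```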
Fig. 4. Tissues with the greatest overlap between real and generated data in differentially methylated genes identified by comparing 30-year-old with 80-year-old samples.
gen, generated.
Fig. 5. Generation of methylation data based on real data and age.
A Distribution of age groups in real methylation data for each tissue. B PCA for blood with real and generated data with different age bins for models trained on data with the whole age distribution. C PCA for blood with real and generated data with different age bins for models trained on data with age lower than 80. The black line in each PCA plot connects the cluster centroids of the age groups, from [0,20] to [100,120] for real data and [100,120] for generated blood data.
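The centroid line described in this caption can be sketched as follows, assuming the samples have already been projected onto two principal components. The 20-year bin width follows the [0,20] to [100,120] groups in the caption; treating bins as left-closed, placing age 120 in the last bin, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def age_bin(age, width=20, max_age=120):
    """Map an age to its (low, high) bin; ages at or above the last
    bin's lower edge fall into the final bin."""
    low = min(int(age // width) * width, max_age - width)
    return (low, low + width)


def bin_centroids(points, ages, width=20):
    """Return [(bin, (mean_pc1, mean_pc2)), ...] ordered by age bin.
    Joining consecutive centroids gives the trajectory line."""
    groups = defaultdict(list)
    for point, age in zip(points, ages):
        groups[age_bin(age, width)].append(point)
    return [
        (b, (sum(p[0] for p in pts) / len(pts),
             sum(p[1] for p in pts) / len(pts)))
        for b, pts in sorted(groups.items())
    ]


# Toy 2-D PCA coordinates and ages (hypothetical values).
points = [(0.0, 0.0), (2.0, 2.0), (4.0, 1.0), (10.0, 5.0)]
ages = [5, 15, 30, 110]
path = bin_centroids(points, ages)
```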
Fig. 6. Correlation matrix for colon carcinoma signatures.
Spearman correlation coefficients between colon carcinoma signatures were calculated using only landmark genes (A) and all restored genes (B). The colon carcinoma signature (PandaOmics CRC project signature) was derived from the “expression analysis” section of a manually curated colon carcinoma meta-analysis in PandaOmics and corresponds to the combined gene expression change values for colon carcinoma. The P2GPT CRC signature was collected from the corresponding meta-analysis in PandaOmics.
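A minimal sketch of a Spearman comparison between two signatures follows. It assumes each signature is a mapping from gene name to an expression change value and that the correlation is taken over the genes the two signatures share; the gene names and values are hypothetical, and this is not a description of how PandaOmics computes the coefficients.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(sig_a, sig_b):
    """Spearman rho over the genes present in both signatures."""
    genes = sorted(set(sig_a) & set(sig_b))
    a = [sig_a[g] for g in genes]
    b = [sig_b[g] for g in genes]
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical signatures; GENE_E is ignored because it is not shared.
sig_real = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.5, "GENE_D": 3.0}
sig_generated = {"GENE_A": 1.5, "GENE_B": -0.2, "GENE_C": 0.1,
                 "GENE_D": 4.0, "GENE_E": 9.0}
rho = spearman(sig_real, sig_generated)
```

Restricting the comparison to a gene subset, as in panel A, amounts to filtering both mappings to that subset before calling `spearman`.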
Fig. 7. Top 20 most promising target hypotheses for colorectal cancer.
Results were derived from the in silico Target ID scoring approach for PandaOmics colorectal carcinoma meta-analysis (A) and P2GPT colon cancer meta-analysis (B). To validate our approach, only omics-based scores with the application of a druggability filter were taken into account and used for the composition of the scores for ranking.
