Nat Commun. 2024 Feb 29;15(1):1853.
doi: 10.1038/s41467-024-46089-y.

Drug target prediction through deep learning functional representation of gene signatures

Hao Chen et al. Nat Commun. 2024.

Abstract

Many machine learning applications in bioinformatics currently rely on matching gene identities when analyzing input gene signatures and fail to take advantage of preexisting knowledge about gene functions. To further enable comparative analysis of OMICS datasets, including target deconvolution and mechanism of action studies, we develop an approach that represents gene signatures projected onto their biological functions, instead of their identities, similar to how the word2vec technique works in natural language processing. We develop the Functional Representation of Gene Signatures (FRoGS) approach by training a deep learning model and demonstrate that its application to the Broad Institute's L1000 datasets results in more effective compound-target predictions than models based on gene identities alone. By integrating additional pharmacological activity data sources, FRoGS significantly increases the number of high-quality compound-target predictions relative to existing approaches, many of which are supported by in silico and/or experimental evidence. These results underscore the general utility of FRoGS in machine learning-based bioinformatics applications. Prediction networks pre-equipped with the knowledge of gene functions may help uncover new relationships among gene signatures acquired by large-scale OMICs studies on compounds, cell types, disease models, and patient cohorts.
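
The contrast between matching gene identities and matching gene functions can be illustrated with a toy example. The sketch below is not the FRoGS model, which is a trained deep learning network; it only assumes that each gene already has an embedding vector and compares signatures by the cosine similarity of their mean vectors. The gene names, the two-dimensional embedding values, and the helper functions are all hypothetical.

```python
import math

# Hypothetical 2-D functional embeddings (illustrative values only).
# Genes A-F are placed close together to mimic a shared function W;
# genes X and Y mimic an unrelated function.
EMBEDDINGS = {
    "A": (0.90, 0.10), "B": (0.80, 0.20), "C": (0.85, 0.15),
    "D": (0.90, 0.20), "E": (0.80, 0.10), "F": (0.95, 0.10),
    "X": (0.10, 0.90), "Y": (0.20, 0.80),
}


def jaccard(sig1, sig2):
    """Identity-based overlap: shared genes divided by all genes."""
    a, b = set(sig1), set(sig2)
    return len(a & b) / len(a | b)


def mean_vector(signature):
    """Average the embedding vectors of the genes in a signature."""
    vectors = [EMBEDDINGS[gene] for gene in signature]
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def functional_similarity(sig1, sig2):
    """Function-based overlap: cosine of the two mean embeddings."""
    return cosine(mean_vector(sig1), mean_vector(sig2))


sig1 = ["A", "B", "C"]   # shares no gene with sig2, but the same function
sig2 = ["D", "E", "F"]
sig3 = ["X", "Y"]        # different function

print(jaccard(sig1, sig2))                # identity view: no overlap
print(functional_similarity(sig1, sig2))  # function view: highly similar
print(functional_similarity(sig1, sig3))  # function view: dissimilar
```

In this toy setting the two signatures that share no genes still score as highly similar once they are compared in the embedding space, which is the kind of weak signal an identity-based comparison cannot see.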

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. FRoGS can extract weak pathway signals.
a Comparison between two hypothetical gene signatures. Only genes A and B in Pathway W are considered to overlap based on gene identity (top), similar to the use of one-hot encoding in NLP. Genes A–F contribute to signature overlap if all genes with the same function W are considered (bottom), similar to the use of word2vec. b t-SNE projection of gene embedding vectors, where each marker represents a gene. Markers are colored by their top-level functions annotated in GO. c Each of the 460 Reactome pathways was used to simulate foreground gene signatures generated under varying signal levels with λ at 5, 10, 15, and 20. The separation between foreground-foreground and foreground-background pairs is defined as -log10(p) based on the one-sided Wilcoxon signed-rank test (n = 200 simulations). The larger the value, the more sensitively the method separates the two types of signature pairs. Each pathway contributes one data point to each box plot. Box-and-whisker plots show the median (center line) and the 25th and 75th percentiles (lower and upper boundaries), with 1.5 × interquartile range indicated by whiskers and outliers shown as individual data points. Source data are provided as a Source Data file.
Fig. 2. FRoGS model predicts compound-target associations.
a The neural network architecture predicting the probability of a compound binding to a target based on their L1000 gene set signature embeddings. b Comparison of multiple L1000-based prediction models. The FRoGS model performed best, whereas the CMap score and OPA2Vec GO performed similarly to random models. Source data are provided as a Source Data file.
Fig. 3. Model L predictions are supported by multiple orthogonal data sources.
a Validation categories for 2491 Model L predictions. b A “structure-likely” example for nitrendipine targeting CHRM3 is supported by its structural similarity to the reference analog nicardipine. c, d Perphenazine targeting DRD1 and mephentermine targeting HRH1 are strongly supported by the pQSAR dataset. e Trametinib targeting MEK1/2 is supported by the NCI60 dataset. f Bortezomib targeting BSMP1 is supported by the PSP dataset. Both axes in (b–e) are normalized Z-score assay activities and in (f) are GI50 scores. Error bands in (b–e) represent the 95% confidence interval of the regression fit. Source data are provided as a Source Data file.
Fig. 4. The compound-target network predicted by the combined model.
a The network simplified by keeping ≤ 5 best-scoring targets per compound and ≤ 10 best-scoring compounds per target. Styles of nodes and edges are explained in the figure legend. b Network for L01X “other antineoplastic agents”. c Targets in the L01X subnetwork show strong enrichment of pathways related to tyrosine phosphorylation and kinase-related signal transduction processes. One-sided Fisher's exact test; statistics, including multi-test adjusted p-values, are in the source data. Source data are provided as a Source Data file.
Fig. 5. Experimental confirmation of predicted kinase inhibitors.
a Single-dose confirmation rate for predicted kinase inhibitors (black) versus those not predicted to bind (gray). b Dose-response confirmation rate for compounds validated by single-dose confirmation versus those not validated. Error bars, mean ± SD, one-sided chi-square test. ***p < 0.001. c The IC50 heatmap of 191 compounds across 9 kinases. True positives are in dark blue and false negatives are in dark yellow. Light blue indicates true negatives and light yellow indicates false positives. Overall, 50% of compounds are selective (bound in ≤3 assays). Statistics are provided in Supplementary Tables 2–3.
Fig. 6. Validation of AhR binders.
a The confirmation results of 333 compounds predicted to target AhR. b–j Dose-response data for selected AhR binders. Assays are color coded as blue for agonist, red for antagonist, and green for toxicity. X axes are concentrations in µM and Y axes are normalized response, with the agonist signals scaled for ease of visual comparison.
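
The network simplification described for Fig. 4a (at most 5 best-scoring targets per compound and at most 10 best-scoring compounds per target) can be sketched as a simple pruning step. This is a minimal illustration under stated assumptions, not the authors' code: the function name `prune_network`, the `(compound, target, score)` edge format, the convention that a higher score is better, and the choice to keep an edge only when it passes both limits are all assumptions made here.

```python
from collections import defaultdict


def prune_network(edges, max_targets=5, max_compounds=10):
    """Keep an edge only if the target is among the compound's best-scoring
    targets AND the compound is among the target's best-scoring compounds.

    edges: iterable of (compound, target, score); higher score = better
    (an assumption made for this sketch).
    """
    edges = list(edges)
    by_compound = defaultdict(list)
    by_target = defaultdict(list)
    for compound, target, score in edges:
        by_compound[compound].append((score, target))
        by_target[target].append((score, compound))

    # Best-scoring targets for each compound.
    top_targets = {
        compound: {t for _, t in sorted(pairs, reverse=True)[:max_targets]}
        for compound, pairs in by_compound.items()
    }
    # Best-scoring compounds for each target.
    top_compounds = {
        target: {c for _, c in sorted(pairs, reverse=True)[:max_compounds]}
        for target, pairs in by_target.items()
    }
    return [
        (c, t, s)
        for c, t, s in edges
        if t in top_targets[c] and c in top_compounds[t]
    ]


# Hypothetical input: one compound with seven predicted targets.
example_edges = [("c1", f"t{i}", float(i)) for i in range(1, 8)]
kept = prune_network(example_edges)
print(kept)
```

With the default limits, the single compound in the example keeps only its five highest-scoring targets.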
