Nat Commun. 2020 Jan 3;11(1):10.
doi: 10.1038/s41467-019-13807-w.

De novo generation of hit-like molecules from gene expression signatures using artificial intelligence

Oscar Méndez-Lucio et al. Nat Commun. 2020.

Abstract

Finding new molecules with a desired biological activity is an extremely difficult task. In this context, artificial intelligence and generative models have been used for molecular de novo design and compound optimization. Herein, we report a generative model that bridges systems biology and molecular design, conditioning a generative adversarial network with transcriptomic data. By doing so, we can automatically design molecules that have a high probability to induce a desired transcriptomic profile. As long as the gene expression signature of the desired state is provided, this model is able to design active-like molecules for desired targets without any previous target annotation of the training compounds. Molecules designed by this model are more similar to active compounds than the ones identified by similarity of gene expression signatures. Overall, this method represents an alternative approach to bridge chemistry and biology in the long and difficult road of drug discovery.

Conflict of interest statement

D.A.C. and J.W. are employees of Bayer AG. O.M.L., B.B., and D.R. work directly or indirectly for Bayer SAS.

Figures

Fig. 1. Graphical representation of the models and pipeline used in the study.
Molecules were encoded using a model that transforms the canonical SMILES of a molecule into a latent representation that can later be decoded into the set of grammar production rules needed to reconstruct the original SMILES (a). The generative adversarial network in b has a Stage I in which the generator (G0 in blue) takes the desired gene expression signature together with a vector of random noise and produces a molecular representation that can be decoded into SMILES using the decoder (in red). The discriminator (D0 in purple) calculates the probability that the molecular representation corresponds to a real molecule, and the conditional network (f0 in green) calculates the probability that the molecular representation matches the gene expression signature. In Stage II, the generator (G1 in blue) takes as input the desired gene expression signature together with a molecular representation (e.g., the one produced by G0) and repeats the process. The general pipeline is represented in c, where the generative adversarial network is trained with ~20,000 compounds from the L1000 dataset (see Methods for details) so that it can generate compounds from a desired gene expression signature during the prediction phase.
Fig. 2. Examples of generated molecules using a compound-induced gene expression signature.
a Distribution of the number of valid and synthesizable molecules generated for each of the 31,821 gene expression signatures used in the 10-fold cross-validation scheme. Results for Stage I are shown in blue and those for Stage II in green. b Examples of generated molecules with their reference compound obtained for each cross-validation split and their respective Tanimoto similarity using Morgan fingerprints.
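The Tanimoto similarity reported in panel b can be illustrated with a minimal sketch. It assumes each fingerprint is reduced to the set of its on-bit indices; the bit sets below are hypothetical placeholders, not fingerprints from the study, and a cheminformatics toolkit would normally compute the Morgan fingerprints themselves.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical on-bit indices standing in for Morgan fingerprints.
generated = {1, 5, 9, 12, 20}
reference = {1, 5, 9, 14}

# 3 shared bits out of 6 distinct bits in total.
print(tanimoto(generated, reference))  # prints 0.5
```

A score of 1 means the two bit sets are identical and 0 means they share no bits.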
Fig. 3. Molecules generated from target knock-out gene expression signatures.
a Distribution of similarity between all generated molecules and their closest active nearest neighbor using MACCS, Fraggle, and Morgan fingerprints for Stage I in blue and Stage II in green. b Chemical structures of some generated molecules and their closest active nearest neighbor for each of the ten different targets.
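The idea of a "closest active nearest neighbor" can be sketched as a search for the known active with the highest similarity to a generated molecule. The example below assumes Tanimoto similarity on sets of on-bit indices; the names `closest_active`, `active_A`, and `active_B` and all bit sets are hypothetical, and the study's own fingerprinting and search code is not shown in this record.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def closest_active(query, actives):
    """Return (name, similarity) of the known active most similar to query."""
    scored = ((name, tanimoto(query, bits)) for name, bits in actives.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical fingerprints for two known actives and one generated molecule.
actives = {
    "active_A": frozenset({1, 2, 3, 4}),
    "active_B": frozenset({3, 4, 5, 6, 7, 8}),
}
generated = frozenset({3, 4, 5, 6})

name, score = closest_active(generated, actives)
print(name, round(score, 3))  # prints: active_B 0.667
```

Repeating this search for every generated molecule yields one best score per molecule, which is the kind of value whose distribution panel a summarizes.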
Fig. 4. Examples of optimizing the benzene ring scaffold towards different targets using gene expression signatures.
a The encoder (in yellow) transforms the SMILES of the scaffold into a latent representation that is fed into the Stage II generator (G1 in blue) together with the desired gene expression signature. The output of G1 is the latent representation of an optimized molecule that can be decoded into a compound with a high probability to produce the gene expression signature. b Molecules generated by optimizing the benzene ring using the knock-out gene expression of AKT1, EGFR, ERG, and TP53 are shown inside the dotted circle and their closest active nearest neighbor outside the circle.
Fig. 5. Benchmarking of conditioned generative adversarial network (GAN) with similarity-based search and non-conditioned models.
a Distribution of structural similarity scores between generated molecules or compounds selected from the training set using similarity search and their closest known active molecule. Conditioned GAN generated more active-like compounds than those found by similarity search using the gene expression signature of a target knock-out. b Comparison of conditioned GAN (light blue) with a non-conditioned GAN (blue) and a non-conditioned LSTM (green) to generate compounds for a specific target. The centerline of each boxplot represents the median, the bounds of the box represent the first and third quartiles, and the whiskers extend to 1.5 times the interquartile range (IQR).
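The boxplot elements named in the caption can be computed as in the following sketch. It assumes the common Tukey convention, in which each whisker ends at the most extreme data point within 1.5 × IQR of the box, and linearly interpolated quartiles; the caption does not state which quartile method was used, and the scores are hypothetical.

```python
import statistics


def boxplot_summary(values):
    """Median, quartiles, and Tukey-style whisker ends for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Hypothetical similarity scores; 0.95 falls outside the upper fence.
scores = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.95]
summary = boxplot_summary(scores)
print(summary["median"], summary["whisker_low"], summary["whisker_high"])
```

Points beyond the whiskers, such as 0.95 here, would be drawn as individual outliers under this convention.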
