Review
Front Pharmacol. 2019 Nov 5:10:1303.
doi: 10.3389/fphar.2019.01303. eCollection 2019.

Applications of Deep-Learning in Exploiting Large-Scale and Heterogeneous Compound Data in Industrial Pharmaceutical Research

Laurianne David et al. Front Pharmacol.

Abstract

In recent years, the development of high-throughput screening (HTS) technologies and their establishment in an industrialized environment have allowed scientists to test millions of molecules and profile them against a multitude of biological targets in a short period of time, generating data at a much faster pace and with higher quality than before. Besides the structure-activity data from traditional bioassays, more complex assays such as transcriptomics profiling or imaging have also been established as routine profiling experiments, thanks to advances in Next Generation Sequencing and automated microscopy technologies. In industrial pharmaceutical research, these technologies are typically established in conjunction with automated platforms to enable efficient handling of screening collections of thousands to millions of compounds. Computational techniques are constantly evolving to exploit the ever-growing amount of data generated by these approaches. In this regard, artificial intelligence technologies such as deep learning and machine learning methods play a key role in the cheminformatics and bio-image analytics fields, where they address activity prediction, scaffold hopping, de novo molecule design, reaction/retrosynthesis prediction, and high-content screening analysis. Herein we summarize the current state of analyzing large-scale compound data in industrial pharmaceutical research and describe the impact it has had on the drug discovery process over the last two decades, with a specific focus on deep-learning technologies.

Keywords: Artificial intelligence; Chemogenomics; Large-scale data; deep learning; pharmaceutical industry.


Figures

Figure 1
Different categories of large-scale compound data in industrial pharmaceutical research.
Figure 2
Illustration of applying HTS-FP for building multi-task learning models. A chemogenomic matrix represents the interactions between the compound collection and a panel of biological targets. Such a matrix is very often sparsely filled with activities, and missing cells represent unknown activity for the compound/target pair. Employing machine learning and HTS-FP is an example of how unknown activities can be predicted.
Figure 3
Typical neural network architecture for image classification. Alternating convolutional and max-pooling layers are followed by a number of fully connected layers and, finally, an output layer with either a sigmoid or a softmax function, depending on the task (Gawehn et al., 2016).
Figure 4
Process of reaction prediction on an exemplary target molecule [lidocaine (Reilly, 2009)]. Machine-learning methods are applied to, first, predict the synthetic feasibility of the molecule and, second, predict the chemical context leading to the best yield possible for the reaction.
Figure 5
Canonical (A) and randomized (B) SMILES representations of Aspirin. Numbers represent the atom numberings assigned by the canonicalization algorithm (A) or randomized (B). Green arrows indicate how the molecular graph is traversed. Both SMILES strings represent the same molecule but, as the atom numbering changes, the generated SMILES strings do too. Figure extracted with permission from Arús-Pous et al. (2019b).
Figure 6
Sampling process of a pre-trained recurrent neural network. The generation process starts with a GO token, and at each step, the model computes a probability distribution of all possible characters. Then, the next character is sampled from it and fed back to predict the next character. The internal memory in the long short-term memory (LSTM) cells enables the predictions to take previous characters into account when generating the next character.
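
The sampling loop described for Figure 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' model: `toy_step` is a hypothetical stand-in for one step of a pre-trained LSTM, the vocabulary and the `^`/`$` GO and END tokens are invented for the example, and the "probabilities" come from an arbitrary deterministic function rather than learned weights. The generated strings are therefore not expected to be valid SMILES; only the control flow (start from GO, compute a distribution, sample, feed the sample back) mirrors the caption.

```python
import math
import random

# Hypothetical tokens and vocabulary, chosen only for illustration.
GO, END = "^", "$"
VOCAB = ["C", "O", "N", "(", ")", "=", "1", END]


def toy_step(token, state):
    """Stand-in for one LSTM step (NOT a trained network).

    Returns a probability distribution over VOCAB and an updated state.
    The integer state depends on every token seen so far, loosely
    mimicking the internal memory of LSTM cells.
    """
    new_state = (state * 31 + ord(token)) % 1009
    # Arbitrary deterministic scores standing in for the network's logits.
    logits = [math.sin(new_state * (i + 1)) for i in range(len(VOCAB))]
    # Softmax turns the scores into a probability distribution.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps], new_state


def sample_sequence(rng, max_len=40):
    """Sample characters one at a time, feeding each one back as input."""
    token, state, out = GO, 0, []
    for _ in range(max_len):
        probs, state = toy_step(token, state)
        # Draw the next character from the predicted distribution.
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == END:
            break
        out.append(token)
    return "".join(out)


if __name__ == "__main__":
    print(sample_sequence(random.Random(0)))
```

In a real generative model, `toy_step` would be replaced by a forward pass through the trained network, and the sampled string would be checked for chemical validity afterwards.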

References

    1. Agrafiotis D. K., Alex S., Dai H., Derkinderen A., Farnum M., Gates P., et al. (2007). Advanced Biological and Chemical Discovery (ABCD): centralizing discovery knowledge in an inherently decentralized world. J. Chem. Inf. Model. 47, 1999–2014. doi: 10.1021/ci700267w
    2. Arús-Pous J., Blaschke T., Ulander S., Reymond J. L., Chen H., Engkvist O. (2019a). Exploring the GDB-13 chemical space using deep generative models. J. Cheminform. 11, 20. doi: 10.1186/s13321-019-0341-z
    3. Arús-Pous J., Johansson S., Ptykhodko O., Bjerrum E. J., Tyrchan C., Reymond J.-L. (2019b). Randomized SMILES strings improve the quality of molecular generative models. ChemRxiv Prepr. Available at: https://chemrxiv.org/articles/Randomized_SMILES_Strings_Improve_the_Qual... [Accessed July 5, 2019]. doi: 10.26434/chemrxiv.8639942.v2
    4. Baell J. B., Holloway G. A. (2010). New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J. Med. Chem. 53, 2719–2740. doi: 10.1021/jm901137j
    5. Baell J. B., Nissink J. W. M. (2018). Seven year itch: pan-assay interference compounds (PAINS) in 2017 - utility and limitations. ACS Chem. Biol. 13, 36–44. doi: 10.1021/acschembio.7b00903
