Applications of Deep-Learning in Exploiting Large-Scale and Heterogeneous Compound Data in Industrial Pharmaceutical Research

Laurianne David^{1,2}, Josep Arús-Pous^{1,3}, Johan Karlsson⁴, Ola Engkvist¹, Esben Jannik Bjerrum¹, Thierry Kogej¹, Jan M Kriegl⁵, Bernd Beck⁵, Hongming Chen^{1,6}

Affiliations

¹ Hit Discovery, Discovery Sciences, Biopharmaceutical R&D, AstraZeneca, Gothenburg, Sweden.
² Department of Life Science Informatics, B-IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany.
³ Department of Chemistry and Biochemistry, University of Bern, Bern, Switzerland.
⁴ Quantitative Biology, Discovery Sciences, Biopharmaceutical R&D, AstraZeneca, Gothenburg, Sweden.
⁵ Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany.
⁶ Chemistry and Chemical Biology Centre, Guangzhou Regenerative Medicine and Health - Guangdong Laboratory, Guangzhou, China.

PMID: 31749705
PMCID: PMC6848277
DOI: 10.3389/fphar.2019.01303

Review

Laurianne David et al. Front Pharmacol. 2019 Nov 5;10:1303. doi: 10.3389/fphar.2019.01303. eCollection 2019.

Abstract

In recent years, the development of high-throughput screening (HTS) technologies and their establishment in an industrialized environment have enabled scientists to test millions of molecules and profile them against a multitude of biological targets in a short period of time, generating data at a much faster pace and with higher quality than before. Besides the structure-activity data from traditional bioassays, more complex assays such as transcriptomics profiling or imaging have also been established as routine profiling experiments thanks to advances in Next Generation Sequencing and automated microscopy technologies. In industrial pharmaceutical research, these technologies are typically established in conjunction with automated platforms to enable efficient handling of screening collections of thousands to millions of compounds. To exploit the ever-growing amount of data generated by these approaches, computational techniques are constantly evolving. In this regard, artificial intelligence technologies such as deep learning and machine learning methods play a key role in the cheminformatics and bio-image analytics fields, where they address activity prediction, scaffold hopping, de novo molecule design, reaction/retrosynthesis prediction, and high-content screening analysis. Herein we summarize the current state of analyzing large-scale compound data in industrial pharmaceutical research and describe the impact it has had on the drug discovery process over the last two decades, with a specific focus on deep-learning technologies.

Keywords: artificial intelligence; chemogenomics; large-scale data; deep learning; pharmaceutical industry.

Figures

**Figure 1**
Different categories of large-scale compound data in industrial pharmaceutical research.

**Figure 2**
Illustration of applying HTS fingerprints (HTSFP) to build multi-task learning models. A chemogenomic matrix represents the interactions between the compound collection and a panel of biological targets. Such a matrix is very often only sparsely filled with activities, and missing cells represent unknown activity for the compound/target pair. Employing machine learning and HTSFP is an example of how unknown activities can be predicted.
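
As a rough illustration of the sparse chemogenomic matrix described in this caption, the following minimal sketch stores unmeasured compound/target pairs as `None` and evaluates predictions only on the measured cells. The toy matrix, the predicted probabilities, and the function names `sparsity` and `masked_log_loss` are assumptions made for illustration; masking the loss is one common way to train multi-task models on incomplete labels and is not necessarily the approach used in the review.

```python
import math

# Toy chemogenomic matrix (hypothetical data): rows are compounds,
# columns are targets. 1 = active, 0 = inactive, None = not measured.
activity = [
    [1, None, 0],
    [None, 0, None],
    [0, 1, None],
]

# Hypothetical predicted activity probabilities for every cell,
# e.g. as produced by some multi-task model.
predicted = [
    [0.9, 0.5, 0.2],
    [0.3, 0.1, 0.6],
    [0.2, 0.8, 0.4],
]


def sparsity(matrix):
    """Fraction of compound/target cells that have no measurement."""
    cells = [value for row in matrix for value in row]
    return sum(value is None for value in cells) / len(cells)


def masked_log_loss(matrix, probabilities):
    """Mean binary cross-entropy computed over measured cells only."""
    total, count = 0.0, 0
    for row, prob_row in zip(matrix, probabilities):
        for label, p in zip(row, prob_row):
            if label is None:
                continue  # unknown activity: contributes nothing to the loss
            total -= label * math.log(p) + (1 - label) * math.log(1 - p)
            count += 1
    return total / count


print(f"unmeasured fraction: {sparsity(activity):.2f}")
print(f"masked log loss: {masked_log_loss(activity, predicted):.4f}")
```

The predictions for the unmeasured cells (for example `predicted[0][1]`) are exactly the unknown activities that such a model would be used to fill in.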

**Figure 3**
Typical neural network architecture for image classification. Alternating convolutional and max pool layers are followed by a number of fully connected layers, and finally an output layer with either sigmoid or softmax functions, depending on the task (Gawehn et al., 2016).
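
The choice between sigmoid and softmax in the output layer can be illustrated with a minimal sketch in plain Python. The logit values and class interpretation below are made up for illustration: softmax is typically used when exactly one class applies per image, while independent sigmoids are typically used when several labels can apply at once.

```python
import math


def sigmoid(logits):
    """Independent per-label probabilities (multi-label tasks)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Probabilities over mutually exclusive classes; they sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last fully connected layer for one image.
logits = [2.0, 0.0, -1.0]

print("softmax:", [round(p, 3) for p in softmax(logits)])
print("sigmoid:", [round(p, 3) for p in sigmoid(logits)])
```

With softmax, raising one class probability necessarily lowers the others; with sigmoid, each output is scored on its own.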

**Figure 4**
Process of reaction prediction on an exemplary target molecule [lidocaine (Reilly, 2009)]. Machine-learning methods are applied to, first, predict the synthetic feasibility of the molecule and, second, predict the chemical context leading to the best yield possible for the reaction.

**Figure 5**
Canonical **(A)** and randomized **(B)** SMILES representations of Aspirin. Numbers represent the atom numberings assigned by the canonicalization algorithm **(A)** or randomized **(B)**. Green arrows indicate how the molecular graph is traversed. Both SMILES strings represent the same molecule but, as the atom numbering changes, the generated SMILES strings do too. Figure extracted with permission from Arús-Pous et al. (2019b).

**Figure 6**
Sampling process of a pre-trained recurrent neural network. The generation process starts with a GO token, and at each step, the model computes a probability distribution of all possible characters. Then, the next character is sampled from it and fed back to predict the next character. The internal memory in the long short-term memory (LSTM) cells enables the predictions to take previous characters into account when generating the next character.
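
The sampling loop described in this caption can be sketched as follows. This is a minimal sketch under strong assumptions: the hand-written table `TOY_MODEL` is a hypothetical stand-in for a trained network and conditions only on the previous token, whereas an LSTM would carry an internal state summarizing all earlier characters. The token names `<GO>` and `<EOS>` and the function names are also illustrative.

```python
import random

GO, EOS = "<GO>", "<EOS>"

# Hypothetical stand-in for a pre-trained network: for each previous token,
# a probability distribution over the next token. A real LSTM would compute
# this distribution from its internal memory of the whole sequence so far.
TOY_MODEL = {
    GO: {"C": 0.7, "N": 0.2, "O": 0.1},
    "C": {"C": 0.5, "O": 0.2, "N": 0.1, EOS: 0.2},
    "N": {"C": 0.7, EOS: 0.3},
    "O": {"C": 0.6, EOS: 0.4},
}


def next_token_distribution(history):
    """Return the next-token distribution given the tokens generated so far."""
    return TOY_MODEL[history[-1]]


def sample_sequence(rng, max_len=20):
    """Sample one string token by token, starting from the GO token."""
    history = [GO]
    while len(history) - 1 < max_len:
        dist = next_token_distribution(history)
        tokens = list(dist)
        weights = [dist[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == EOS:
            break  # the model decided the sequence is complete
        history.append(token)  # feed the sampled token back as input
    return "".join(history[1:])


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for _ in range(3):
    print(sample_sequence(rng))
```

Because each character is drawn from the distribution rather than taken as the most probable one, repeated calls yield different strings, which is what lets a generative model propose many distinct molecules.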

References

1. Agrafiotis D. K., Alex S., Dai H., Derkinderen A., Farnum M., Gates P., et al. (2007). Advanced Biological and Chemical Discovery (ABCD): centralizing discovery knowledge in an inherently decentralized world. J. Chem. Inf. Model. 47, 1999–2014. doi: 10.1021/ci700267w
2. Arús-Pous J., Blaschke T., Ulander S., Reymond J. L., Chen H., Engkvist O. (2019a). Exploring the GDB-13 chemical space using deep generative models. J. Cheminform. 11, 20. doi: 10.1186/s13321-019-0341-z
3. Arús-Pous J., Johansson S., Prykhodko O., Bjerrum E. J., Tyrchan C., Reymond J.-L. (2019b). Randomized SMILES strings improve the quality of molecular generative models. ChemRxiv Prepr. Available at: https://chemrxiv.org/articles/Randomized_SMILES_Strings_Improve_the_Qual... [Accessed July 5, 2019]. doi: 10.26434/chemrxiv.8639942.v2
4. Baell J. B., Holloway G. A. (2010). New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J. Med. Chem. 53, 2719–2740. doi: 10.1021/jm901137j
5. Baell J. B., Nissink J. W. M. (2018). Seven year itch: pan-assay interference compounds (PAINS) in 2017 - utility and limitations. ACS Chem. Biol. 13, 36–44. doi: 10.1021/acschembio.7b00903
