2022 Sep 20;23(5):bbac257.
doi: 10.1093/bib/bbac257.

AI for predicting chemical-effect associations at the chemical universe level: deepFPlearn


Jana Schor et al. Brief Bioinform.

Abstract

Many chemicals are present in our environment, and all living species are exposed to them. However, numerous chemicals pose risks, such as the development of severe diseases, if they occur at the wrong time in the wrong place. For the majority of chemicals, these risks are not known. Chemical risk assessment and the subsequent regulation of use require efficient and systematic strategies. Lab-based methods, even high-throughput ones, are too slow to keep up with the pace of chemical innovation. Existing computational approaches are designed for specific chemical classes or sub-problems and are not usable on a large scale. Further, the application range of these approaches is limited by the small amount of available labeled training data. We present the ready-to-use, stand-alone program deepFPlearn, which predicts associations between chemical structures and effects at the gene/pathway level using a combined deep learning approach. deepFPlearn uses a deep autoencoder for feature reduction before training a deep feed-forward neural network to predict the target association. We achieved good prediction quality and showed that our feature compression preserves relevant chemical structural information. Using a vast chemical inventory (unlabeled data) as input for the autoencoder did not reduce our prediction quality but allowed us to capture a much more comprehensive range of chemical structures. We predict meaningful, experimentally verified associations of chemicals and effects on unseen data. deepFPlearn classifies hundreds of thousands of chemicals in seconds. We provide deepFPlearn as an open-source and flexible tool that can easily be retrained and customized for different application settings at https://github.com/yigbt/deepFPlearn.

Keywords: Deep learning; autoencoder; binary fingerprint; molecular structures; toxicology.


Figures

Figure 1
The deepFPlearn workflow. (A) The molecular fingerprints serve as input for the neural networks. (B) An AE (autoencoder) is used to compress the fingerprints. (C) An FNN (feed-forward neural network) is used for direct classification of the input. (D) An FNN is used for classification of the compressed input. The sizes of the layers and the activation and loss functions differ for each network and depend on the input size; see the Methods section.
Figure 2
(A) ROC-AUC and loss values during training (calculated on the training and validation data after each epoch) of the specific (Sun et al. 2019) and the generic (CompTox) autoencoder. Training of the specific AE stopped early at 28 epochs, due to the small number of available training samples, and reached a validation loss of 0.026. Training of the generic AE stopped at ≈320 epochs, reaching a validation loss of 0.159. (B) UMAP visualizations of uncompressed and compressed representations of all compounds from the dataset; the color indicates cluster assignment of a k-means clustering on the uncompressed features.
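The early stopping described above (training halting once the validation loss stops improving) can be sketched in plain Python. The function name, the `patience` default, and the `min_delta` parameter are illustrative assumptions, not taken from the paper; deepFPlearn itself relies on Keras-style callbacks for this:

```python
def early_stop(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Stops once the validation loss has not improved by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None           # training ran to the end without stopping
```

With a loss curve that plateaus after epoch 3 and `patience=3`, the sketch stops at epoch 6; a strictly decreasing curve never triggers the stop.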
Figure 3
(A) Training histories of the feed-forward neural networks, stratified by the selected targets/models for the androgen (AR) and estrogen (ER) receptors and endocrine disruption (ED), and by the degree of feature compression (uncompressed, specific AE, and generic AE); the metrics shown are ROC-AUC (red) and loss (orange), calculated on the training (dotted) and validation (solid) data during training. (B) Comparison of balanced accuracy (Balanced ACC), area under the receiver operating characteristic curve (AUC), precision (PREC), recall (REC), F1 score (F1), specificity (SPEC), and Matthews correlation coefficient (MCC) of the individual models using no (light gray), the specific (medium gray), and the generic AE (dark gray). (C) MCC calculated for increasing thresholds from 0 to 1 on the predicted validation data. The threshold with maximum MCC was selected as the individual classification threshold for each model. Example generated for the AR model with uncompressed input.
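The threshold selection in panel (C), scanning candidate cutoffs and keeping the one that maximizes the MCC, can be sketched as follows. Function names and the grid of 101 thresholds are assumptions for illustration, not deepFPlearn's exact implementation:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(y_true, y_prob, steps=101):
    """Scan thresholds in [0, 1] and return (threshold, mcc) maximizing MCC."""
    best = (0.0, -1.0)
    for i in range(steps):
        t = i / (steps - 1)
        # Confusion-matrix counts at cutoff t (probability >= t predicts class 1)
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        score = mcc(tp, tn, fp, fn)
        if score > best[1]:
            best = (t, score)
    return best
```

For a perfectly separable toy example the scan returns a cutoff between the two classes with an MCC of 1.0.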
Figure 4
Receiver operating characteristic (left of both panels) and precision–recall (right of both panels) curves of a single fold of the AR target without feature compression (A) and with generic feature compression (B). The color indicates the value of the respective classification threshold. Supplemental Figure S2 depicts the standard deviations of the AUC for the five folds.
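The area under such a ROC curve has a useful rank interpretation: it equals the probability that a randomly chosen active compound is scored higher than a randomly chosen inactive one. A minimal sketch of that equivalence (function name is illustrative; libraries such as scikit-learn compute this via the trapezoidal rule instead):

```python
def roc_auc(y_true, y_prob):
    """ROC-AUC as the probability a random positive outranks a random negative.

    Ties count as half a win, matching the standard definition.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfect separation gives 1.0, identical scores give 0.5, and fully inverted ranking gives 0.0.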
Figure 5
(A) Values of all metrics calculated on the validation data for the benchmarking datasets SIDER and Tox21, summarized across all targets: balanced accuracy (Balanced ACC), area under the receiver operating characteristic curve (AUC), precision (PREC), recall (REC), F1 score (F1), specificity (SPEC), and MCC of the individual models using no (light gray), the specific (medium gray), and the generic AE (dark gray). (B) deepFPlearn prediction probabilities using the ED model with the generic AE on compounds that have been experimentally measured for quantified target association and differentiated into selectively and non-specifically acting compounds by Escher et al. [11]. Probability distributions are compared using the Kolmogorov–Smirnov test; the significance level for rejecting the null hypothesis that both distributions are similar is marked with * for P-values below 0.05. (C) Comparison of the counts of predicted 1 (active) and 0 (inactive) labels for the same compounds as in panel (B), shown for the ED model.
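The two-sample Kolmogorov–Smirnov statistic used in panel (B) is the largest vertical distance between the empirical CDFs of the two probability distributions. A minimal pure-Python sketch (function name is illustrative; the caption's actual test, including the P-value, corresponds to `scipy.stats.ks_2samp`):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    def ecdf(sample, x):
        # Fraction of the sample at or below x
        return sum(1 for v in sample if v <= x) / len(sample)

    # The maximum gap can only occur at an observed value
    values = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in values)
```

Fully disjoint samples give a statistic of 1.0; identical samples give 0.0. In practice `scipy.stats.ks_2samp(a, b)` additionally returns the P-value used for the significance stars.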

References

    1. Classification on imbalanced data, 2022. URL https://www.tensorflow.org/tutorials/structured_data/imbalanced_data (18 April 2022, date last accessed). Online tutorial.
    2. Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016. Savannah, GA: USENIX Association, 2016, 265–83.
    3. Anderson MG, Mcdonnell J, Ximing C, et al. The Challenge of Micropollutants in Aquatic Systems. Science 2006;313:1072–7.
    4. Bento AP, Hersey A, Félix E, et al. An open source chemical structure curation pipeline using RDKit. J Cheminform 2020;12:9. doi: 10.1186/s13321-020-00456-1.
    5. Biewald L. Experiment tracking with Weights and Biases, 2020. URL https://www.wandb.com/ (19 April 2022, date last accessed). Software available from wandb.com.
