2022 Sep 20;23(5):bbac257.
doi: 10.1093/bib/bbac257.

AI for predicting chemical-effect associations at the chemical universe level: deepFPlearn


Jana Schor et al. Brief Bioinform.

Abstract

Many chemicals are present in our environment, and all living species are exposed to them. However, numerous chemicals pose risks, such as the development of severe diseases, if they occur at the wrong time in the wrong place. For the majority of chemicals, these risks are not known. Chemical risk assessment and the subsequent regulation of use require efficient and systematic strategies. Lab-based methods, even high-throughput ones, are too slow to keep up with the pace of chemical innovation. Existing computational approaches are designed for specific chemical classes or sub-problems and are not usable on a large scale. Further, the application range of these approaches is limited by the small amount of available labeled training data. We present the ready-to-use, stand-alone program deepFPlearn, which predicts associations between chemical structures and effects at the gene/pathway level using a combined deep learning approach. deepFPlearn uses a deep autoencoder for feature reduction before training a deep feed-forward neural network to predict the target association. We achieved good prediction quality and showed that our feature compression preserves relevant chemical structural information. Using a vast chemical inventory (unlabeled data) as input for the autoencoder did not reduce our prediction quality but allowed us to capture a much more comprehensive range of chemical structures. We predict meaningful, experimentally verified associations of chemicals and effects on unseen data. deepFPlearn classifies hundreds of thousands of chemicals in seconds. We provide deepFPlearn as an open-source and flexible tool that can easily be retrained and customized for different application settings at https://github.com/yigbt/deepFPlearn.

Keywords: Deep learning; autoencoder; binary fingerprint; molecular structures; toxicology.


Figures

Figure 1
The deepFPlearn workflow. (A) The molecular fingerprints serve as input for the neural networks. (B) An AE (autoencoder) is used to compress the fingerprints. (C) An FNN (feed-forward neural network) is used for direct classification of the input. (D) An FNN is used for classification of the compressed input. The sizes of the layers and the activation and loss functions differ for each network and depend on the input size; see the Methods section.
Figure 2
(A) ROC-AUC and loss values during training (calculated on the training and validation data after each epoch) of the specific (Sun et al. 2019) and the generic (CompTox) autoencoder. Training of the specific AE stopped early at 28 epochs, due to the small number of available training samples, and reached a validation loss of 0.026. Training of the generic AE stopped at ≈320 epochs, reaching a validation loss of 0.159. (B) UMAP visualizations of uncompressed and compressed representations of all compounds from the dataset; the color indicates cluster assignment of a k-means clustering on the uncompressed features.
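The early stopping described above (training halting once the validation loss stops improving) can be sketched in plain Python. The function name, the `patience` default, and the `min_delta` parameter are illustrative assumptions, not taken from the paper; deepFPlearn itself relies on Keras-style callbacks for this:

```python
def early_stop(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Stops once the validation loss has not improved by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None           # training ran to the end without stopping
```

With a loss curve that plateaus after epoch 3 and `patience=3`, the sketch stops at epoch 6; a strictly decreasing curve never triggers the stop.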
Figure 3
(A) Training histories of the feed-forward neural networks, stratified by the selected targets/models for the androgen (AR) and estrogen (ER) receptors and endocrine disruption (ED), and by the degree of feature compression (uncompressed, specific AE, and generic AE); the metrics shown are ROC-AUC (red) and loss (orange), calculated on the training (dotted) and validation (solid) data during training. (B) Comparison of balanced accuracy (Balanced ACC), area under the receiver operating characteristic curve (AUC), precision (PREC), recall (REC), F1 score (F1), specificity (SPEC), and Matthews correlation coefficient (MCC) of the individual models using no (light gray), the specific (medium gray), and the generic AE (dark gray). (C) MCC calculated for increasing thresholds from 0 to 1 on the predicted validation data. The threshold with maximum MCC was selected as the individual classification threshold for each model. Example generated for the AR model with uncompressed input.
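The threshold selection in panel (C), scanning candidate cutoffs and keeping the one that maximizes the MCC, can be sketched as follows. Function names and the grid of 101 thresholds are assumptions for illustration, not deepFPlearn's exact implementation:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(y_true, y_prob, steps=101):
    """Scan thresholds in [0, 1] and return (threshold, mcc) maximizing MCC."""
    best = (0.0, -1.0)
    for i in range(steps):
        t = i / (steps - 1)
        # Confusion-matrix counts at cutoff t (probability >= t predicts class 1)
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        score = mcc(tp, tn, fp, fn)
        if score > best[1]:
            best = (t, score)
    return best
```

For a perfectly separable toy example the scan returns a cutoff between the two classes with an MCC of 1.0.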
Figure 4
Receiver operating characteristic (left of both panels) and precision–recall (right of both panels) curves of a single fold of the AR target without feature compression (A) and with generic feature compression (B). The color indicates the value of the respective classification threshold. Supplemental Figure S2 depicts the standard deviations of the AUC for the five folds.
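The area under such a ROC curve has a useful rank interpretation: it equals the probability that a randomly chosen active compound is scored higher than a randomly chosen inactive one. A minimal sketch of that equivalence (function name is illustrative; libraries such as scikit-learn compute this via the trapezoidal rule instead):

```python
def roc_auc(y_true, y_prob):
    """ROC-AUC as the probability a random positive outranks a random negative.

    Ties count as half a win, matching the standard definition.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfect separation gives 1.0, identical scores give 0.5, and fully inverted ranking gives 0.0.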
Figure 5
(A) Values of all metrics calculated on the validation data for the benchmarking datasets SIDER and Tox21, summarized across all targets: balanced accuracy (Balanced ACC), area under the receiver operating characteristic curve (AUC), precision (PREC), recall (REC), F1 score (F1), specificity (SPEC), and MCC of the individual models using no (light gray), the specific (medium gray), and the generic AE (dark gray). (B) deepFPlearn prediction probabilities using the ED model with the generic AE on compounds that have been experimentally measured for quantified target association and differentiated into selectively and non-specifically acting compounds by Escher et al. [11]. Probability distributions are compared using the Kolmogorov–Smirnov test; the significance level for rejecting the null hypothesis that both distributions are similar is marked with * for P-values below 0.05. (C) Comparison of the counts of predicted 1 (active) and 0 (inactive) labels for the same compounds as in panel (B), shown for the ED model.
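The two-sample Kolmogorov–Smirnov statistic used in panel (B) is the largest vertical distance between the empirical CDFs of the two probability distributions. A minimal pure-Python sketch (function name is illustrative; the caption's actual test, including the P-value, corresponds to `scipy.stats.ks_2samp`):

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    def ecdf(sample, x):
        # Fraction of the sample at or below x
        return sum(1 for v in sample if v <= x) / len(sample)

    # The maximum gap can only occur at an observed value
    values = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in values)
```

Fully disjoint samples give a statistic of 1.0; identical samples give 0.0. In practice `scipy.stats.ks_2samp(a, b)` additionally returns the P-value used for the significance stars.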

References

    1. Classification on imbalanced data, 2022. URL https://www.tensorflow.org/tutorials/structured_data/imbalanced_data (18 April 2022, date last accessed). Online tutorial.
    2. Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016. Savannah, GA: USENIX Association, 2016, 265–83.
    3. Anderson MG, Mcdonnell J, Ximing C, et al. The Challenge of Micropollutants in Aquatic Systems. Science 2006;313:1072–7.
    4. Bento AP, Hersey A, Félix E, et al. An open source chemical structure curation pipeline using RDKit. J Cheminform 2020;12:9. doi: 10.1186/s13321-020-00456-1.
    5. Biewald L. Experiment tracking with Weights and Biases, 2020. URL https://www.wandb.com/ (19 April 2022, date last accessed). Software available from wandb.com.
