Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets

Priyanka Banerjee¹, Frederic O Dehnbostel¹, Robert Preissner¹

PMID: 30271769
PMCID: PMC6149243
DOI: 10.3389/fchem.2018.00362

Priyanka Banerjee et al. Front Chem. 2018.

Front Chem. 2018 Aug 28;6:362.

doi: 10.3389/fchem.2018.00362. eCollection 2018.

Affiliation

¹ Structural Bioinformatics Group, Institute for Physiology, Charité - University Medicine Berlin, Berlin, Germany.

Abstract

The increase in the number of new chemicals synthesized in past decades has driven steady growth in the development and application of computational models that predict the activity and safety profiles of chemicals. Most of the time, such models and their applications must deal with imbalanced chemical data, and constructing a classifier from an imbalanced data set is a challenge. In this study, we analyzed and validated the importance of different sampling methods, compared with no sampling, for achieving well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.

Keywords: DILI; SMOTE; Tox21; imbalanced data; machine learning; molecular fingerprints; sampling methods; sensitivity-specificity balance.
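
The abstract's two central ideas, the gap between sensitivity and specificity on imbalanced data and SMOTE-style oversampling of the minority class, can be illustrated with a small sketch. This is not the authors' pipeline: the function names `smote_sketch` and `sensitivity_specificity`, the toy two-dimensional descriptors, and the 5:95 class ratio are all assumptions made for illustration. The interpolation step follows the general SMOTE idea of placing synthetic points between a minority sample and one of its nearest minority neighbours. Note that interpolation produces non-binary values, so applying it to binary fingerprints would need additional care.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples (illustrative SMOTE-style sketch).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 = active."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical imbalanced set: 5 actives versus 95 inactives.
y_true = [1] * 5 + [0] * 95
# A classifier that always predicts "inactive" looks accurate but finds no actives.
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)

# Oversample the minority class until it matches the majority class size.
minority = [[1.0, 0.0], [0.9, 0.2], [0.8, 0.1], [1.0, 0.3], [0.7, 0.0]]
synthetic = smote_sketch(minority, n_new=95 - len(minority))
balanced_minority = minority + synthetic
```

In this toy case the always-inactive classifier reaches high accuracy while its sensitivity is zero, which is the kind of gap that sampling methods aim to reduce before training.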

Figures

**Figure 1**
Schematic representation showing the design of maximum common feature (MCF) fingerprints using features derived from MACCS fingerprints.

**Figure 2**
Performance measures for cross-validation: AhR **(A)**, ER-LBD **(B)**, and HSE **(C)** models based on Random Forest Classifier and MACCS fingerprints.

**Figure 3**
Performance measures for external validation: AhR **(A)**, ER-LBD **(B)**, and HSE **(C)** models based on Random Forest Classifier and MACCS fingerprints.

**Figure 4**
Performance measures for cross-validation and external validation for the DILI model based on Random Forest Classifier and MACCS fingerprints.

**Figure 5**
Chemical Space Networks (CSNs) of the actives of the external test set (triangles) and the actives of the training set (squares). The CSNs reveal that mexiletine (an external test active), which is similar to phenoxypropazine (a training set active), is incorrectly predicted as inactive by all the sampling methods. Similarly, fipexide and sunitinib (external set actives) are incorrectly predicted by all the sampling methods.

**Figure 6**
Chemical Space Networks (CSNs) of the actives of the external test set (triangles) and the inactives of the training set (circles). The CSNs reveal that mexiletine (an external test active), which is similar to isoetharine (a training set inactive), is incorrectly predicted as inactive (false negative) by all the sampling methods. Similarly, fipexide and sunitinib (external set actives) are incorrectly predicted as inactive by all the sampling methods.
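
The chemical space networks in Figures 5 and 6 connect compounds by structural similarity. The captions do not state which similarity measure or cutoff was used, so the sketch below assumes the Tanimoto coefficient on fingerprint bit sets and an arbitrary threshold of 0.5. The compound names and bit positions are hypothetical placeholders, not data from the study.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similarity_edges(fingerprints, threshold):
    """List (name_1, name_2, similarity) for every pair at or above threshold."""
    names = sorted(fingerprints)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = tanimoto(fingerprints[first], fingerprints[second])
            if score >= threshold:
                edges.append((first, second, score))
    return edges


# Hypothetical fingerprints: each set holds the indices of bits that are on.
fingerprints = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {2, 3, 4, 5},
    "cmpd_C": {10, 11},
}
edges = similarity_edges(fingerprints, threshold=0.5)
```

Here only `cmpd_A` and `cmpd_B` share enough bits to be linked. A network built this way shows how an external compound can sit close to both active and inactive training compounds, which is the situation the captions describe for the misclassified actives.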
