Front Chem. 2018 Aug 28;6:362.
doi: 10.3389/fchem.2018.00362. eCollection 2018.

Prediction Is a Balancing Act: Importance of Sampling Methods to Balance Sensitivity and Specificity of Predictive Models Based on Imbalanced Chemical Data Sets

Priyanka Banerjee et al. Front Chem.

Abstract

The number of new chemicals synthesized in past decades has increased, resulting in constant growth in the development and application of computational models for predicting the activity and safety profiles of chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data, and constructing a classifier from an imbalanced data set is a challenge. In this study, we analyzed and validated the importance of different sampling methods, compared with using no sampling, for achieving well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00%, and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of drug-induced liver injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
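To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It shows how sensitivity and specificity are computed from binary predictions, and how a SMOTE-style step creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The function names (`sensitivity_specificity`, `smote_like_oversample`), the toy continuous features, and the parameter `k` are assumptions made for illustration; the study itself used molecular fingerprints and a Random Forest classifier, which are not reproduced here.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = active."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Sensitivity: fraction of true actives that were predicted active.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true inactives that were predicted inactive.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Hypothetical helper: create n_new synthetic minority points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours
    (Euclidean distance). Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining minority samples by squared distance to base.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Toy imbalanced labels: 3 actives, 7 inactives.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    print(sensitivity_specificity(y_true, y_pred))

    # Toy minority class with two continuous features per sample.
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(smote_like_oversample(minority, n_new=4, k=2))
```

In the toy labels, a classifier that mostly predicts the majority class reaches high specificity but low sensitivity, which is the gap that the sampling methods in the study aim to reduce. Note that interpolation yields continuous values, so applying it to binary fingerprints would need an additional design choice (for example thresholding) that is not specified in the abstract.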

Keywords: DILI; SMOTE; Tox21; imbalanced data; machine learning; molecular fingerprints; sampling methods; sensitivity-specificity balance.

Figures

Figure 1
Schematic representation showing the design of maximum common feature (MCF) fingerprints using features derived from MACCS fingerprints.
Figure 2
Performance measures for cross-validation of the AhR (A), ER-LBD (B), and HSE (C) models based on the Random Forest classifier and MACCS fingerprints.
Figure 3
Performance measures for external validation of the AhR (A), ER-LBD (B), and HSE (C) models based on the Random Forest classifier and MACCS fingerprints.
Figure 4
Performance measures for cross-validation and external validation of the DILI model based on the Random Forest classifier and MACCS fingerprints.
Figure 5
Chemical Space Networks (CSNs) of the actives of the external test set (triangles) and the actives of the training set (squares). The CSNs reveal that mexiletine (an external test active), which is similar to phenoxypropazine (a training set active), is incorrectly predicted as inactive by all the sampling methods. Similarly, fipexide and sunitinib (external set actives) are incorrectly predicted by all the sampling methods.
Figure 6
Chemical Space Networks (CSNs) of the actives of the external test set (triangles) and the inactives of the training set (circles). The CSNs reveal that mexiletine (an external test active), which is similar to isoetharine (a training set inactive), is incorrectly predicted as inactive (false negative) by all the sampling methods. Similarly, fipexide and sunitinib (external set actives) are incorrectly predicted as inactive by all the sampling methods.
