Front Artif Intell. 2021 Nov 18:4:757780.
doi: 10.3389/frai.2021.757780. eCollection 2021.

DeepCarc: Deep Learning-Powered Carcinogenicity Prediction Using Model-Level Representation

Ting Li et al. Front Artif Intell. 2021.

Erratum in

Abstract

Carcinogenicity testing plays an essential role in identifying carcinogens in environmental chemistry and drug development. However, evaluating carcinogenic potency with conventional 2-year rodent studies is a time-consuming and labor-intensive process. Thus, there is an urgent need for alternative approaches that provide reliable and robust assessments of carcinogenicity. In this study, we proposed the DeepCarc model to predict carcinogenicity for small molecules using deep learning-based model-level representations. The DeepCarc model was developed using a data set of 692 compounds and evaluated on a test set containing 171 compounds from the National Center for Toxicological Research liver cancer database (NCTRlcdb). The proposed DeepCarc model yielded a Matthews correlation coefficient (MCC) of 0.432 on the test set, outperforming four advanced deep learning (DL)-powered quantitative structure-activity relationship (QSAR) models with an average improvement rate of 37%. Furthermore, the DeepCarc model was employed to screen the carcinogenicity potential of compounds from both DrugBank and Tox21. Altogether, the proposed DeepCarc model (https://github.com/TingLi2016/DeepCarc) could serve as an early detection tool for carcinogenicity assessment.
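For readers unfamiliar with the headline metric, the following is a minimal sketch of how a Matthews correlation coefficient is computed from binary confusion-matrix counts. The function name `mcc_from_counts` and the toy counts are illustrative assumptions; they are not taken from the DeepCarc code or results.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    assumed here for the degenerate case).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy counts for illustration only (not the paper's confusion matrix).
    print(round(mcc_from_counts(tp=6, tn=3, fp=1, fn=2), 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it a reasonable summary metric when the two classes are imbalanced.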

Keywords: NCTRlcdb; QSAR; carcinogenicity; deep learning; non-animal models.

Conflict of interest statement

RR is co-founder and co-director of ApconiX, an integrated toxicology and ion channel company that provides expert advice on non-clinical aspects of drug discovery and drug development to academia, industry, and not-for-profit organizations. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Overall workflow for the DeepCarc model: (1) Data preparation. The 863 compounds were split into training (554 compounds), development (138 compounds), and test (171 compounds) sets using the Kennard-Stone algorithm. (2) Base classifier development. Five algorithms were used to develop base classifiers from three chemical representations: Mol2vec, Mold2, and MACCS. Two base classifier selection strategies were employed to select the optimized classifiers for meta classifier development. (3) Meta classifier development. With three chemical representations and two selection methods, six groups of base classifiers were used: Mol2vec_supervised, Mol2vec_original, Mold2_supervised, Mold2_original, MACCS_supervised, and MACCS_original. The probability predictions from the selected base classifiers were used to train the neural network. (4) Model evaluation. The DeepCarc model was evaluated on the independent test set.
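The data-preparation step names the Kennard-Stone algorithm. As a rough illustration of the general technique, the sketch below implements the standard max-min selection on toy points. The function name `kennard_stone`, the use of Euclidean distance, and the toy coordinates are assumptions for illustration; the caption does not state which distance or feature space the authors used.

```python
import math
from typing import List, Sequence


def kennard_stone(points: Sequence[Sequence[float]], n_select: int) -> List[int]:
    """Return indices of n_select points chosen by Kennard-Stone max-min selection."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    # Full pairwise Euclidean distance matrix (assumed metric).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed the selection with the two mutually most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # For each candidate, track the distance to its nearest selected point.
    nearest = {k: min(dist[k][first], dist[k][second]) for k in remaining}
    while len(selected) < n_select:
        # Pick the candidate farthest from the selected set (lowest index on ties).
        chosen = max(remaining, key=lambda k: (nearest[k], -k))
        selected.append(chosen)
        remaining.remove(chosen)
        del nearest[chosen]
        for k in remaining:
            nearest[k] = min(nearest[k], dist[k][chosen])
    return selected


if __name__ == "__main__":
    toy = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
    train_idx = kennard_stone(toy, 3)
    held_out = [i for i in range(len(toy)) if i not in train_idx]
    print(train_idx, held_out)
```

The selected indices cover the extremes and the middle of the toy space, and the unselected points form the held-out set. This is the usual motivation for the method: the selected subset spans the descriptor space, so held-out compounds tend to fall within its range.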
FIGURE 2
Distribution of the pairwise Tanimoto coefficients calculated from Mol2vec, Mold2, and MACCS. Pink and green indicate coefficients calculated from the carcinogenic and noncarcinogenic molecules, respectively.
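As a reminder of what a pairwise Tanimoto coefficient measures, here is a minimal sketch with two common forms: the set-based form for binary fingerprints such as MACCS keys, and the generalized form for real-valued vectors. The function names, the toy fingerprints, and the choice to return 1.0 for two empty inputs are assumptions of this sketch; the caption does not say which variant the authors applied to each representation.

```python
from itertools import combinations
from typing import Iterable, List, Sequence


def tanimoto_bits(a: Iterable[int], b: Iterable[int]) -> float:
    """Tanimoto coefficient for binary fingerprints given as collections of on-bit indices."""
    set_a, set_b = set(a), set(b)
    union = len(set_a | set_b)
    # Two empty fingerprints are treated as identical (a convention chosen here).
    return len(set_a & set_b) / union if union else 1.0


def tanimoto_continuous(x: Sequence[float], y: Sequence[float]) -> float:
    """Generalized Tanimoto coefficient for real-valued vectors of equal length."""
    dot = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return dot / denom if denom else 1.0


def pairwise_tanimoto(fingerprints: Sequence[Iterable[int]]) -> List[float]:
    """All pairwise binary Tanimoto coefficients within one group of molecules."""
    return [tanimoto_bits(a, b) for a, b in combinations(fingerprints, 2)]


if __name__ == "__main__":
    # Hypothetical on-bit sets, not real MACCS keys.
    group = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
    print(pairwise_tanimoto(group))
```

For 0/1 vectors the two functions agree, so the continuous form can be read as an extension of the set-based one.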
FIGURE 3
Performance of the DeepCarc models developed with the proposed supervised base classifier selection strategy for each of the three chemical representations (Mol2vec, Mold2, and MACCS). (A) Seven performance metrics; (B) area under the ROC curve.
FIGURE 4
Performance of the ensemble models on the test set.
FIGURE 5
Probability distribution of the DeepCarc predictions for compounds from (A) DrugBank and (B) Tox21.
