Environ Sci Technol. 2023 Apr 25;57(16):6573-6588.
doi: 10.1021/acs.est.3c00648. Epub 2023 Apr 11.

Data-Driven Quantitative Structure-Activity Relationship Modeling for Human Carcinogenicity by Chronic Oral Exposure

Elena Chung et al. Environ Sci Technol.

Abstract

Traditional methodologies for assessing chemical toxicity are expensive and time-consuming. Computational modeling approaches have emerged as low-cost alternatives, especially those used to develop quantitative structure-activity relationship (QSAR) models. However, conventional QSAR models have limited training data, leading to low predictivity for new compounds. We developed a data-driven modeling approach for constructing carcinogenicity-related models and used these models to identify potential new human carcinogens. To this end, we used a probe carcinogen dataset from the US Environmental Protection Agency's Integrated Risk Information System (IRIS) to identify relevant PubChem bioassays. Responses of 25 PubChem assays were significantly relevant to carcinogenicity. Eight of these assays showed predictive value for carcinogenicity and were selected for QSAR model training. Using 5 machine learning algorithms and 3 types of chemical fingerprints, 15 QSAR models were developed for each PubChem assay dataset. These models showed acceptable predictivity during 5-fold cross-validation (average correct classification rate, CCR = 0.71). Using our QSAR models, we predicted and ranked the carcinogenic potential of 342 IRIS compounds (positive predictive value, PPV = 0.72). The models also predicted potential new carcinogens, which were validated by a literature search. This study portends an automated technique that can be applied to prioritize potential toxicants using validated QSAR models based on extensive training sets from public data resources.
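
The evaluation statistics quoted in the abstract (sensitivity, specificity, CCR, PPV) can all be derived from a confusion matrix. The following is a minimal sketch, not the authors' code. It assumes binary labels (1 = active or carcinogen, 0 = inactive or noncarcinogen) and the commonly used definition of CCR as the mean of sensitivity and specificity; the paper's own equations are not reproduced in this record. The function name and the example labels are hypothetical.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, CCR, and PPV for binary labels.

    Assumption: 1 marks the positive class, 0 the negative class.
    Assumption: CCR = (sensitivity + specificity) / 2.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else math.nan

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ccr": (sensitivity + specificity) / 2,
        "ppv": ratio(tp, tp + fp),
    }


if __name__ == "__main__":
    # Hypothetical labels for eight compounds, for illustration only.
    observed = [1, 1, 1, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 1, 0, 0, 0, 0]
    for name, value in classification_metrics(observed, predicted).items():
        print(f"{name}: {value:.3f}")
```

Because CCR averages the two class-wise rates, it is less sensitive to class imbalance than plain accuracy, which matters for datasets such as the IRIS set, where noncarcinogens outnumber carcinogens.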

Keywords: big data; carcinogens; data mining; machine learning; models; quantitative structure−activity relationships.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Schematic QSAR modeling workflow used in this study to model the carcinogenic potential of chemicals. The workflow consists of three steps: (1) training set generation by automatic bioprofiling, (2) QSAR model development, and (3) virtual screening. Created with BioRender.com.
Figure 2
Bioprofile of 342 IRIS compounds consisting of 1971 PubChem bioassays. The heat map with hierarchical clustering aggregates testing results from each bioassay as “active” (red), “inactive” (blue), and “inconclusive/untested” (gray).
Figure 3
Performance of the consensus QSAR models developed for eight PubChem bioassays. Each consensus QSAR model was constructed by averaging the 5-fold cross-validation prediction values from the individual models. The statistical evaluation includes the sensitivity (eq 1), specificity (eq 2), CCR (eq 3), and PPV (eq 4).
Figure 4
Ranking probe compounds in the IRIS dataset using carcinogenicity probability (eq 5) results. Red triangles represent human carcinogens associated with oral exposures (N = 59), and blue circles represent noncarcinogens (N = 283). The red dotted lines represent the applicability domain (AD) cut-offs. The gray dashed line represents the default threshold value of 0.5 to classify carcinogens/noncarcinogens based on the prediction values.
Figure 5
Distribution of compounds for five external screening datasets by the proportions of active predictions to total results from QSAR models (top) and the proportions of potential carcinogens to the total number of compounds (bottom).
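
The captions for Figures 3 and 4 describe two simple operations: averaging prediction values across individual models to form a consensus, and applying a default threshold of 0.5 to separate predicted carcinogens from noncarcinogens. The following is a minimal sketch of both, assuming each individual model outputs a score between 0 and 1 for every compound. It is not the paper's carcinogenicity probability (eq 5), which is not given in this record. The function name and the scores are hypothetical, and treating a score exactly equal to the threshold as positive is also an assumption.

```python
def consensus_predictions(model_scores, threshold=0.5):
    """Average per-compound scores across models and classify the result.

    model_scores: list of lists, one inner list per model, each holding
    one score in [0, 1] per compound (same compound order in every model).
    Returns (consensus_scores, labels), with label 1 for scores at or
    above the threshold (boundary handling is an assumption).
    """
    if not model_scores:
        raise ValueError("at least one model is required")
    n_compounds = len(model_scores[0])
    if any(len(scores) != n_compounds for scores in model_scores):
        raise ValueError("every model must score the same compounds")

    n_models = len(model_scores)
    consensus = [
        sum(scores[i] for scores in model_scores) / n_models
        for i in range(n_compounds)
    ]
    labels = [1 if score >= threshold else 0 for score in consensus]
    return consensus, labels


if __name__ == "__main__":
    # Hypothetical scores from three models for three compounds.
    scores = [
        [0.9, 0.2, 0.6],
        [0.7, 0.4, 0.3],
        [0.8, 0.1, 0.7],
    ]
    consensus, labels = consensus_predictions(scores)
    for value, label in zip(consensus, labels):
        print(f"{value:.3f} -> {label}")
```

Averaging scores rather than taking a majority vote keeps a continuous value per compound, which is what allows compounds to be ranked as well as classified.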
