Environ Sci Technol. 2025 Jul 15;59(27):13844-13856.
doi: 10.1021/acs.est.5c04054. Epub 2025 Jun 27.

Machine Learning and Large Language Models for Modeling Complex Toxicity Pathways and Predicting Steroidogenesis

Thomas R Lane et al. Environ Sci Technol.

Abstract

High-throughput screening and computational models have been effective in predicting chemical interactions with estrogen and androgen receptors, but similar approaches for steroidogenesis remain limited. To address this gap, we developed general steroidogenesis modulation models using data from ∼1,800 chemicals screened in H295R human adrenocortical carcinoma cells. A random forest model was validated using a prospective test set of 20 compounds (14 predicted active, 6 inactive), achieving 80% accuracy with conformal prediction adjustments. In parallel, we built classification and regression models based on IC50 data from ChEMBL for key steroidogenic enzymes, including CYP17A1, CYP21A2, CYP11B1, CYP11B2, 17β-HSD (1/2/3/5), 5α-reductase (1/2), and CYP19A1 (126–9,327 compounds per target). These models enable predictions of both general steroidogenesis inhibition and potential molecular targets. Additionally, we developed a transformer-based model (MolBART) to predict all end points simultaneously and validated its performance. Combined, these models may offer a rapid and scalable system for assessing chemical impacts on steroidogenesis, supporting chemical risk assessment, product stewardship, and regulatory decision-making.

Keywords: MolBART; conformal predictors; endocrine disruption; large language models; machine learning; steroidogenesis.
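
The abstract mentions "conformal prediction adjustments" without describing the procedure. As a rough illustration only, the sketch below shows one common variant, a class-conditional (Mondrian) inductive conformal classifier built on probability-like scores. It is not taken from the paper: the calibration scores, the labels, the significance level `epsilon`, and all function names are hypothetical.

```python
# Minimal sketch of a class-conditional (Mondrian) inductive conformal classifier.
# ASSUMPTION: this is a generic textbook variant, not the authors' implementation.
# All numbers below are made-up calibration data, not results from the paper.

def nonconformity(prob_of_label):
    """Higher value = the label looks less plausible for this compound."""
    return 1.0 - prob_of_label

def calibrate(cal_probs_active, cal_labels):
    """Collect nonconformity scores of calibration compounds, grouped by true label."""
    scores = {0: [], 1: []}
    for p_active, label in zip(cal_probs_active, cal_labels):
        p_label = p_active if label == 1 else 1.0 - p_active
        scores[label].append(nonconformity(p_label))
    return scores

def p_values(scores, p_active):
    """Conformal p-value for each candidate label of a new compound."""
    out = {}
    for label in (0, 1):
        p_label = p_active if label == 1 else 1.0 - p_active
        alpha = nonconformity(p_label)
        n_at_least = sum(1 for s in scores[label] if s >= alpha)
        out[label] = (n_at_least + 1) / (len(scores[label]) + 1)
    return out

def prediction_set(scores, p_active, epsilon=0.2):
    """Labels that cannot be rejected at significance level epsilon."""
    return {label for label, p in p_values(scores, p_active).items() if p > epsilon}

# Hypothetical calibration set: model score for "active" and the true label.
cal_probs = [0.9, 0.8, 0.75, 0.6, 0.3, 0.2, 0.1, 0.4, 0.35, 0.15]
cal_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = calibrate(cal_probs, cal_labels)

confident_active = prediction_set(scores, 0.95)   # single-label set
uncertain = prediction_set(scores, 0.38)          # both labels retained
confident_inactive = prediction_set(scores, 0.05)
```

A single-label set can be read as a confident call at the chosen significance level, while a set containing both labels flags a compound the model cannot separate. How the authors turned such sets into the reported accuracy is not stated in the abstract.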

Conflict of interest statement

Competing interests:

SE is CEO and Founder of Collaborations Pharmaceuticals, Inc., and TRL, FU, and SHS are employees of this company. The other authors have no conflicts.

Figures

Figure 1.
A schematic representation of the steroidogenesis pathway as exemplified in H295R human adrenocortical carcinoma cells. The steroid names, structures, and the enzymes known to catalyze their conversion/interconversion are annotated.
Figure 2.
5-fold cross-validation metrics for classification machine learning models of steroidogenesis modulation in the H295R model. Training dataset information (dataset size, class distributions) is annotated in the blue and grey text boxes. (A) Performance by algorithm is given numerically. (B) Example truth tables, ROC plots, and histograms of probability-like scores are shown for the random forest and SVC models. In the histograms, red and blue bars represent the ground truth negative and positive classes, respectively. A probability-like score of 0.7 is annotated on the rf model histogram to highlight the accuracy of the positive class at this threshold. Abbreviations: deep learning (DL), AdaBoost decision trees (ada), Bernoulli naïve Bayes (bnb), Bayesian ridge regression (br), elastic net regression (enr), k-nearest neighbors (knn), support vector machine (svc), logistic regression (lr), XGBoost (xgb), and random forest (rf).
Figure 3.
Distributions of 5-fold cross-validation (CV) metrics and training set size/balance for all (A) classification and (B) regression models built for the steroidogenesis targets (Table 1). All classification models use a unified activity threshold of 100 nM. Balance is the fraction of the positive class. Abbreviations: area under the receiver operating characteristic curve (AUC), accuracy (ACC), recall, specificity (Spec), precision (Prec), F1 score (F1), Matthews correlation coefficient (MCC), Cohen’s kappa coefficient (Cohen’s κ), mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²).
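
To make the quantities in this caption concrete, the sketch below converts an IC50 in nM to the −logM scale, applies the 100 nM threshold, and computes the listed classification metrics from confusion-matrix counts. It is an illustration under assumptions: whether a compound exactly at 100 nM counts as active is not stated in the caption (the code assumes `<=`), and the function names and counts are hypothetical rather than values from the paper.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nM to -log10(molar); 100 nM corresponds to 7.0."""
    return 9.0 - math.log10(ic50_nm)

def is_active(ic50_nm, threshold_nm=100.0):
    """ASSUMPTION: compounds at or below the threshold are labeled active."""
    return ic50_nm <= threshold_nm

def _div(numerator, denominator):
    """Return NaN instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics computed from confusion-matrix counts."""
    n = tp + fp + tn + fn
    acc = _div(tp + tn, n)
    # Chance agreement used by Cohen's kappa.
    expected = _div((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "balance": _div(tp + fn, n),  # fraction of the positive class
        "ACC": acc,
        "Recall": _div(tp, tp + fn),
        "Spec": _div(tn, tn + fp),
        "Prec": _div(tp, tp + fp),
        "F1": _div(2 * tp, 2 * tp + fp + fn),
        "MCC": _div(tp * tn - fp * fn, mcc_denominator),
        "Cohen_kappa": _div(acc - expected, 1.0 - expected),
    }

# Hypothetical counts for a balanced 100-compound cross-validation fold.
metrics = classification_metrics(tp=40, fp=15, tn=35, fn=10)
```

AUC is omitted because it depends on the ranked scores rather than on a single confusion matrix.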
Figure 4.
5-fold cross-validation metrics for models of the inhibition (IC50) of two steroidogenesis targets: (A,B) steroid 5-α-reductase 2 and (C,D) 17β-HSD2. Training dataset information, such as dataset size, class distributions, or activity ranges, is annotated below each title. Performance by algorithm is given numerically, with example truth tables, histograms of probability-like scores, and plots of predicted vs actual activity (−logM) shown for random forest models. In the histograms, red and blue bars represent the ground truth negative and positive classes, respectively. Abbreviations: deep learning (DL), AdaBoost decision trees (ada), Bernoulli naïve Bayes (bnb), Bayesian ridge regression (br), elastic net regression (enr), k-nearest neighbors (knn), support vector machine (svc), logistic regression (lr), XGBoost (xgb), and random forest (rf).
Figure 5.
Heatmap of the fold-change over the 1% DMSO controls for the compounds predicted as either active (A) or inactive (B) in our random forest steroidogenesis modulation model.
Figure 6.
Inhibition prediction (IC50) of the modeled steroidogenesis targets for the example molecule pravastatin. The classification consensus is based on a majority rule across the 8 classification models (>4 in agreement; =4 is called active), and the regression consensus is the average predicted activity of the regression models (−logM).
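
The consensus rule in this caption can be sketched as follows. This is one reading of the caption, not the authors' code: it assumes each of the 8 classification models casts one active/inactive vote, that a 4-4 split is resolved as active, and that the regression consensus is a plain arithmetic mean. The votes and −logM values are invented for illustration and are not predictions for pravastatin.

```python
from statistics import mean

def classification_consensus(votes):
    """Majority rule over binary votes; a tie is called active (ASSUMED reading)."""
    n_active = sum(1 for v in votes if v)
    n_inactive = len(votes) - n_active
    return n_active >= n_inactive

def regression_consensus(predicted_neg_log_m):
    """Average of the per-model predicted activities on the -logM scale."""
    return mean(predicted_neg_log_m)

# Hypothetical votes from 8 classifiers for one target (1 = active).
tie_votes = [1, 1, 1, 1, 0, 0, 0, 0]
minority_votes = [1, 1, 1, 0, 0, 0, 0, 0]

tie_call = classification_consensus(tie_votes)            # tie resolved as active
minority_call = classification_consensus(minority_votes)  # 3 of 8 is inactive
avg_activity = regression_consensus([5.0, 6.0, 7.0])
```

Resolving ties toward active is a conservative choice for hazard screening, since it favors flagging a compound over missing it.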
Figure 7.
t-SNE plot of steroidogenesis training datasets and multiple industry-relevant products. (A) Points colored as either the high-throughput screen (HTS) for steroidogenesis or by steroidogenesis-specific target. (B) All steroidogenesis training data are labeled as Primary, with additional labeled datasets of interest to various industries as defined by the EPA CompTox dashboard. Both t-SNE plots use the same coordinates for all steroidogenesis training data to show dataset overlap.

