Machine Learning and Large Language Models for Modeling Complex Toxicity Pathways and Predicting Steroidogenesis

Thomas R Lane¹, Patricia A Vignaux¹, Joshua S Harris¹, Scott H Snyder¹, Fabio Urbina¹, Sean Ekins¹

PMID: 40576990
PMCID: PMC12486300
DOI: 10.1021/acs.est.5c04054

Thomas R Lane et al. Environ Sci Technol. 2025 Jul 15;59(27):13844-13856.

doi: 10.1021/acs.est.5c04054. Epub 2025 Jun 27.

Affiliation

¹ Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States of America.

Abstract

High-throughput screening and computational models have been effective in predicting chemical interactions with estrogen and androgen receptors, but similar approaches for steroidogenesis remain limited. To address this gap, we developed general steroidogenesis modulation models using data from ∼1,800 chemicals screened in H295R human adrenocortical carcinoma cells. A random forest model was validated using a prospective test set of 20 compounds (14 predicted active, 6 predicted inactive), achieving 80% accuracy with conformal prediction adjustments. In parallel, we built classification and regression models based on IC₅₀ data from ChEMBL for key steroidogenic enzymes, including CYP17A1, CYP21A2, CYP11B1, CYP11B2, 17β-HSD (1/2/3/5), 5α-reductase (1/2), and CYP19A1 (126–9,327 compounds per target). These models enable predictions of both general steroidogenesis inhibition and potential molecular targets. Additionally, we developed a transformer-based model (MolBART) to predict all end points simultaneously and validated its performance. Combined, these models may offer a rapid and scalable system for assessing chemical impacts on steroidogenesis, supporting chemical risk assessment, product stewardship, and regulatory decision-making.
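The abstract does not say which conformal prediction scheme was used, so the following is only a minimal sketch of one common variant: class-conditional (Mondrian) inductive conformal prediction for a binary active/inactive model. The function names, the calibration data, and the choice of nonconformity score (one minus the probability assigned to the class) are all assumptions made for illustration, not the authors' implementation.

```python
def class_p_value(cal_scores, new_score):
    """Conformal p-value: share of calibration nonconformity scores that are
    at least as large as the new compound's score (with the +1 correction)."""
    n_ge = sum(1 for s in cal_scores if s >= new_score)
    return (n_ge + 1) / (len(cal_scores) + 1)


def conformal_set(cal_probs, cal_labels, new_prob, epsilon=0.2):
    """Hypothetical helper: Mondrian inductive conformal prediction.

    cal_probs  -- predicted probability of 'active' for held-out calibration compounds
    cal_labels -- true labels for those compounds (1 = active, 0 = inactive)
    new_prob   -- predicted probability of 'active' for the new compound
    epsilon    -- significance level (assumed value; 0.2 targets ~80% validity)
    """
    p_values = {}
    for label in (0, 1):
        # Assumed nonconformity: 1 - probability assigned to the class in question.
        cal = [(1 - p) if label == 1 else p
               for p, y in zip(cal_probs, cal_labels) if y == label]
        new = (1 - new_prob) if label == 1 else new_prob
        p_values[label] = class_p_value(cal, new)
    # Keep every label whose p-value exceeds the significance level.
    labels = {label for label, pv in p_values.items() if pv > epsilon}
    return labels, p_values


# Made-up calibration scores, for illustration only.
cal_probs = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.4]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0]
labels, p_values = conformal_set(cal_probs, cal_labels, new_prob=0.85)
```

A prediction set with a single label is a confident call, a set with both labels means the model cannot separate the classes at that significance level, and an empty set flags a compound unlike the calibration data.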

Keywords: MolBART; conformal predictors; endocrine disruption; large language models; machine learning; steroidogenesis.

Conflict of interest statement

Competing interests:

SE is CEO and Founder at Collaborations Pharmaceuticals, Inc., while TRL, FU, and SHS are employees of this company. The other authors have no conflicts.

Figures

**Figure 1.**
A schematic representation of the steroidogenesis pathway as exemplified in H295R human adrenocortical carcinoma cells. The steroid names, structures, and the enzymes known to catalyze their conversion/interconversion are annotated.

**Figure 2.**
5-fold cross-validation metrics for classification machine learning models of steroidogenesis modulation in the H295R model. Training dataset information (dataset size, class distribution) is annotated in the blue and grey text boxes. (A) Performance by algorithm is given numerically. (B) Example truth tables, ROC plots, and histograms of probability-like scores are shown for the random forest and SVC models. In the histograms, red and blue bars represent the ground-truth negative and positive classes, respectively. A probability-like score of 0.7 is annotated on the rf model histogram to highlight the accuracy of the positive class at this threshold. Abbreviations: deep learning (DL), AdaBoost decision trees (ada), Bernoulli naïve Bayes (bnb), Bayesian ridge regression (br), elastic net regression (enr), k-nearest neighbors (knn), support vector machine (svc), logistic regression (lr), XGBoost (xgb), and random forest (rf).

**Figure 3.**
Distributions of 5-fold cross-validation (CV) metrics and training set size/balance for all (A) classification and (B) regression models built for the steroidogenesis targets (Table 1). All classification models use a unified activity threshold of 100 nM. Balance is the fraction of compounds in the positive class. Abbreviations: area under the receiver operating characteristic curve (AUC), accuracy (ACC), recall, specificity (Spec), precision (Prec), F1 score (F1), Matthews correlation coefficient (MCC), Cohen’s kappa coefficient (Cohen’s κ), mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²).
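To make the 100 nM threshold and two of the less familiar metrics concrete, here is a small sketch. It assumes that "active" means an IC₅₀ at or below the cutoff (the caption does not say whether the boundary is inclusive), and the helper names and confusion-matrix counts are made up for illustration. The −logM conversion matches the activity units used in the regression plots.

```python
import math

THRESHOLD_NM = 100.0  # unified classification cutoff stated in the caption


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nM to -log10(molar), i.e. 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)


def is_active(ic50_nm, threshold_nm=THRESHOLD_NM):
    # Assumption: the boundary is inclusive (IC50 <= 100 nM counts as active).
    return ic50_nm <= threshold_nm


def balance(labels):
    """Fraction of the positive class in a list of booleans."""
    return sum(labels) / len(labels)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Made-up IC50 values (nM) and counts, for illustration only.
ic50_values = [12.0, 100.0, 850.0, 4300.0]
labels = [is_active(v) for v in ic50_values]
```

On this scale the 100 nM cutoff corresponds to a −logM value of 7, so a regression prediction above 7 and a classification call of "active" describe the same side of the threshold.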

**Figure 4.**
5-fold cross-validation metrics for models of the inhibition (IC₅₀) of two steroidogenesis targets: (A,B) steroid 5-α-reductase 2 and (C,D) 17β-HSD2. Training dataset information, such as dataset size, class distribution, or activity range, is annotated below each title. Performance by algorithm is given numerically, with example truth tables, histograms of probability-like scores, and plots of predicted vs actual activity (−logM) shown for the random forest models. In the histograms, red and blue bars represent the ground-truth negative and positive classes, respectively. Abbreviations: deep learning (DL), AdaBoost decision trees (ada), Bernoulli naïve Bayes (bnb), Bayesian ridge regression (br), elastic net regression (enr), k-nearest neighbors (knn), support vector machine (svc), logistic regression (lr), XGBoost (xgb), and random forest (rf).

**Figure 5.**
Heatmap of the fold-change over the 1% DMSO controls for the compounds predicted as either active (A) or inactive (B) in our random forest steroidogenesis modulation model.

**Figure 6.**
Inhibition prediction (IC₅₀) of the modeled steroidogenesis targets for the example molecule pravastatin. The classification consensus is based on a majority rule across the 8 classification models (>4 in agreement; a tie of 4 is classed as active), and the regression consensus is the average predicted activity of the regression models (−logM).
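The consensus rule in this caption can be sketched in a few lines. This is an illustrative reading of the caption, assuming each classifier contributes one active/inactive vote and that a 4-to-4 tie resolves to active; the function names and example values are hypothetical.

```python
def classification_consensus(votes):
    """Majority vote over classifier calls (True = active).

    A tie (e.g. 4 of 8) is resolved as active, following the caption.
    """
    n_active = sum(votes)
    return 2 * n_active >= len(votes)


def regression_consensus(predictions):
    """Average of the regression models' predicted activities (-logM)."""
    return sum(predictions) / len(predictions)


# Made-up votes from 8 classifiers and made-up regression outputs.
votes = [True, True, True, True, False, False, False, False]
consensus_active = classification_consensus(votes)
mean_activity = regression_consensus([5.0, 6.0, 7.0])
```

Resolving ties toward active is the more conservative choice for hazard screening, since it favors flagging a compound for follow-up over dismissing it.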

**Figure 7.**
t-SNE plot of steroidogenesis training datasets and multiple industry-relevant products. (A) Colored as either high-throughput screen (HTS) for steroidogenesis or by steroidogenesis-specific target. (B) All steroidogenesis training data are labeled as Primary, with additional labeled datasets of interest to various industries as defined by the EPA CompTox dashboard. The t-SNE plots use the same coordinates for all steroidogenesis training data to show dataset overlap.
