Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Dec 26;16(1):3823.
doi: 10.1038/s41598-025-33927-2.

Early prediction of alopecia areata using machine learning modeling of neuro stress immune signatures from multi datasets

Affiliations

Early prediction of alopecia areata using machine learning modeling of neuro stress immune signatures from multi datasets

Anxin Chen et al. Sci Rep. .

Abstract

Alopecia areata (AA) is an easy-recurring disease that presents huge challenges globally. An efficient clinical tool to predict AA onset would be valuable for timely intervention. We extracted six AA-related datasets from Gene Expression Omnibus (GEO). GO, KEGG, GSEA, GSVA and CIBERSORT algorithm were performed to elucidate the characteristics of AA. Feature genes were identified using LASSO regression and Random Forest algorithms. Five machine learning algorithms (Logistic Regression, K-nearest neighbors, Elastic Net, XGBoost and LightGBM) were employed to construct predictive models, with internal and external validation conducted to determine the optimal model. Additionally, SHapley Additive exPlanations (SHAP) analysis was applied to interpret the best-performing model and shiny framework was applied to establish an online predictive website. Five datasets (GSE45512, GSE68801, GSE80342, GSE58573, GSE74761) were integrated as train set and GSE148346 was defined as test set. Tissue regeneration and immune dysregulation were the key factors in AA pathogenesis. Three feature genes (KRT83, PPP1R1C, PIRT) were selected for model construction, with innate immune response, neural inflammatory and stress being a potential regulator for AA. The XGBoost model outperformed other algorithms, SHAP provided explanations for predictions and an online predictive website was established. Our study provides a potential "neuro-stress-immune" interplay insight into the pathogenesis of AA and establishes a clinically applicable predictive model for AA onset.

Keywords: Alopecia areata; Autoimmune diseases; Diagnosis; Machine learning; Predictive learning models.

PubMed Disclaimer

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Fig. 1
DEGs screening. (a) Volcano plot of DEGs in train set. Some of overlap labels of DEGs were not displayed. Feature genes were marked in red. (b) Heatmap of DEGs in train set (sorted by logFC and took the top 50 and bottom 50 DEGs. There were fewer than 50 up-regulated DEGs thus all up-regulated DEGs were displayed). (c) Volcano plot of DEGs in test set. (d)Heatmap of DEGs in test set. (e) Down-regulated DEGs shared by both train set and test set. (f) Up-regulated DEGs shared by both train set and test set.
Fig. 2
Fig. 2
Enrichment analysis for DEGs. (a) Bar plot of KEGG analysis of DEGs in train set. (b) Bub plot of KEGG analysis of DEGs in train set. (c) Bar plot of GO analysis of DEGs in train set. BP: biological processes, CC: cellular components, MF: molecular functions. (d) Bub plot of GO analysis of DEGs in train set. (e) Bar plot of GO analysis of DEGs in test set. (f) Bub plot of GO analysis of DEGs in test set.
Fig. 3
Fig. 3
Enrichment and immune filtrated analysis. (a) GSEA analysis for gene sets enriched in AA samples in train set. (b) GSEA analysis for gene sets enriched in Con samples in train set. (c) GSEA analysis for gene sets enriched in AA samples in test set. (d) GSEA analysis for gene sets enriched in Con samples in test set. (e) Bar plot showed distribution of immune cells in train set. (f) Correlation between immune cells in AA samples. (g) Correlation between immune cells in Con samples. (h) Violin plot presented differential fraction of immune cells between AA and Con samples.
Fig. 4
Fig. 4
Feature genes identification and immune correlation. (a) Trajectory plot of LASSO regression coefficients as a function of Log λ. (b) Cross-validation curve of the change in the cross-validation deviance in LASSO regression with respect to Log λ. (c) Error number of trees curve in Random Forest showed change of the error of the RF model with the number of decision trees. (d) Feature importance analysis plot for Random Forest measured the importance of each gene feature through Mean Decrease Gini. (e) Venn diagram of the intersecting three genes of LASSO and RF algorithm. (fh) ROC curve of classification performance of the feature genes (f: PPP1R1C, g: PIRT, h: KRT83). The area under the curve (AUC) was used to quantify their classification ability. (l-n) Correlation between feature genes (l: KRT83, m: PPP1R1C, n: PIRT) and immune cells. P < 0.05 was considered statistically significant and marked in red on the graph. (i) Violin plot of differential expression of feature genes. The difference between their expression in AA and Con samples were highly significant. (o-q) Scatter plots showing correlation between expression of feature genes (o: KRT83, p: PPP1R1C, q: PIRT) and immune cells (The sequence from left to right was: resting CD4 + memory T cells and neutrophils in the first line, activated dendritic cells, M1 macrophages and γδ T cells in the second line respectively). R: Spearman’s rank correlation coefficient.
Fig. 5
Fig. 5
WGCNA for train set. (a) The dendrogram shows the clustering of samples based on their euclidean distance, where a cut-off threshold (indicated by the red line, height = 147.81) is applied to identify and remove any potential outlier samples. (b, c) The optimal soft-thresholding power was determined by assessing the scale-free topology fit index (R2) and mean connectivity. The chosen power was 7 (where R2 exceeds 0.85 and mean connectivity achieved stable). (d) A hierarchical clustering tree was constructed and co-expressed genes were grouped into different color modules. (e) Correlation between modules and AA was calculated, model MEbrown had the largest absolute value of the correlation coefficient with AA. (f) GO enrichment was conducted on the core module genes, retaining only the top 20 significant terms (ordered by GeneRatio, q < 0.05).
Fig. 6
Fig. 6
Correlation between neural stress and immune in AA. (a) GSVA results of neural stress-related gene sets in train set. The statistically significant threshold t-value was −1.99 and 2.47. (b) GSVA results of neural stress-related gene sets in test set. The statistically significant threshold t-value was −2.02 and 2.06. (c) Heatmap of correlations between feature genes and NIR genes in train set. (d) Heatmap of correlations between feature genes and NIR genes in test set. The same genes in both train set and test set were matched by blue lines. (e) Heatmap of correlations between immune cells and NIR genes in train set. (f) Heatmap of correlations between immune cells and NIR genes in test set. R: Spearman’s rank correlation coefficient. *P < 0.05, **p < 0.01, ***p < 0.005.
Fig. 7
Fig. 7
Gene-drug interaction network. (a) Network of gene-drug interaction. The drugs surrounding each feature gene were exclusively associated with that particular gene. Drug nodes on the concentric circles were arranged from inner to outer rings in descending order of p-value (innermost layer: p < 0.01; middle layer: 0.01 < p < 0.05; outermost layer: p > 0.05). Drugs positioned in the intermediate region between two feature genes exhibited associations with both genes, whereas those in the central region were connected to all three feature genes. Among drugs in the central region, hub drugs (p < 0.05) were emphasized by enlargement. P < 0.05 was considered statistically significant. (b) Venn diagram quantified drugs targeting the feature genes. (c) Molecular formulas, 2D Structure images and 3D Conformers of hub drugs.
Fig. 8
Fig. 8
Comparison between models. (a) Internal validation of classification performance metrics (accuracy, sensitivity, specificity, precision, F1 score and G-mean), with 95% CI provided in parentheses after the values. (b) External validation of classification performance metrics. (c) Internal validation of ROC curves. AUC values and their 95% CI were shown in the figure legend in charts. Dashed line: levels of random guessing. (d) External validation of ROC curves. (e) Internal validation of DCA curves. DCA curves had not been smoothed to better compare the details. Slanted curve: “Treat All” curve. the net benefit when a certain treatment or intervention is applied to all individuals. Horizontal line: “Treat None” curve the net benefit when no treatment or intervention is applied to all individuals. Dashed line: the line connecting the intersection points of Slanted curve and Horizontal line. (f) External validation of DCA curves. (g) ROC curves and scatter plot of five-fold cross-validation of XGBoost model in train set. Each graph represented a fold. (h) Learning curves of models displayed their stability of performance on test set under different sample size of train set. 95% Cl: 95% Confidence Interval. All numbers in the tables and graphs were rounded to four decimal places using the round half up method. SD: standard deviation.
Fig. 9
Fig. 9
Interpretation of XGBoost model. (a) SHAP value importance plot of the overall contribution degree of each gene to the model’s prediction result. Each dot represented a sample. The order of importance of genes was arranged from top to bottom. (b) SHAP value dependence plot of the dependence relationship between expression levels of feature genes and the model’s prediction results. (c) SHAP value interaction plot of the interaction effect with each other. (d) Waterfall plot of the prediction result of GSM1682045, total SHAP value was negative, indicating this sample was predicted as a Con sample. The average expected predicted value of the model across all samples was E[f(x)] = 0.104. The predicted value of the assigned sample was denoted by f(x). (e) Force plot also showed the SHAP value of GSM1682045 was negative, namely a Con sample. (f) Waterfall plot of the prediction result of GSM2124826, total SHAP value was positive, indicating this sample was predicted as an AA sample. (g) Force plot also showed the SHAP value of GSM2124826 was positive, namely an AA sample. (h) The immune cell proportions were incorporate into the XGBoost model (keeping the specific parameters the same as the original model) and SHAP interaction plot was utilize to visualize the interaction between immune cells and feature genes.
Fig. 10
Fig. 10
Immunohistochemistry for expression of PPP1R1C in biopsy samples. (a) Quantitative analysis denoted downregulation of PPP1R1C in the scalp tissues of AA patients (n = 5) compared with normal people (Con, n = 5), as reflected by diminished AOD. (b) Representative images of AA and Con respectively (40X amplified). *P < 0.05 was considered statistically significant.

References

    1. Zhou, C., Li, X., Wang, C. & Zhang, J. Alopecia areata: An update on etiopathogenesis, diagnosis, and management. Clin Rev Allergy Immunol61, 403–423 (2021). - DOI - PubMed
    1. Lee, H. H. et al. Epidemiology of alopecia areata, ophiasis, totalis, and universalis: A systematic review and meta-analysis. J Am Acad Dermatol82, 675–682 (2020). - DOI - PubMed
    1. Okhovat, J.-P. et al. Association between alopecia areata, anxiety, and depression: A systematic review and meta-analysis. J Am Acad Dermatol88, 1040–1050 (2023). - DOI - PubMed
    1. Caldarola, G. et al. Assessing a measure for Quality of Life in patients with severe Alopecia Areata: a multicentric Italian study. Front Public Health12, 1415334 (2024). - DOI - PMC - PubMed
    1. Simakou, T., Butcher, J. P., Reid, S. & Henriquez, F. L. Alopecia areata: A multifactorial autoimmune condition. J Autoimmun98, 74–85 (2019). - DOI - PubMed

LinkOut - more resources