Int Dent J. 2025 Nov 19;76(1):104028.
doi: 10.1016/j.identj.2025.104028. Online ahead of print.

Machine Learning-Based Transcriptomic Diagnosis of Periodontitis

Ya'nan Cheng et al. Int Dent J.

Abstract

Background: Periodontitis, a prevalent chronic inflammatory disease, remains a global health challenge, and conventional diagnostic methods are hindered by subjectivity and low sensitivity. This study aimed to develop a machine learning (ML)-based diagnostic framework using transcriptomic data to enhance diagnostic accuracy and efficiency.

Methods: Transcriptomic datasets from 616 samples (452 periodontitis, 164 healthy controls) were retrieved from the Gene Expression Omnibus (GEO). Differentially expressed genes (DEGs) were identified, and functional enrichment, weighted gene co-expression network analysis (WGCNA), and immune infiltration profiling were performed. Key biomarkers were refined using the Boruta and Least Absolute Shrinkage and Selection Operator (LASSO) algorithms. Six independent ML models were constructed and validated. A nomogram for risk prediction was created, and transcription factor networks and drug-target interactions were analysed.
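The LASSO step named in the Methods can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not state which software was used, and the gene names (`gene_a`, `gene_b`, `gene_c`), the expression values, and the penalty below are all invented for illustration. The sketch minimises a squared-error loss plus an L1 penalty by cyclic coordinate descent, which drives weak coefficients to exactly zero and so acts as a feature selector.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything inside [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=50):
    """Minimise (1/2n) * sum((y - Xw)^2) + penalty * sum(|w|) by cyclic updates."""
    n = len(rows)
    p = len(rows[0])
    weights = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the residual excluding j
            norm = 0.0  # squared norm of feature j
            for x, y in zip(rows, targets):
                partial = sum(weights[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (y - partial)
                norm += x[j] * x[j]
            weights[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return weights


# Hypothetical toy data: centred expression of three made-up genes in four samples.
gene_names = ["gene_a", "gene_b", "gene_c"]
rows = [
    [1, 1, 1],
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]
# Outcome built as 2 * gene_a + 0.1 * gene_c, so gene_b carries no signal.
targets = [2.1, 1.9, -2.1, -1.9]

weights = lasso_coordinate_descent(rows, targets, penalty=0.5)
selected = [name for name, w in zip(gene_names, weights) if w != 0.0]
print(selected, weights)
```

With this penalty only `gene_a` keeps a non-zero coefficient (about 1.5, shrunk from 2); the weak `gene_c` effect and the uninformative `gene_b` are set to zero. In practice the penalty would be chosen by cross-validation rather than fixed by hand.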

Results: Five diagnostic biomarkers (CSF2RB, COL15A1, MME, NEFM, CYP24A1) were identified, with robust performance across datasets. The Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) models achieved perfect classification in training and high accuracy in external validation. Immune infiltration analysis revealed significant correlations between biomarkers and immune cell populations (eg, dendritic cells, T cells). Transcription factor networks highlighted NFYA and SP1 as central regulators. Drug prediction identified re-purposable candidates with validated molecular docking affinity.

Conclusion: This study establishes an ML-driven diagnostic framework for periodontitis, integrating transcriptomic, immune, and regulatory network insights. These gene biomarkers may provide novel insight into periodontitis pathogenesis, while our diagnostic models show potential for clinical utility in personalised diagnosis, targeted intervention, and therapeutic development.

Plain language summary: Periodontitis is a common, serious condition often diagnosed too late using traditional methods that can be subjective. To improve detection, we developed a machine learning (ML) tool that analyses genetic activity in gum tissue. Using data from 616 patient samples, we identified five key genes (CSF2RB, COL15A1, MME, NEFM, CYP24A1) that act as biological 'flags' for gum disease. These genes are linked to immune responses that drive gum inflammation. Our ML models - especially two types called Random Forest and XGBoost - perfectly spotted gum disease in initial tests and remained highly accurate in new patient groups. We also created a simple scoring chart (nomogram) to predict individual risk. The genes we found interact with immune cells and vitamin D pathways, revealing new disease mechanisms. This work provides a faster, more objective way to diagnose gum disease and opens doors for personalised treatments.

Keywords: Artificial intelligence; Diagnostic models; Machine learning; Periodontitis; Precision medicine; Statistical.

Conflict of interest statement

None disclosed.

Figures

Fig. 1
Analysis of differentially expressed genes (DEGs) across datasets and their functional characterization. A-B, Volcano plots illustrating DEGs in the A, GSE16134 and B, GSE10334 datasets. Red dots represent significantly up-regulated genes (fold change > 1, p value < 0.05), while blue dots indicate significantly down-regulated genes (fold change < 1, p value < 0.05). Gray dots represent non-significant genes. C, Venn diagram demonstrating the overlap of DEGs between the GSE16134 and GSE10334 datasets. Numbers in parentheses indicate the percentage of total DEGs in each dataset. D, Protein-protein interaction (PPI) network of overlapping DEGs constructed using the STRING database. Node colour intensity reflects the degree of protein interactions, with darker red indicating higher connectivity. E, KEGG pathway enrichment analysis of DEGs. The left panel displays gene expression patterns, where red and blue bars represent up- and down-regulated genes, respectively. The right panel shows enriched KEGG pathways colour-coded by functional category, with the significance of the top 10 pathways indicated by -log10(p-value).
Fig. 2
Identification of diagnostic genes for periodontitis. A, Selection of soft thresholds for the construction of gene co-expression networks. B, Hierarchical clustering trees of WGCNA modules for periodontitis, illustrating the modular structure of the networks. C, Correlation analysis between module eigengenes (MEs) and disease status, demonstrating module-trait associations. Each row corresponds to an ME, and each column represents a distinct group. D, Venn diagram depicting the intersection of disease-associated genes and DEGs. E, Boruta feature selection of periodontitis-related genes. Green indicates genes confirmed by the Boruta method. F-G, Least Absolute Shrinkage and Selection Operator (LASSO) regression. F, Each coloured line represents a unique gene, showing its coefficient profile along the LASSO regularization path. The optimal lambda (λ), determined by minimizing prediction error, is visually indicated, highlighting the regularization strength that achieves the best trade-off between bias and variance. G, The LASSO model optimized via 10-fold cross-validation. H, The upset plot shows the shared and unique features selected by six models: XGBoost, KNN, GLM, RF, SVM, and PLS. Horizontal bars indicate the size of the feature set for each model, while vertical bars represent features shared across model combinations. Connecting dots show specific model intersections, highlighting consensus and unique features.
Fig. 3
Comprehensive analysis of diagnostic biomarkers and machine learning models for periodontitis. A, Expression patterns of five candidate biomarkers (CSF2RB, COL15A1, MME, NEFM, CYP24A1) across healthy and periodontitis samples in the GSE16134 dataset. Box plots show median expression levels with interquartile ranges. B-D, Receiver Operating Characteristic (ROC) curves demonstrating diagnostic performance of individual biomarkers in three independent datasets: the training dataset (GSE16134), GSE10334, and the combined test dataset (GSE23586, GSE223328, GSE223924, GSE243173, GSE273165, and GSE27993). Area Under Curve (AUC) values are indicated for each biomarker. E-G, Comparative performance of six machine learning algorithms (KNN, GLM, PLS, RF, SVM, XGBoost) across the three datasets, shown through ROC curves with corresponding AUC values. H, Nomogram prediction model. I, Calibration plot of the nomogram prediction model, showing agreement between predicted and observed probabilities. The Hosmer-Lemeshow test p-value indicates model fit. J, ROC curves demonstrating diagnostic performance of the nomogram prediction model in the three independent datasets. K, Decision curve analysis evaluating clinical utility of biomarker combinations across different risk thresholds. Net benefit is plotted against probability threshold. L, Cost-benefit analysis of different risk stratification strategies, showing the number of high-risk individuals identified at various cost-benefit ratios.
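As a reading aid for the AUC values reported in these ROC panels, the sketch below computes AUC directly from its rank interpretation: the probability that a randomly chosen periodontitis sample receives a higher score than a randomly chosen healthy sample, with ties counted as one half. The labels and scores are invented toy values, not data from the study, and the function name `roc_auc` is an assumption of this example.

```python
def roc_auc(labels, scores):
    """AUC as P(case score > control score), counting ties as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 1 = periodontitis, 0 = healthy; scores are model outputs.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here five of the six case-control pairs are ordered correctly, giving an AUC of 5/6. An AUC of 1.0 on training data, as reported for RF and XGBoost, means every case outranks every control in that dataset.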
Fig. 4
Correlation between gene expression and immune score, ESTIMATE score, and stromal score. A-C, Each panel shows a scatter plot of log2TPM (gene expression level) versus A, ESTIMATE score, B, immune score, and C, stromal score, for the indicated genes. Each data point represents a single sample. The solid line represents the best fit line based on Spearman correlation analysis. R, Spearman correlation coefficient; p, p-value; n, number of samples. D, Proportions of immune cells in different samples. Bar plots show the average proportion of each immune cell type in the Healthy (red) and periodontitis (PD) (dark red) groups. The colour bars represent different immune cell types. E, Box plots showing the distribution of the estimated proportion of each immune cell type in the PD (yellow) and Healthy (green) groups. F, Correlations between immune cell types with significant differences in abundance between the PD and Healthy groups. Red indicates positive correlation, blue indicates negative correlation, and an asterisk (*) indicates p < 0.05. G, Correlations between the five diagnostic genes and the immune cell types with significant differences in abundance between the PD and Healthy groups.
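The Spearman coefficient R quoted in these panels is the Pearson correlation of the ranked values. A minimal sketch follows, assuming invented expression and immune-score values (not data from the study) and hypothetical helper names; tied values receive their average rank.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: log2TPM of one gene and an immune score in five samples.
expression = [1.2, 2.5, 3.1, 4.8, 5.0]
immune_score = [10, 30, 20, 40, 50]

rho = spearman(expression, immune_score)
print(round(rho, 3))
```

One pair of adjacent samples is swapped in rank, so the coefficient is 0.9 rather than 1. Because only ranks are used, the measure captures any monotonic relationship, not just a linear one.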
Fig. 5
Transcriptional regulatory network and drug prediction analysis for periodontitis diagnostic biomarkers. A, Transcriptional regulatory network of periodontitis diagnostic biomarkers. Red nodes represent periodontitis diagnostic biomarkers (CYP24A1, MME, NEFM, CSF2RB, COL15A1), while blue nodes indicate transcription factors. Edge weights represent regulatory interaction strengths. B, Drug-gene interaction network for periodontitis diagnostic biomarkers. Red nodes denote periodontitis diagnostic biomarkers, and yellow nodes represent potential therapeutic drugs predicted using the DGIdb database. Edge thickness and colour intensity correspond to interaction confidence scores, with thicker, darker red lines indicating higher prediction scores. C-D, Molecular docking analysis between periodontitis diagnostic biomarkers and predicted therapeutic compounds. The heatmap displays Vina docking scores (kcal/mol), with more negative values indicating stronger predicted binding affinities. CSF2RB showed the strongest binding affinity with Sargramostim (Vina score: −7.1), while MME exhibited the highest affinity with Candoxatril (Vina score: −7.9).
