Front Microbiol. 2025 May 26:16:1546770.
doi: 10.3389/fmicb.2025.1546770. eCollection 2025.

An integrative and comprehensive analysis of blood transcriptomes combined with machine learning models reveals key signatures for tuberculosis diagnosis and risk stratification

Maryam Omrani et al. Front Microbiol. 2025.

Abstract

Tuberculosis (TB) remains a major global health challenge, contributing substantially to morbidity and mortality worldwide. The progression from Mycobacterium tuberculosis (Mtb) infection to active disease involves a complex interplay between host immune responses and Mtb's ability to evade them. However, current diagnostic tools, such as interferon-gamma release assays (IGRAs) and tuberculin skin tests (TSTs), have limited ability to distinguish between different stages of TB or to predict the progression from infection to active disease. In this study, we performed an integrative analysis of 324 previously acquired blood transcriptome samples from TB patients, TB contacts, and controls across diverse geographical regions. Differential gene expression analysis revealed distinct transcriptomic signatures in TB patients, highlighting dysregulated pathways related to immune responses, antimicrobial peptides, and extracellular matrix organization. Using machine learning, we identified a 99-transcript signature that accurately distinguished TB patients from controls, demonstrated strong predictive performance across different cohorts, and identified potential progressors or subclinical cases. Validation in an independent dataset comprising 90 TB patients and 20 healthy controls confirmed the robustness of a 10-gene signature formed by the model's top 10 features (BATF2, FAM20A, FBLN2, AK5, VAMP5, MMP8, KLHDC8B, LINC00402, DEFA3, and GBP6), which achieved high area under the curve (AUC) values in both receiver operating characteristic (ROC) and precision-recall analyses. This 10-gene signature offers promising candidates for further validation and the development of diagnostic and prognostic tools, supporting global efforts to improve TB detection and risk stratification.

Keywords: Mycobacterium tuberculosis; RNA-seq; biomarkers; blood; machine-learning.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Machine learning workflow for TB classification. After standardizing the dataset and removing non-informative transcripts, the data were split into sub-train and validation sets. Decision tree-based algorithms, including random forests (RF), adaptive boosting (AdaBoost), and XGBoost (XGB), were employed for binary classification to distinguish active TB from control cases based on their differentiating DEGs. The best algorithm and hyperparameters were found using a grid search with 5-fold cross-validation to maximize the F-measure. The optimal model was retrained on the entire dataset and used to identify informative features for classification. Subsequently, the trained model was applied to IGRA/TST+ contacts to assess which of them it labels as active TB and thereby identify contacts that resemble active TB cases.
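The workflow in this caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic (generated with scikit-learn's make_classification), "non-informative" is approximated as zero variance, the hyperparameter grids are arbitrary, and XGBoost is left out so that the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a samples x transcripts matrix (1 = active TB, 0 = control).
X, y = make_classification(n_samples=200, n_features=50, n_informative=8, random_state=0)

# Split into sub-train and validation sets.
X_sub, X_val, y_sub, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("drop_constant", VarianceThreshold()),  # assumption: non-informative = zero variance
    ("scale", StandardScaler()),             # standardize each transcript
    ("clf", RandomForestClassifier(random_state=0)),
])

# Search over both the algorithm and its hyperparameters (illustrative grids).
param_grid = [
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [50, 100], "clf__max_depth": [3, None]},
    {"clf": [AdaBoostClassifier(random_state=0)],
     "clf__n_estimators": [50, 100], "clf__learning_rate": [0.5, 1.0]},
]

# Grid search with 5-fold cross-validation, maximizing the F-measure (F1).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X_sub, y_sub)
val_f1 = f1_score(y_val, search.predict(X_val))

# Retrain the best configuration on the entire dataset and rank features.
final_model = clone(search.best_estimator_).fit(X, y)
kept = np.flatnonzero(final_model.named_steps["drop_constant"].get_support())
importances = final_model.named_steps["clf"].feature_importances_
top10 = kept[np.argsort(importances)[::-1][:10]]

print("best algorithm:", type(search.best_params_["clf"]).__name__)
print("validation F1:", round(val_f1, 3))
print("top 10 feature indices:", top10.tolist())
```

In this sketch, scoring a separate matrix of contact samples would be a call such as final_model.predict(X_contacts), where X_contacts is a hypothetical array with the same transcript columns as X.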
Figure 2
Summary of differentially expressed genes (DEGs) in tuberculosis stages. (A) The number of DEGs, both up- and downregulated, related to four pairwise comparisons. Volcano plots highlight the most significant genes dysregulated between (B) active TB and control group, (C) active TB and IGRA/TST+ contacts, and (D) active TB and contacts. (E) A Venn diagram illustrates the overlapping TB signatures among the different comparisons.
Figure 3
Pathway analysis. (A) Pathway analysis comparing active TB and control groups. (B) Pathway analysis comparing active TB and IGRA/TST+ contacts. (C) Pathway analysis comparing active TB and contacts.
Figure 4
Enrichment score barplot for the TB-like group obtained by gene set enrichment analysis (GSEA) using (A) the 16 genes of the Zak et al. TB signature and (B) 22 of the 27 genes of the Kaforou et al. TB signature. The genes are ranked from top to bottom based on their fold changes, reflecting the degree of differential expression associated with the phenotype.
Figure 5
Validation of the model and top 10 features with an independent dataset. (A) ROC curve with confidence interval and individual iterations: The receiver operating characteristic (ROC) curves illustrate the performance of the classification model across five iterations of TB subsampling. Each curve corresponds to a subsampling, displaying the true-positive rate (TPR) against the false-positive rate (FPR). The dashed black diagonal line represents the random chance baseline. The blue curve represents the mean ROC across iterations, with the shaded blue region denoting the 95% confidence interval (CI) for the mean ROC. The area under the curve (AUC) for each iteration and the mean AUC value with 95% CI are annotated in the legend. (B) Precision–recall curve with confidence interval and individual iterations: The precision–recall (PR) curves evaluate the model’s precision (positive predictive value) against recall (sensitivity) for five TB subsampling iterations. Each curve represents an individual subsampling, with the green curve depicting the mean PR curve. The shaded green region highlights the 95% confidence interval (CI) for the mean PR curve. The area under the curve (AUC) for precision–recall for each iteration and the mean PR AUC with 95% CI are annotated in the legend.
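The per-iteration and mean AUC values described in this caption could be computed along the following lines. This is a sketch on synthetic scores, not the study's data or code: the subsample size of 20 TB patients per iteration, the Gaussian score distributions, and the normal-approximation confidence interval are all assumptions. With only five iterations, a t-based interval would be wider.

```python
import random
import statistics


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (ties between scores are not merged)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at each recalled positive
    return total / sum(labels)


def mean_with_ci(values, level=0.95):
    """Mean and normal-approximation confidence interval for the mean."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, mean - z * sem, mean + z * sem


rng = random.Random(0)
# Synthetic classifier scores: 90 TB patients and 20 healthy controls.
tb_scores = [rng.gauss(0.7, 0.15) for _ in range(90)]
control_scores = [rng.gauss(0.3, 0.15) for _ in range(20)]

roc_aucs, pr_aucs = [], []
for _ in range(5):  # five TB subsampling iterations
    tb_sample = rng.sample(tb_scores, 20)  # assumption: balance against the 20 controls
    labels = [1] * len(tb_sample) + [0] * len(control_scores)
    scores = tb_sample + control_scores
    roc_aucs.append(roc_auc(labels, scores))
    pr_aucs.append(average_precision(labels, scores))

print("ROC AUC mean and 95% CI:", mean_with_ci(roc_aucs))
print("PR AUC mean and 95% CI:", mean_with_ci(pr_aucs))
```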
