A comprehensive machine learning for high throughput Tuberculosis sequence analysis, functional annotation, and visualization

Affiliations

¹ Department of Biomedical Engineering, Faculty of Engineering and Technology, Islamic University, Kushtia, 7003, Bangladesh. saddam.iu.bme@gmail.com.
² Department of Biomedical Engineering, Faculty of Engineering and Technology, Islamic University, Kushtia, 7003, Bangladesh.
³ Department of Industrial and Production Engineering, Faculty of Mechanical Engineering, Dhaka University of Engineering and Technology, Gazipur, 1707, Bangladesh.
⁴ University of Technology of Compiègne, 4297 TIMR, 60205, Compiègne Cedex, EA, France.
⁵ Department of Biology, Bahir Dar University, P.O.Box 79, Bahir Dar, Ethiopia. smartresercher@gmail.com.
⁶ Laboratory of Biotechnology and Natural Resources Valorization, Faculty of Sciences, Ibn Zohr University, 80060, Agadir, Morocco.
⁷ Ethnopharmacology and Pharmacognosy Team, Department of Biology, Moulay Ismail University of Meknes, Errachidia, Morocco.
⁸ Department of Botany and Microbiology, College of Science, King Saud University, P. O. BOX 2455, 11451, Riyadh, Saudi Arabia. kalmaary@ksu.edu.sa.

PMID: 40670587
PMCID: PMC12267406
DOI: 10.1038/s41598-025-98654-0

Md Saddam Hossain et al. Sci Rep. 2025 Jul 16;15(1):25866. doi: 10.1038/s41598-025-98654-0.

Abstract

With human guidance, computers now use machine learning (ML), a branch of artificial intelligence (AI), to learn from data, detect trends, and make predictions, and software can adapt and improve as new information arrives. In medical imaging, pattern recognition is used to predict outcomes, diagnose disorders, and suggest treatments. Tuberculosis (TB) remains the most common bacterial disease affecting humans. The World Health Organization reported that 1.3 million people died from tuberculosis in 2022, and that the death rate could reach 66% if proper treatment is not provided. We trained supervised ML algorithms (XGBoost, Logistic Regression, Random Forest Classifier, AdaBoost, and Support Vector Machine) to classify TB patients from large RNA-sequence count data. These algorithms achieved prediction accuracies of 0.963, 0.739, 0.773, 0.866, and 0.866, respectively. This article highlights feature importance techniques applied to XGBoost, the model with the highest prediction accuracy (0.963), to identify significant genes in TB RNA-sequence count data. Using the key machine learning features, we identified 20 pathways, 24 gene ontologies, 20 hub genes, and 22 drugs. We then applied advanced computational techniques, including pathway analysis, gene ontology (GO), hub-protein and protein-protein interactions (PPI), transcriptomic and miRNA interactions, and drug-protein interactions, to analyze 100 highly expressed genes.
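
To make the model comparison concrete, the following minimal sketch pairs each classifier with the accuracy reported in the abstract, in the order listed, and ranks them. It also shows how an accuracy score is computed from true and predicted labels. The helper names (`accuracy`, `rank_models`) and the toy labels are illustrative assumptions, not taken from the paper, and no model is actually trained here.

```python
# Model names and accuracies as listed in the abstract, in the same order.
MODELS = [
    "XGBoost",
    "Logistic Regression",
    "Random Forest Classifier",
    "AdaBoost",
    "Support Vector Machine",
]
ACCURACIES = [0.963, 0.739, 0.773, 0.866, 0.866]


def accuracy(y_true, y_pred):
    """Return the fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rank_models(names, scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


ranking = rank_models(MODELS, ACCURACIES)
best_name, best_score = ranking[0]

# Hypothetical labels for illustration only (1 = TB, 0 = control).
toy_true = [1, 1, 0, 0, 1, 0, 1, 0]
toy_pred = [1, 1, 0, 1, 1, 0, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)  # 7 of 8 labels match

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Accuracy alone can hide class imbalance, so a fuller evaluation would typically also report metrics such as precision and recall; the abstract reports accuracy only.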

Keywords: Bioinformatics; DEGs; Gene ontology; Hub gene; ML; PPIs; Potential drug; TB.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests. Ethics approval and consent to participate: Not applicable. Ethical consideration: Not applicable. Consent for publication: Not applicable.

Figures

**Fig. 1**
The proposed approach and workflow.

**Fig. 2**
Supervised learning model to diagnose tuberculosis.

**Fig. 3**
Comparative evaluation of the ML models.

**Fig. 4**
Supervised learning to diagnose tuberculosis.

**Fig. 5**
An overview of the network abundance for tuberculosis DEGs. The Y-axis lists the signaling pathways and the X-axis shows the negative log₁₀ P value. Asthma has the highest negative log₁₀ P value.
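
As a rough illustration of the X-axis quantity in Fig. 5, the sketch below converts P values to negative log₁₀ values and sorts pathways by that score, so smaller P values rank higher. The P values and the placeholder pathway names are hypothetical; only "Asthma" is named in the caption, and its value here is invented for the example.

```python
import math


def neg_log10(p_value):
    """Return -log10(p) for a P value in the interval (0, 1]."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P value must be in (0, 1]")
    return -math.log10(p_value)


# Hypothetical P values for illustration; not taken from the paper.
pathway_p_values = {
    "Asthma": 1e-6,
    "Pathway B (placeholder)": 1e-4,
    "Pathway C (placeholder)": 1e-2,
}

# Sort pathways so the most significant (largest -log10 P) comes first.
ranked_pathways = sorted(
    ((name, neg_log10(p)) for name, p in pathway_p_values.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked_pathways:
    print(f"{name}: {score:.2f}")
```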

**Fig. 6**
PPI network constructed from the tuberculosis DEGs. Differentially expressed protein genes are represented by the circular nodes, and the interactions between the nodes are shown by the edges. The PPI network consists of 40 nodes and 76 edges. STRING was used to build the PPI network, and Cytoscape was used to visualize it.

**Fig. 7**
Identification of hub genes within the cluster using cytoHubba: application of the MCC (Maximal Clique Centrality) and BottleNeck algorithms, and network comparison. The linkages between the top 10 hub genes from each method and additional genes (yellow) are indicated by dark green highlights. The (A) BottleNeck network has 30 nodes and 65 edges, while the (B) MCC network has 22 nodes and 55 edges.
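
The idea of ranking hub genes in a PPI network can be sketched with plain degree centrality, which counts each node's interaction partners. This is a simpler stand-in and is not the MCC or BottleNeck algorithm used via cytoHubba in Fig. 7. The gene labels (`G1` to `G5`), the edge list, and the function names are hypothetical.

```python
from collections import defaultdict


def degree_by_node(edges):
    """Count distinct interaction partners per node in an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(partners) for node, partners in neighbors.items()}


def top_hubs(edges, k):
    """Return the k nodes with the highest degree (ties broken alphabetically)."""
    degrees = degree_by_node(edges)
    return sorted(degrees, key=lambda node: (-degrees[node], node))[:k]


# Hypothetical toy network; not derived from the paper's data.
toy_edges = [
    ("G1", "G2"),
    ("G1", "G3"),
    ("G1", "G4"),
    ("G2", "G3"),
    ("G4", "G5"),
]

toy_degrees = degree_by_node(toy_edges)
hubs = top_hubs(toy_edges, 2)
print(hubs)
```

Degree treats every edge equally, whereas clique-based and bottleneck-based scores weight a node by the local structure around it, so the rankings can differ on real networks.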

**Fig. 8**
NetworkAnalyst framework for the integrated regulatory interactions among DEGs and TFs, using the (a) ChEA and (b) JASPAR databases. Network (a) contains 47 nodes and 196 edges, while network (b) contains 31 nodes and 95 edges. Transcription factors are represented by square nodes, while genes that are connected to transcription factors are represented by circular nodes.

**Fig. 9**
The network of regulatory relationships between miRNAs and DEGs. The circular nodes represent genes, which link to the miRNAs represented by square nodes. Network (a) contains 22 nodes and 29 edges and was constructed using the miRTarBase database, while network (b) contains 23 nodes and 54 edges and was constructed using the TarBase database.

**Fig. 10**
The 22 potential medications for tuberculosis treatment identified through the protein-drug interaction approach. Among them, 18 drugs target the C1QB gene, while the others interact with the SPR gene. In the diagram, medications are represented by rectangular nodes, and their corresponding gene targets are depicted as circular symbols.
