Sci Rep. 2025 Jul 16;15(1):25866.
doi: 10.1038/s41598-025-98654-0.

A comprehensive machine learning for high throughput Tuberculosis sequence analysis, functional annotation, and visualization

Md Saddam Hossain et al. Sci Rep. 2025.

Abstract

With human guidance, computers now use machine learning (ML), a branch of artificial intelligence (AI), to learn from data, detect trends, and make predictions. Software can adapt and improve with new information. Imaging scans leverage pattern recognition to predict outcomes, diagnose disorders, and suggest treatments. Tuberculosis (TB) remains the most common bacterial disease affecting humans. The World Health Organization reported that 1.3 million people died from tuberculosis in 2022, with the death rate potentially reaching 66% if proper treatment is not provided. We trained supervised ML algorithms, namely XGBoost, Logistic Regression, Random Forest Classifier, AdaBoost, and Support Vector Machine, to help classify TB patients from large RNA-sequence count data. These algorithms achieved prediction accuracies of 0.963, 0.739, 0.773, 0.866, and 0.866, respectively. This article highlights feature importance techniques using the ML model with the highest prediction accuracy (XGBoost, 0.963) to identify significant genes in TB RNA-sequence count data. Using key machine learning features, we identified 20 pathways, 24 gene ontologies, 20 hub genes, and 22 drugs. We then applied advanced computational techniques, including pathway analysis, gene ontology (GO) analysis, hub-protein and protein-protein interaction (PPI) analysis, transcriptomic and miRNA interaction analysis, and drug-protein interaction analysis, to help analyze 100 highly expressed genes.
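As a rough illustration of the model comparison described in the abstract, the sketch below pairs each algorithm with its reported accuracy in the order listed and picks the best one. The `accuracy` helper and all variable names are illustrative assumptions, not the authors' code; the toy labels passed to `accuracy` are invented only to show how the metric is computed.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Models and accuracies as listed in the abstract, paired in order.
models = [
    "XGBoost",
    "Logistic Regression",
    "Random Forest Classifier",
    "AdaBoost",
    "Support Vector Machine",
]
reported_accuracies = [0.963, 0.739, 0.773, 0.866, 0.866]
accuracy_by_model = dict(zip(models, reported_accuracies))

# The model with the highest reported accuracy is the one used
# for the feature importance step.
best_model = max(accuracy_by_model, key=accuracy_by_model.get)

# Hypothetical toy labels (1 = TB, 0 = control), for illustration only.
toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

if __name__ == "__main__":
    print(best_model, accuracy_by_model[best_model])
    print(toy_accuracy)
```

Accuracy alone can be misleading on imbalanced patient cohorts, so a fuller comparison would typically also report metrics such as precision, recall, or AUC.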

Keywords: Bioinformatics; DEGs; Gene ontology; Hub gene; ML; PPIs; Potential drug; TB.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests. Ethics approval and consent to participate: Not applicable. Ethical consideration: Not applicable. Consent for publication: Not applicable.

Figures

Fig. 1
The suggested approach and the workflow.
Fig. 2
Supervised learning model to diagnose tuberculosis.
Fig. 3
Comparative evaluation of the ML models.
Fig. 4
Supervised learning to diagnose tuberculosis.
Fig. 5
An overview of the network abundance for tuberculosis DEGs. The Y-axis lists signaling pathways and the X-axis shows the negative log10 P value. Asthma has the highest negative log10 P value.
Fig. 6
PPI network of DEGs for tuberculosis. Circular nodes represent differentially expressed protein genes, and edges represent the interactions between nodes. The PPI network comprises 40 nodes and 76 edges. STRING was used to build the network, and Cytoscape was used to visualize it.
Fig. 7
Identification of hub genes within the cluster using cytoHubba: application of the MCC (Maximal Clique Centrality) and BottleNeck algorithms and network comparison. Dark green highlights indicate the linkages between the top 10 hub genes from each method and additional genes (yellow). (A) The BottleNeck network has 30 nodes and 65 edges; (B) the MCC network has 22 nodes and 55 edges.
Fig. 8
The NetworkAnalyst framework for integrated regulatory interactions among DEGs and TFs, using the (a) ChEA and (b) JASPAR databases. Network (a) contains 47 nodes and 196 edges, while (b) has 31 nodes and 95 edges. Square nodes represent transcription factors, and circular nodes represent genes connected to transcription factors.
Fig. 9
The network of regulatory relationships between miRNAs and DEGs. Circular nodes represent genes, and square nodes represent the miRNAs linked to them. Network (a) contains 22 nodes and 29 edges, while (b) has 23 nodes and 54 edges; they were constructed using the miRTarBase and TarBase databases, respectively.
Fig. 10
This figure depicts 22 potential medications for tuberculosis treatment identified through the protein-drug interaction approach. Among them, 18 drugs target the C1QB gene, while the others interact with the SPR gene. Rectangular nodes represent medications, and spherical symbols represent their corresponding gene targets.
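
The hub-gene ranking mentioned for Fig. 7 can be sketched on a toy network. The example below assumes the commonly cited definition of Maximal Clique Centrality, in which a node's score is the sum of (|C| - 1)! over every maximal clique C that contains it; it is not the cytoHubba implementation. The edge list and node names (`g1` to `g5`) are hypothetical placeholders, not genes from the study, and maximal cliques are enumerated with a basic Bron-Kerbosch search without pivoting.

```python
from math import factorial


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            neighbours = adjacency[node]
            expand(current | {node}, candidates & neighbours, excluded & neighbours)
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


def mcc_scores(edges):
    """Score each node as the sum of (|C| - 1)! over its maximal cliques."""
    adjacency = build_adjacency(edges)
    scores = {node: 0 for node in adjacency}
    for clique in maximal_cliques(adjacency):
        weight = factorial(len(clique) - 1)
        for node in clique:
            scores[node] += weight
    return scores


# Hypothetical toy PPI network: one triangle plus a two-edge tail.
toy_edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")]
scores = mcc_scores(toy_edges)

# Rank nodes by descending score, breaking ties by name.
ranking = sorted(scores, key=lambda node: (-scores[node], node))

if __name__ == "__main__":
    for node in ranking:
        print(node, scores[node])
```

In this toy network `g3` ranks first because it belongs to both the triangle and an additional edge. For a node whose neighbours are not connected to each other, every maximal clique is a single edge, so the score reduces to the node's degree.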
