Prediction of Tuberculosis Using an Automated Machine Learning Platform for Models Trained on Synthetic Data
- PMID: 35136677
- PMCID: PMC8794034
- DOI: 10.4103/jpi.jpi_75_21
Abstract
High-quality medical data is critical to the development and implementation of machine learning (ML) algorithms in healthcare; however, security and privacy concerns continue to limit access. We sought to determine the utility of "synthetic data" in training ML algorithms for the detection of tuberculosis (TB) from inflammatory biomarker profiles. A retrospective dataset (A) comprising 278 patients was used to generate synthetic datasets (B, C, and D) for training models prior to secondary validation on a generalization dataset. ML models trained and validated on Dataset A (real) demonstrated an accuracy of 90%, a sensitivity of 89% (95% CI, 83-94%), and a specificity of 100% (95% CI, 81-100%). Models trained using the optimal synthetic dataset B showed an accuracy of 91%, a sensitivity of 93% (95% CI, 87-96%), and a specificity of 77% (95% CI, 50-93%). Models trained on synthetic datasets C and D performed worse (accuracies of 71% and 54%, respectively). This pilot study highlights the promise of synthetic data as an expedited means for ML algorithm development.
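The performance measures reported above (accuracy, sensitivity, specificity, each with a 95% confidence interval) can be illustrated with a minimal sketch. It assumes a binary confusion matrix and uses the Wilson score interval, which is not necessarily the interval method the authors used. The function names and the counts in the usage lines are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts for illustration only (NOT the study's data).
metrics = diagnostic_metrics(tp=90, fn=10, tn=18, fp=2)
for name, value in metrics.items():
    print(name, value)
```

With few negative cases, the specificity interval becomes wide, which is consistent with the broad specificity intervals reported in the abstract.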
Keywords: Artificial intelligence; biomarkers; data accessibility; electronic medical record; privacy; simulation.
Copyright: © 2022 Journal of Pathology Informatics.
Conflict of interest statement
Dr. Rashidi is a co-inventor of MILO and owns shares in MILO-ML, LLC. Dr. Albahra is a co-inventor of the MILO software and owns shares in MILO-ML, LLC. Dr. Tran is a co-inventor of the MILO software and owns shares in MILO-ML, LLC. He is also a consultant for Roche Diagnostics and Roche Molecular Systems.