J Pathol Inform. 2022 Jan 20:13:10.
doi: 10.4103/jpi.jpi_75_21. eCollection 2022.

Prediction of Tuberculosis Using an Automated Machine Learning Platform for Models Trained on Synthetic Data

Hooman H Rashidi et al. J Pathol Inform. 2022.

Abstract

High-quality medical data is critical to the development and implementation of machine learning (ML) algorithms in healthcare; however, security and privacy concerns continue to limit access. We sought to determine the utility of "synthetic data" in training ML algorithms for the detection of tuberculosis (TB) from inflammatory biomarker profiles. A retrospective dataset (A) comprising 278 patients was used to generate synthetic datasets (B, C, and D) for training models prior to secondary validation on a generalization dataset. ML models trained and validated on Dataset A (real) demonstrated an accuracy of 90%, a sensitivity of 89% (95% CI, 83-94%), and a specificity of 100% (95% CI, 81-100%). Models trained using the optimal synthetic dataset, Dataset B, showed an accuracy of 91%, a sensitivity of 93% (95% CI, 87-96%), and a specificity of 77% (95% CI, 50-93%). Models trained on synthetic Datasets C and D performed worse (accuracies of 71% and 54%, respectively). This pilot study highlights the promise of synthetic data as an expedited means for ML algorithm development.
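
The performance measures quoted above (accuracy, sensitivity, specificity, each with a 95% CI) can be derived from a confusion matrix. The following is a minimal Python sketch under stated assumptions: the abstract does not say which interval method the authors used, so the Wilson score interval here is an illustrative choice, and the function names (`wilson_interval`, `diagnostic_metrics`) and the counts in the example are hypothetical, not taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%).

    Assumption: the paper's CI method is not stated; Wilson is one common choice.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts for illustration only (NOT the study's data).
metrics = diagnostic_metrics(tp=90, fn=10, tn=18, fp=2)
for name, value in metrics.items():
    print(name, value)
```

Note that a small number of negative cases produces a wide specificity interval, which is consistent with the wide specificity CIs reported in the abstract.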

Keywords: Artificial intelligence; biomarkers; data accessibility; electronic medical record; privacy; simulation.

Conflict of interest statement

Dr. Rashidi is a co-inventor of MILO and owns shares in MILO-ML, LLC. Dr. Albahra is a co-inventor of the MILO software and owns shares in MILO-ML, LLC. Dr. Tran is a co-inventor of the MILO software and owns shares in MILO-ML, LLC. He is also a consultant for Roche Diagnostics and Roche Molecular Systems.

Figures

Fig. 1
Study design.
Fig. 2
Distribution of Dataset A vs. Dataset B.
Fig. 3
Overview of MILO workflow.
Fig. 4
Q-Q plot of Dataset A vs. Dataset B: The figure shows the Q-Q (quantile-quantile) plot for each attribute in the original dataset and the synthetic dataset. It shows that the distribution of each attribute is similar across the two datasets.
Fig. 5
Paradigm for AI/ML development in healthcare. Synthetic data may help to improve access to clinical data if it is shown to reduce regulatory hurdles.
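
The Q-Q comparison described in the Fig. 4 caption can be approximated numerically by pairing matching quantiles of one attribute from the real and synthetic datasets. The sketch below is illustrative only: it uses simulated Gaussian values rather than the study's biomarker data, and the helper names (`qq_points`, `max_quantile_gap`) are hypothetical. Points that lie close to the line y = x suggest similar distributions.

```python
import random
import statistics


def qq_points(real, synthetic, n_quantiles=20):
    """Pair matching quantiles of two samples (the points of a Q-Q plot)."""
    q_real = statistics.quantiles(real, n=n_quantiles, method="inclusive")
    q_syn = statistics.quantiles(synthetic, n=n_quantiles, method="inclusive")
    return list(zip(q_real, q_syn))


def max_quantile_gap(points):
    """Largest absolute deviation from the y = x line across quantile pairs."""
    return max(abs(a - b) for a, b in points)


# Simulated attribute values for illustration only (NOT the study's data).
rng = random.Random(0)
real = [rng.gauss(10.0, 2.0) for _ in range(278)]
synthetic = [rng.gauss(10.0, 2.0) for _ in range(278)]
shifted = [x + 5.0 for x in real]  # deliberately mismatched distribution

print("same distribution gap:", max_quantile_gap(qq_points(real, synthetic)))
print("shifted distribution gap:", max_quantile_gap(qq_points(real, shifted)))
```

A small maximum gap is only a rough summary; a plotted Q-Q chart per attribute, as in the figure, shows where in the distribution any mismatch occurs.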
