J Chem Inf Model. 2014 Jul 28;54(7):2157-65.
doi: 10.1021/ci500264r. Epub 2014 Jul 17.

Are bigger data sets better for machine learning? Fusing single-point and dual-event dose response data for Mycobacterium tuberculosis

Sean Ekins et al. J Chem Inf Model.

Abstract

Tuberculosis is a major, neglected disease for which the search for new treatments continues. The public domain holds an abundance of data from large phenotypic screens against Mycobacterium tuberculosis (Mtb). Because machine learning methods learn from past data, we asked whether more data builds better models. Here we use Bayesian machine learning to assess whether our models improve when large quantities of single-point data are combined with much smaller, higher-quality dual-event data sets, which use dose-response data for both whole-cell antitubercular activity and Vero cell cytotoxicity. We evaluated 12 models built from single-point data, dual-event dose-response data, single-point plus dual-event dose-response data, and combined data sets, for three distinct data sets from the same laboratory. As test sets we used a fourth data set of active and inactive compounds from the same group and a smaller set of 177 active compounds from GlaxoSmithKline. Our data suggest that combining single-point with dual-event dose-response data does not diminish the internal or external predictive ability of the models, as measured by the receiver operating characteristic (ROC) curve (internal ROC range 0.83-0.91, external ROC range 0.62-0.83), compared with the orders-of-magnitude smaller dual-event models (internal ROC range 0.6-0.83, external ROC range 0.54-0.83). In conclusion, models developed with 1200-5000 compounds appear to be as predictive as those generated with 25 000-350 000 molecules. These results have implications for justifying further high-throughput screening versus focused testing based on model predictions.
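
The two ideas the abstract relies on, a dual-event activity label and an ROC-based comparison of models, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the field names, the potency and selectivity cutoffs, the toy compounds and the model scores are hypothetical and are not taken from the paper. The ROC value is computed with the generic rank-based (Mann-Whitney) formulation, not with the authors' Bayesian modeling software.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Compound:
    name: str
    mtb_ic90_uM: float   # whole-cell antitubercular potency (hypothetical field)
    vero_cc50_uM: float  # Vero cell cytotoxicity (hypothetical field)


def dual_event_label(c: Compound,
                     potency_cutoff_uM: float = 10.0,
                     selectivity_cutoff: float = 10.0) -> int:
    """Return 1 only if the compound is both potent and selective.

    Both cutoffs are illustrative assumptions, not values from the paper.
    """
    selectivity = c.vero_cc50_uM / c.mtb_ic90_uM
    potent = c.mtb_ic90_uM <= potency_cutoff_uM
    selective = selectivity >= selectivity_cutoff
    return int(potent and selective)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen active is scored above a
    randomly chosen inactive; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical compounds and hypothetical model scores).
compounds = [
    Compound("cpd_a", mtb_ic90_uM=1.0, vero_cc50_uM=50.0),    # potent, selective
    Compound("cpd_b", mtb_ic90_uM=2.0, vero_cc50_uM=4.0),     # potent, cytotoxic
    Compound("cpd_c", mtb_ic90_uM=40.0, vero_cc50_uM=400.0),  # selective, weak
    Compound("cpd_d", mtb_ic90_uM=5.0, vero_cc50_uM=100.0),   # potent, selective
]
labels = [dual_event_label(c) for c in compounds]
scores = [0.9, 0.7, 0.2, 0.6]

auc = roc_auc(labels, scores)
print(labels)         # [1, 0, 0, 1]
print(f"{auc:.2f}")   # 0.75
```

A single-point screen would typically label a compound active from one inhibition measurement alone, so a compound like cpd_b could count as active there while the dual-event label marks it inactive because of cytotoxicity. Comparing ROC values for models trained on each kind of label is the sort of comparison the abstract reports.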

Conflict of interest statement

SE is a consultant for Collaborative Drug Discovery, Inc.

Figures

Figure 1
Schema to show models built and evaluated (bold outlined = dose-response data)
Figure 2
PCA. (A) ARRA (red) and combined dose-response and cytotoxicity and single-point inactives (black); 74% of variance explained by the first 3 PCs. (B) 177 GSK compounds (red) and combined dose-response and cytotoxicity and negatives (black); 74% of variance explained by the first 3 PCs.
