Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Jan 16;10(1):1.
doi: 10.1186/s13321-017-0256-5.

An automated framework for QSAR model building

Affiliations

An automated framework for QSAR model building

Samina Kausar et al. J Cheminform. .

Abstract

Background: In-silico quantitative structure-activity relationship (QSAR) models based tools are widely used to screen huge databases of compounds in order to determine the biological properties of chemical molecules based on their chemical structure. With the passage of time, the exponentially growing amount of synthesized and known chemicals data demands computationally efficient automated QSAR modeling tools, available to researchers that may lack extensive knowledge of machine learning modeling. Thus, a fully automated and advanced modeling platform can be an important addition to the QSAR community.

Results: In the presented workflow the process from data preparation to model building and validation has been completely automated. The most critical modeling tasks (data curation, data set characteristics evaluation, variable selection and validation) that largely influence the performance of QSAR models were focused. It is also included the ability to quickly evaluate the feasibility of a given data set to be modeled. The developed framework is tested on data sets of thirty different problems. The best-optimized feature selection methodology in the developed workflow is able to remove 62-99% of all redundant data. On average, about 19% of the prediction error was reduced by using feature selection producing an increase of 49% in the percentage of variance explained (PVE) compared to models without feature selection. Selecting only the models with a modelability score above 0.6, average PVE scores were 0.71. A strong correlation was verified between the modelability scores and the PVE of the models produced with variable selection.

Conclusions: We developed an extendable and highly customizable fully automated QSAR modeling framework. This designed workflow does not require any advanced parameterization nor depends on users decisions or expertise in machine learning/programming. With just a given target or problem, the workflow follows an unbiased standard protocol to develop reliable QSAR models by directly accessing online manually curated databases or by using private data sets. The other distinctive features of the workflow include prior estimation of data modelability to avoid time-consuming modeling trials for non modelable data sets, an efficient variable selection procedure and the facility of output availability at each modeling task for the diverse application and reproduction of historical predictions. The results reached on a selection of thirty QSAR problems suggest that the approach is capable of building reliable models even for challenging problems.

Keywords: Data set modelability; Feature selection; KNIME; Machine learning; Quantitative structure–activity relationship (QSAR); Random forests; Support vector machines; Variable importance.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
Overview of automated QSAR modeling workflow
Fig. 2
Fig. 2
Automated QSAR modeling methodology
Fig. 3
Fig. 3
Input data set options. Overview of possible ways to submit input data to the automated QSAR modeling workflow
Fig. 4
Fig. 4
Input parameters. Input configurations required before to run the workflow
Fig. 5
Fig. 5
Comparison of models with and without feature selection. Pink color represents the full-model without feature selection [with all variables (F)], green color is for SF-model ((VI)1) contains predefined set of features (SF) identified by scaled permutation importance, and blue color represents SF-model ((VI)2) having selected features (SF) by unscaled variable importance measure
Fig. 6
Fig. 6
Size of the problems and predictive power of fitted models. Blue dots represent externally validated models with feature selection by scaled importance, and golden yellow color denotes externally validated models with feature selection by unscaled importance measure
Fig. 7
Fig. 7
Models over-fitting analysis. Models with a predefined set of features identified by scaled variable importance (a) and unscaled variable importance (b)
Fig. 8
Fig. 8
MODI_ssR2 versus QSAR_PVE for 30 datasets. K is the number of nearest neighbors. a K = 3 and b K = 5. QSAR_PVE(IVS) is PVE score of externally validated models without feature selection (Full-model) and with selected features (SF-model). High correlation with SF-models QSAR_PVE suggests MODI_ssR2 is good modelability criteria. Weaker correlation between Full-model QSAR_PVE and MODI_ssR2 emphasize the importance of feature selection to obtain actual and reliable predictive performance of QSAR model

References

    1. Agarwal S, Dugar D, Sengupta S. Ranking chemical structures for drug discovery: a new machine learning approach. J Chem Inf Model. 2010;50:716–731. doi: 10.1021/ci9003865. - DOI - PubMed
    1. Hsin KY, Ghosh S, Kitano H. Combining machine learning systems and multiple docking simulation packages to improve docking prediction reliability for network pharmacology. PLoS ONE. 2013 - PMC - PubMed
    1. Matsumoto A, Aoki S, Ohwada H. Comparison of random forest and SVM for raw data in drug discovery: prediction of radiation protection and toxicity case study. Int J Mach Learn Comput. 2016;6(2):145–148. doi: 10.18178/ijmlc.2016.6.2.589. - DOI
    1. Lima AN, Philot EA, Goulart Trossini GH, Barbour Scott LP, Maltarollo VG, Honorio KM. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov. 2016;11(3):225–239. doi: 10.1517/17460441.2016.1146250. - DOI - PubMed
    1. Mantus E. Toxicity testing in the 21st century. Alttox Org. 2007

LinkOut - more resources