An automated framework for QSAR model building

Samina Kausar¹ ², Andre O Falcao³ ⁴

Affiliations

¹ LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal.
² BioISI: Biosystems and Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal.
³ LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal. aofalcao@ciencias.ulisboa.pt.
⁴ BioISI: Biosystems and Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal. aofalcao@ciencias.ulisboa.pt.

PMID: 29340790
PMCID: PMC5770354
DOI: 10.1186/s13321-017-0256-5

J Cheminform. 2018 Jan 16;10(1):1.

Abstract

Background: Tools based on in silico quantitative structure-activity relationship (QSAR) models are widely used to screen large compound databases and to predict the biological properties of molecules from their chemical structure. As the amount of data on synthesized and known chemicals grows exponentially, there is a need for computationally efficient, automated QSAR modeling tools that are accessible to researchers who may lack extensive knowledge of machine learning. A fully automated, advanced modeling platform would therefore be an important addition for the QSAR community.

Results: In the presented workflow, the process from data preparation to model building and validation is completely automated. The workflow focuses on the modeling tasks that most strongly influence the performance of QSAR models: data curation, evaluation of data set characteristics, variable selection and validation. It also includes the ability to quickly evaluate whether a given data set can feasibly be modeled. The framework was tested on data sets from thirty different problems. The best-optimized feature selection methodology in the workflow removes 62-99% of all redundant data. On average, feature selection reduced the prediction error by about 19% and increased the percentage of variance explained (PVE) by 49% compared to models without feature selection. For the models with a modelability score above 0.6, the average PVE was 0.71. A strong correlation was found between the modelability scores and the PVE of the models produced with variable selection.
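The two metrics quoted above can be illustrated with a short sketch. It assumes that PVE takes the usual form 1 - SSE/SST (the abstract does not state the formula) and that the reported changes are relative to the model without feature selection. The function names `pve`, `rmse` and `relative_change` are hypothetical and are not taken from the published workflow.

```python
import math


def pve(y_true, y_pred):
    """Percentage of variance explained, assumed here to be 1 - SSE / SST."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally sized sequences with at least 2 values")
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    sst = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    if sst == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - sse / sst


def rmse(y_true, y_pred):
    """Root mean squared prediction error."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two equally sized, non-empty sequences")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_change(before, after):
    """Relative change of a metric, e.g. -0.19 for a 19% reduction."""
    return (after - before) / before


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    observed = [1.0, 2.0, 3.0, 4.0]
    full_model = [2.0, 1.0, 4.0, 3.0]      # hypothetical model with all variables
    selected_model = [1.5, 2.5, 2.5, 3.5]  # hypothetical model after feature selection
    print("PVE, full model:", pve(observed, full_model))
    print("PVE, selected features:", pve(observed, selected_model))
    print("Relative RMSE change:",
          relative_change(rmse(observed, full_model), rmse(observed, selected_model)))
```

Under this definition a PVE of 1 means perfect prediction, 0 means the model does no better than predicting the mean, and negative values are possible for models that do worse than the mean.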

Conclusions: We developed an extendable, highly customizable and fully automated QSAR modeling framework. The workflow requires no advanced parameterization and does not depend on user decisions or on expertise in machine learning or programming. Given only a target or problem, the workflow follows an unbiased standard protocol to develop reliable QSAR models, either by directly accessing manually curated online databases or by using private data sets. Other distinctive features of the workflow include prior estimation of data modelability, which avoids time-consuming modeling trials on non-modelable data sets; an efficient variable selection procedure; and the availability of output at each modeling task, which supports diverse applications and the reproduction of historical predictions. The results obtained on a selection of thirty QSAR problems suggest that the approach is capable of building reliable models even for challenging problems.
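The idea of estimating modelability before any model is trained can be sketched as follows. This is not the authors' modelability index, whose definition is not given in the abstract. It is an assumed stand-in based on leave-one-out k-nearest-neighbour regression, motivated only by the fact that the figure captions report the score for K = 3 and K = 5 nearest neighbours. The names `knn_loo_predictions`, `modelability_score` and `worth_modeling` are hypothetical; the 0.6 threshold is the one quoted in the Results.

```python
import math


def knn_loo_predictions(X, y, k=3):
    """Predict each activity as the mean activity of its k nearest other compounds."""
    preds = []
    for i, xi in enumerate(X):
        # Euclidean distance to every other compound (leave-one-out).
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = dists[:k]
        preds.append(sum(y[j] for _, j in neighbours) / len(neighbours))
    return preds


def modelability_score(X, y, k=3):
    """Assumed score: 1 - SSE/SST of the leave-one-out kNN predictions."""
    preds = knn_loo_predictions(X, y, k)
    mean_y = sum(y) / len(y)
    sse = sum((t - p) ** 2 for t, p in zip(y, preds))
    sst = sum((t - mean_y) ** 2 for t in y)
    if sst == 0:
        raise ValueError("activities have zero variance")
    return 1.0 - sse / sst


def worth_modeling(X, y, k=3, threshold=0.6):
    """Skip expensive model building when the cheap score falls below the threshold."""
    return modelability_score(X, y, k) > threshold


if __name__ == "__main__":
    # Toy one-descriptor data set; not data from the paper.
    descriptors = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
    smooth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # activity varies smoothly with structure
    erratic = [5.0, 0.0, 4.0, 1.0, 3.0, 2.0]  # neighbours have unrelated activities
    print(worth_modeling(descriptors, smooth, k=2))
    print(worth_modeling(descriptors, erratic, k=2))
```

The design choice here is that the check costs only one pass of pairwise distances, so it can act as a gate in front of the slower feature selection and model fitting steps.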

Keywords: Data set modelability; Feature selection; KNIME; Machine learning; Quantitative structure–activity relationship (QSAR); Random forests; Support vector machines; Variable importance.

Figures

**Fig. 1**
Overview of automated QSAR modeling workflow

**Fig. 2**
Automated QSAR modeling methodology

**Fig. 3**
Input data set options. Overview of possible ways to submit input data to the automated QSAR modeling workflow

**Fig. 4**
Input parameters. Input configurations required before running the workflow

**Fig. 5**
Comparison of models with and without feature selection. Pink represents the full model without feature selection [all variables (F)]; green represents the SF-model ((VI)1), which contains the predefined set of selected features (SF) identified by scaled permutation importance; blue represents the SF-model ((VI)2), whose features (SF) were selected by the unscaled variable importance measure

**Fig. 6**
Size of the problems and predictive power of fitted models. Blue dots represent externally validated models with feature selection by scaled importance, and golden yellow dots represent externally validated models with feature selection by the unscaled importance measure

**Fig. 7**
Models over-fitting analysis. Models with a predefined set of features identified by scaled variable importance (a) and unscaled variable importance (b)

**Fig. 8**
$\mathrm{MODI\_ssR}^{2}$ versus QSAR_PVE for 30 data sets. K is the number of nearest neighbors: (a) K = 3 and (b) K = 5. QSAR_PVE(IVS) is the PVE score of externally validated models without feature selection (Full-model) and with selected features (SF-model). The high correlation with the SF-model QSAR_PVE suggests that $\mathrm{MODI\_ssR}^{2}$ is a good modelability criterion. The weaker correlation between the Full-model QSAR_PVE and $\mathrm{MODI\_ssR}^{2}$ emphasizes the importance of feature selection for obtaining the actual and reliable predictive performance of a QSAR model
