Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery

Nicolas Bosc¹, Francis Atkinson², Eloy Felix², Anna Gaulton², Anne Hersey², Andrew R Leach²

Affiliations

¹ Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. nbosc@ebi.ac.uk.
² Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

PMID: 30631996
PMCID: PMC6690068
DOI: 10.1186/s13321-018-0325-4

Nicolas Bosc et al. J Cheminform. 2019 Jan 10;11(1):4. doi: 10.1186/s13321-018-0325-4.

Abstract

Structure-activity relationship modelling is frequently used in the early stage of drug discovery to assess the activity of a compound on one or several targets, and can also be used to assess the interaction of compounds with liability targets. QSAR models have been used for these and related applications over many years, with good success. Conformal prediction is a relatively new QSAR approach that provides information on the certainty of a prediction, and so helps in decision-making. However, it is not always clear how best to make use of this additional information. In this article, we describe a case study that directly compares conformal prediction with traditional QSAR methods for large-scale predictions of target-ligand binding. The ChEMBL database was used to extract a data set comprising data from 550 human protein targets with different bioactivity profiles. For each target, a QSAR model and a conformal predictor were trained and their results compared. The models were then evaluated on new data published since the original models were built to simulate a "real world" application. The comparative study highlights the similarities between the two techniques but also some differences that it is important to bear in mind when the methods are used in practical drug discovery applications.

Keywords: ChEMBL; Cheminformatics; Classification models; Mondrian conformal prediction; QSAR.
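
To make concrete how a Mondrian conformal predictor attaches certainty information to a prediction, the following is a minimal Python sketch, not the authors' implementation. It assumes a nonconformity score where higher means "less typical of that class" (for example, one minus the class probability from some underlying QSAR model); the abstract does not specify the measure used. The function names and the calibration values are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def mondrian_p_values(calibration, test_scores):
    """Compute class-conditional (Mondrian) p-values for one test compound.

    calibration: list of (true_label, nonconformity_score) pairs from a
        held-out calibration set (illustrative data, not from the paper).
    test_scores: dict mapping each candidate label to the nonconformity
        score the test compound would have if it carried that label.
    """
    by_class = defaultdict(list)
    for label, score in calibration:
        by_class[label].append(score)

    p_values = {}
    for label, alpha in test_scores.items():
        scores = by_class[label]
        # Count calibration compounds of this class that are at least as
        # nonconforming as the test compound; the +1 accounts for the test
        # compound itself.
        n_ge = sum(1 for s in scores if s >= alpha)
        p_values[label] = (n_ge + 1) / (len(scores) + 1)
    return p_values


def prediction_set(p_values, confidence):
    """Keep every label whose p-value exceeds the significance level."""
    significance = 1.0 - confidence
    return {label for label, p in p_values.items() if p > significance}


def describe(labels):
    """Map a prediction set to a single-class, 'both' or 'empty' outcome."""
    if len(labels) == 2:
        return "both"
    if not labels:
        return "empty"
    return next(iter(labels))


# Hypothetical calibration scores, seven compounds per class.
calibration = (
    [("active", s) for s in (0.05, 0.10, 0.20, 0.40, 0.55, 0.75, 0.90)]
    + [("inactive", s) for s in (0.05, 0.10, 0.15, 0.25, 0.35, 0.45, 0.65)]
)

# Hypothetical test compound: fairly typical of actives, atypical of inactives.
p = mondrian_p_values(calibration, {"active": 0.3, "inactive": 0.7})
print(p)                                  # {'active': 0.625, 'inactive': 0.125}
print(describe(prediction_set(p, 0.80)))  # active
print(describe(prediction_set(p, 0.90)))  # both
```

In this toy example the same compound receives a single-class prediction at 80% confidence and a 'both' prediction at 90%, because a higher confidence level lowers the threshold a label must pass to stay in the prediction set. Computing p-values separately per class is what makes the predictor "Mondrian".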

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Schema of the data collection from ChEMBL

**Fig. 2**
Percentage of the 550 selected targets by protein families. The protein family colours are the same for all the figures

**Fig. 3**
Mean CCR of the 550 QSAR models grouped by protein family

**Fig. 4**
Overall sensitivity, specificity and CCR for the 550 conformal predictors at different confidence levels. Results show the performance according to whether the ‘both’ predictions are included or excluded from the calculation

**Fig. 5**
Sensitivity (a) and specificity (b) versus the ratio of active to inactive compounds for each QSAR model. Colours represent the protein families as described in the legend of Fig. 3

**Fig. 6**
CCR comparison between results of QSAR and MCP models at the 80% (a, b) and 90% (c, d) confidence levels. In a and c, the ‘both’ class prediction is included in the model evaluation, while it is left out in b and d. The targets are divided into four quadrants depending on whether they have good results for both MCP and QSAR (upper-right), for MCP only (upper-left), for QSAR only (bottom-right), or for neither (bottom-left)

**Fig. 7**
Evolution of the MCP performance depending on the confidence level for hERG

**Fig. 8**
Performance of the MCP models on the temporal validation set at different confidence levels. The results show the performance according to whether the ‘both’ predictions are included or excluded from the calculation

**Fig. 9**
Comparison of the compound assignments in the uncertain class for MCP (at 80% confidence level) with QSAR for (a) the inactive and (b) the active compounds. The pink set represents the molecules (active or inactive) that are correctly predicted by QSAR, the green set represents the uncertain predictions from MCP, and the brown set is the intersection of the two, that is, the molecules predicted as uncertain by MCP but correctly predicted by QSAR
