Efficient iterative virtual screening with Apache Spark and conformal prediction

Laeeq Ahmed¹, Valentin Georgiev², Marco Capuccini²,³, Salman Toor³, Wesley Schaal², Erwin Laure⁴, Ola Spjuth²

Affiliations

¹ Department of Computational Science and Technology, Royal Institute of Technology (KTH), Lindstedtsvägen 5, 10044, Stockholm, Sweden. laeeq@kth.se.
² Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 75124, Uppsala, Sweden.
³ Department of Information Technology, Uppsala University, Box 337, 75105, Uppsala, Sweden.
⁴ Department of Computational Science and Technology, Royal Institute of Technology (KTH), Lindstedtsvägen 5, 10044, Stockholm, Sweden.

PMID: 29492726
PMCID: PMC5833896
DOI: 10.1186/s13321-018-0265-z

Laeeq Ahmed et al. J Cheminform. 2018 Mar 1;10(1):8. doi: 10.1186/s13321-018-0265-z.

Abstract

Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute force manner, by docking and scoring all available ligands.

Contribution: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as 'low-scoring'. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.

Results: We show on four different targets that conformal-prediction-based virtual screening (CPVS) reduces the number of docked molecules by 62.61% while retaining an accuracy of 94% on average for the top 30 hits and achieving a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.

Keywords: Apache Spark; Cloud computing; Conformal prediction; Docking; Virtual screening.

Figures

**Fig. 1**
Workflow of CPVS. Signatures were generated for the whole dataset, and two copies were kept, named *Ds* and *DsComplete*. An initial sample of *DsInit* molecules was taken at random from *Ds* and docked against a chosen receptor, and scores were calculated. To form a training set, docking scores were converted to class labels {0} and {1}, representing ‘low-scoring’ and ‘high-scoring’ ligands, respectively. This was done using a 10-bin histogram of the docking scores, where labels were assigned to ligands in different bins. An SVM-based conformal predictor model was trained on the training set and predictions were made on the whole dataset *DsComplete*. The molecules were classified as ‘low-scoring’ ligands {0}, ‘high-scoring’ ligands {1} or ‘unknown’. The predicted ‘low-scoring’ ligands were removed from *Ds* in each iteration and were hence never docked. Model efficiency was computed as the ratio of single-label predictions [30], i.e., {0} and {1}, to all predictions. The process was then repeated iteratively with a smaller data sample *DsIncr* from *Ds*, which was docked and labeled, and the model was re-trained until it reached an acceptable efficiency. Thereafter all remaining ‘high-scoring’ ligands were docked. The scores of all docked molecules were sorted and the accuracy for the top 30 molecules was computed against the results from an experiment where all molecules were docked [9].
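Three quantities in this workflow can be illustrated with a short Python sketch. The caption does not say which histogram bins map to which label or how top-30 accuracy is defined, so the choices here are assumptions: `low_bins` and `high_bins` are hypothetical parameters, numerically higher scores are treated as ‘high-scoring’, and accuracy is taken to be the overlap between the two top-k sets. All function names are illustrative.

```python
def label_by_histogram(scores, n_bins=10, low_bins=4, high_bins=2):
    """Assign 0 to ligands in the lowest bins and 1 to those in the highest.

    Ligands in the middle bins stay unlabeled (assumed behaviour).
    """
    lo, hi = min(scores.values()), max(scores.values())
    width = (hi - lo) / n_bins or 1.0   # avoid zero width if all scores are equal
    labels = {}
    for lig, score in scores.items():
        b = min(int((score - lo) / width), n_bins - 1)  # clip the maximum into the last bin
        if b < low_bins:
            labels[lig] = 0
        elif b >= n_bins - high_bins:
            labels[lig] = 1
    return labels

def efficiency(prediction_sets):
    """Ratio of single-label predictions ({0} or {1}) to all predictions."""
    sets = list(prediction_sets)
    return sum(len(s) == 1 for s in sets) / len(sets)

def top_k_accuracy(cpvs_scores, full_scores, k=30):
    """Fraction of the full-docking top k that also appears in the CPVS top k."""
    def top(d):
        return set(sorted(d, key=d.get, reverse=True)[:k])
    return len(top(cpvs_scores) & top(full_scores)) / k

scores = {f"lig{i}": float(i) for i in range(10)}
print(label_by_histogram(scores))
print(efficiency([{0}, {1}, {0, 1}, set()]))
```

Empty and two-label prediction sets both count as ‘unknown’ here, so they lower the efficiency.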

**Fig. 2**
Docking score histogram for 200 K ligands. An example docking score histogram for a sample of 200 K ligands, in log scale. The distribution is skewed right because few molecules have high scores, which is normal for these types of datasets: only a few ligands fit the target protein well, and the majority will not bind with high affinity.

**Fig. 3**
Benchmarking CPVS against parallel VS. On average, only 37.39% of the ligands were docked to reach an accuracy level of ~94%. By decreasing the number of docked molecules, CPVS saved more than two-thirds of the time and achieved an average speedup of 3.7 compared with Parallel VS [9].
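The figures quoted above are related by simple arithmetic, shown here as a small Python check. The helper names are illustrative, and the only inputs are the reported 62.61% reduction and the speedup of 3.7.

```python
def docked_fraction(reduction):
    # Share of ligands still docked after excluding the given fraction.
    return 1.0 - reduction

def time_saved(speedup):
    # A speedup of s means the run takes 1/s of the baseline time.
    return 1.0 - 1.0 / speedup

print(f"docked fraction: {docked_fraction(0.6261):.4f}")
print(f"time saved: {time_saved(3.7):.4f}")
```

A 62.61% reduction leaves 37.39% of the ligands to dock, and a speedup of 3.7 corresponds to a run time of 1/3.7 of the baseline, which is a saving of more than two-thirds.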

References

1. Mayr LM, Bojanic D. Novel trends in high-throughput screening. Curr Opin Pharmacol. 2009;2:580–588. doi: 10.1016/j.coph.2009.08.004.
2. Shoichet BK. Virtual screening of chemical libraries. Nature. 2004;432(7019):862. doi: 10.1038/nature03197.
3. Subramaniam S, Mehrotra M, Gupta D. Virtual high throughput screening (vHTS)-a perspective. Bioinformation. 2008;3(1):14–17. doi: 10.6026/97320630003014.
4. Shen M, Tian S, Pan P, Sun H, Li D, Li Y, Zhou H, Li C, Lee SMY, Hou T. Discovery of novel rock1 inhibitors via integrated virtual screening strategy and bioassays. Sci Rep. 2015;5:16749. doi: 10.1038/srep16749.
5. Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov. 2004;3(11):935–949. doi: 10.1038/nrd1549.
