J Cheminform. 2018 Mar 1;10(1):8.
doi: 10.1186/s13321-018-0265-z.

Efficient iterative virtual screening with Apache Spark and conformal prediction

Laeeq Ahmed et al. J Cheminform. 2018.

Abstract

Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute force manner, by docking and scoring all available ligands.

Contribution: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands so that those predicted as 'low-scoring' can be excluded. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.

Results: We show on four different targets that conformal-prediction-based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an accuracy of 94% for the top 30 hits on average, with a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
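
The headline numbers can be related with a little arithmetic. The sketch below assumes that the speedup S is the ratio of the wall-clock time of docking everything to the wall-clock time of CPVS; the abstract does not spell out this definition, so treat it as an assumption.

```latex
\[
\text{fraction docked} = 1 - 0.6261 = 0.3739,
\qquad
\text{time saved} = 1 - \frac{1}{S} = 1 - \frac{1}{3.7} \approx 0.73 > \tfrac{2}{3}.
\]
```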

Keywords: Apache Spark; Cloud computing; Conformal prediction; Docking; Virtual screening.
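
The iterative strategy described under Contribution can be sketched as a plain control loop. This is a minimal single-machine sketch, not the authors' Spark implementation: the function name `iterative_screen`, the injected callables `dock` and `fit_predict`, and all parameter names are hypothetical and chosen only for illustration. `fit_predict` is assumed to retrain a model on the scores gathered so far and return one label set per ligand: `{0}` (low-scoring), `{1}` (high-scoring), or anything else (unknown).

```python
import random
from typing import Callable, Dict, List, Set


def iterative_screen(
    ligands: List[str],
    dock: Callable[[str], float],
    fit_predict: Callable[[Dict[str, float], List[str]], Dict[str, Set[int]]],
    init_size: int,
    incr_size: int,
    target_efficiency: float,
    seed: int = 0,
) -> Dict[str, float]:
    """Dock in batches, retrain, and drop ligands predicted low-scoring."""
    rng = random.Random(seed)
    pool = list(ligands)            # ligands still eligible for docking
    scores: Dict[str, float] = {}   # docking scores obtained so far
    pred: Dict[str, Set[int]] = {}
    batch = init_size
    while True:
        # Dock a random batch of eligible, not-yet-docked ligands.
        candidates = [lig for lig in pool if lig not in scores]
        for lig in rng.sample(candidates, min(batch, len(candidates))):
            scores[lig] = dock(lig)
        # Retrain on everything docked so far and predict the whole library.
        pred = fit_predict(scores, ligands)
        # Ligands predicted {0} are removed and therefore never docked.
        pool = [lig for lig in pool if lig in scores or pred[lig] != {0}]
        # Efficiency: share of single-label predictions among all predictions.
        single = sum(1 for labels in pred.values() if len(labels) == 1)
        efficiency = single / len(pred) if pred else 1.0
        undocked = [lig for lig in pool if lig not in scores]
        if efficiency >= target_efficiency or not undocked:
            break
        batch = incr_size
    # Final pass: dock the remaining ligands predicted high-scoring.
    for lig in pool:
        if lig not in scores and pred[lig] == {1}:
            scores[lig] = dock(lig)
    return scores
```

Passing the docking routine and the model as callables keeps the loop independent of any particular docking program or learner, so a real docking call and an SVM-based conformal predictor could be substituted for toy stand-ins.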

Figures

Fig. 1
Workflow of CPVS. Signatures were generated for the whole dataset, which was kept in two copies named Ds and DsComplete. An initial sample of DsInit molecules was drawn at random from Ds and docked against a chosen receptor, and docking scores were calculated. To form a training set, the docking scores were converted to class labels {0} and {1}, representing ‘low-scoring’ and ‘high-scoring’ ligands, respectively. This was done using a 10-bin histogram of the docking scores, where labels were assigned to ligands in different bins. An SVM-based conformal predictor was trained on the training set, and predictions were made on the whole dataset DsComplete. The molecules were classified as ‘low-scoring’ ligands {0}, ‘high-scoring’ ligands {1}, or ‘unknown’. The predicted ‘low-scoring’ ligands were removed from Ds in each iteration and were hence never docked. Model efficiency was computed as the ratio of single-label predictions [30], i.e., {0} and {1}, to all predictions. The process was then repeated iteratively with a smaller data sample, DsIncr, drawn from Ds, which was docked and labeled, and the model was retrained until it reached an acceptable efficiency. Thereafter, all remaining ‘high-scoring’ ligands were docked. The scores of all docked molecules were sorted, and the accuracy for the top 30 molecules was computed against the results from an experiment in which all molecules were docked [9].
Fig. 2
Docking score histogram for 200 K ligands. An example docking score histogram for a sample of 200 K ligands, shown in log scale. The distribution is skewed right because few molecules have high scores, which is normal for this type of dataset: only a few ligands fit the target protein well, and the majority will not bind with high affinity.
Fig. 3
Benchmarking CPVS against parallel VS. On average, only 37.39% of the ligands were docked to reach an accuracy level of 94%. By decreasing the number of docked molecules, CPVS saves more than two-thirds of the time and achieves an average speedup of 3.7 in comparison to Parallel VS [9].
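
The three prediction outcomes in the workflow caption ({0}, {1}, and 'unknown') and the efficiency measure can be illustrated with a small class-conditional (Mondrian) inductive conformal classifier. This is a sketch under stated assumptions, not the authors' code: it takes precomputed real-valued decision values (for example from an SVM, positive favoring class 1) instead of training a model, and the nonconformity measure and all function names are hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def nonconformity(decision_value: float, label: int) -> float:
    """Assumed measure: higher means the example looks less like `label`."""
    return -decision_value if label == 1 else decision_value


def p_values(
    calibration: List[Tuple[float, int]], decision_value: float
) -> Dict[int, float]:
    """Per-class p-values from a labeled calibration set of (value, label)."""
    result: Dict[int, float] = {}
    for label in (0, 1):
        # Mondrian step: compare only against calibration examples of `label`.
        cal = [nonconformity(d, y) for d, y in calibration if y == label]
        alpha = nonconformity(decision_value, label)
        at_least = sum(1 for a in cal if a >= alpha)
        result[label] = (at_least + 1) / (len(cal) + 1)
    return result


def prediction_set(pvals: Dict[int, float], significance: float) -> Set[int]:
    """Keep every label whose p-value exceeds the significance level."""
    return {label for label, p in pvals.items() if p > significance}


def efficiency(sets: Iterable[Set[int]]) -> float:
    """Share of single-label predictions ({0} or {1}) among all predictions."""
    sets = list(sets)
    return sum(1 for s in sets if len(s) == 1) / len(sets)
```

A prediction set that is empty or contains both labels corresponds to the 'unknown' outcome and does not count toward efficiency. The smallest attainable p-value is 1/(n + 1) for a class with n calibration examples, so small calibration sets limit how low the significance level can usefully be set.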

References

    1. Mayr LM, Bojanic D. Novel trends in high-throughput screening. Curr Opin Pharmacol. 2009;2:580–588. doi: 10.1016/j.coph.2009.08.004.
    2. Shoichet BK. Virtual screening of chemical libraries. Nature. 2004;432(7019):862. doi: 10.1038/nature03197.
    3. Subramaniam S, Mehrotra M, Gupta D. Virtual high throughput screening (vHTS)-a perspective. Bioinformation. 2008;3(1):14–17. doi: 10.6026/97320630003014.
    4. Shen M, Tian S, Pan P, Sun H, Li D, Li Y, Zhou H, Li C, Lee SMY, Hou T. Discovery of novel ROCK1 inhibitors via integrated virtual screening strategy and bioassays. Sci Rep. 2015;5:16749. doi: 10.1038/srep16749.
    5. Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov. 2004;3(11):935–949. doi: 10.1038/nrd1549.