J Cheminform. 2017 Mar 6;9:15.
doi: 10.1186/s13321-017-0204-4. eCollection 2017.

Large-scale virtual screening on public cloud resources with Apache Spark

Marco Capuccini et al. J Cheminform.

Abstract

Background: Structure-based virtual screening is an in silico method for screening a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it is a trivially parallelizable task. Most of the available parallel implementations are based on the Message Passing Interface (MPI) and rely on hardware with a low failure rate and on fast network connections. Google's MapReduce revolutionized large-scale analysis by enabling the processing of massive datasets on commodity hardware and cloud resources, with transparent scalability and fault tolerance provided at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark.

Results: We developed a method to run existing docking-based screening software on distributed cloud resources using the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, by docking a publicly available target receptor against ~2.2 M compounds. The performance experiments show good parallel efficiency (87%) when running in a public cloud environment.

Conclusion: Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve makes it possible to try out our method on relatively small libraries first and then scale to larger libraries. Our implementation is named Spark-VS, and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).

Keywords: Apache Spark; Cloud computing; Docking; Virtual screening.

Figures

Graphical abstract
Fig. 1
SBVS pipeline in Spark-VS. This example pipeline reads a molecular library in SDF format, docks it against a target receptor and returns the 10 top-scoring molecules. The dock primitive takes as parameters a receptor structure in the OEDocking TK binary format, a scoring method and a search resolution for the underlying docking software. In addition, the saveAsTextFile primitive is used to checkpoint all of the poses after the docking phase. This is a best practice, as docking is time-consuming.
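The shape of the pipeline in Fig. 1 (split the library into molecules, score each one independently, keep the best N) can be sketched in plain Python. This is only an illustration of the map-then-reduce pattern, not the Spark-VS API: split_sdf, toy_score and screen are hypothetical names, toy_score is a placeholder that does no docking, and higher scores are assumed to be better (reverse the ranking if the scoring function assigns lower values to better poses).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_sdf(text: str) -> List[str]:
    """Split SDF text into molecule records; each record ends with a '$$$$' line.

    A trailing fragment without a closing '$$$$' line is dropped.
    """
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
    return records


def toy_score(record: str) -> float:
    """Hypothetical stand-in for a docking call: NOT a real scoring function.

    It returns the length of the record's title line so the sketch can run.
    """
    return float(len(record.splitlines()[0]))


def screen(
    records: List[str],
    score: Callable[[str], float],
    top_n: int = 10,
    workers: int = 4,
) -> List[Tuple[float, str]]:
    """Map step: score every record independently. Reduce step: keep the top_n."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(zip(pool.map(score, records), records))
    # Assumption: higher score is better.
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[0])


# Tiny made-up library with three records that contain only a title line.
demo = "a\n$$$$\nccc\n$$$$\nbb\n$$$$\n"
top = screen(split_sdf(demo), toy_score, top_n=2)
print([s for s, _ in top])
```

Because each molecule is scored independently, the map step has no shared state, which is what makes the task trivially parallelizable. In a fault-tolerant setting the scored records would also be written out before the reduce step, in the spirit of the checkpoint described in the caption.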
Fig. 2
Weak scaling efficiency plot. Each bar represents a different run and shows how efficiently the respective vCPUs were used. In each consecutive run, the input size was increased by one work unit, along with the number of vCPUs. The trend curve was computed by second-degree polynomial interpolation.
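For readers unfamiliar with the metric, weak scaling efficiency is commonly defined as the running time of the baseline run (one work unit on the baseline resources) divided by the running time of a run in which both the workload and the resources are scaled by the same factor. The sketch below assumes that common definition, which this page does not spell out, and uses made-up timings rather than the paper's measurements.

```python
from typing import Dict


def weak_scaling_efficiency(baseline_time: float, scaled_time: float) -> float:
    """Return T(1) / T(N): 1.0 means the extra work was absorbed perfectly."""
    if baseline_time <= 0 or scaled_time <= 0:
        raise ValueError("running times must be positive")
    return baseline_time / scaled_time


# Hypothetical running times in seconds, keyed by the scale factor N
# (N work units processed on N times the baseline vCPUs).
runs: Dict[int, float] = {1: 100.0, 2: 104.0, 4: 110.0, 8: 115.0}

efficiencies = {
    n: weak_scaling_efficiency(runs[1], t) for n, t in sorted(runs.items())
}
for n, eff in efficiencies.items():
    print(f"scale factor {n}: efficiency {eff:.1%}")
```

Under this definition an efficiency below 100% means that the scaled run took longer than the baseline even though the work per vCPU stayed constant.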
Fig. 3
Docking time per molecule. The histogram shows the distribution of serial docking times over the molecules in the benchmark dataset (2.2 M), divided into equally spaced bins. Note that the number of molecules is plotted on a logarithmic scale.
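A histogram of this kind can be built by assigning each docking time to one of a fixed number of equally wide bins and then taking the logarithm of each bin count for display. The following is a minimal sketch with made-up timings; the function name and the number of bins are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Optional, Tuple


def equal_width_histogram(
    values: List[float], n_bins: int
) -> Tuple[List[float], List[int]]:
    """Return (bin edges, counts) for n_bins equally spaced bins over the data range."""
    lo, hi = min(values), max(values)
    # Fall back to a unit width if all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than to a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical docking times in seconds (not data from the paper).
times = [1.0, 1.5, 2.0, 2.5, 3.0, 9.0, 10.0]
edges, counts = equal_width_histogram(times, n_bins=3)

# Log-scaled counts for plotting; empty bins have no logarithm.
log_counts: List[Optional[float]] = [
    math.log10(c) if c > 0 else None for c in counts
]
print(edges, counts, log_counts)
```

A logarithmic count axis keeps sparsely populated bins visible when a few molecules take much longer to dock than the bulk of the library.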
