Sci Data. 2023 Mar 28;10(1):173.
doi: 10.1038/s41597-023-01984-9.

SARS-CoV2 billion-compound docking

David M Rogers et al. Sci Data.

Abstract

This dataset contains ligand conformations and docking scores for 1.4 billion molecules docked against 6 structural targets from SARS-CoV-2, representing 5 unique proteins: MPro, NSP15, PLPro, RDRP, and the Spike protein. Docking was carried out using the AutoDock-GPU platform on the Summit supercomputer and Google Cloud. The docking procedure employed the Solis-Wets search method to generate 20 independent ligand binding poses per compound. Each compound geometry was scored using the AutoDock free energy estimate, and rescored using RFScore v3 and DUD-E machine-learned rescoring models. Input protein structures are included, suitable for use by AutoDock-GPU and other docking programs. As the result of an exceptionally large docking campaign, this dataset represents a valuable resource for discovering trends across small molecule and protein binding sites, training AI models, and comparing to inhibitor compounds targeting SARS-CoV-2. The work also gives an example of how to organize and process data from ultra-large docking screens.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Fig. 1
Workflow for creation of the docked, scored molecule dataset. Illustrations show: (top left) examples of the initial molecule structures and information available from Enamine, (center right) alternative generated molecular structures from Virtual Flow (VF), and (lower left) docked geometries. Virtual Flow's generated geometries are labeled by two additional numbers. The first number enumerates stereocenters and ring puckering variations. The second attempts to enumerate tautomerization states. In the example here, T1 and T2 differ by re-interpreting a ketone oxygen as an alcohol, adding double bonds and removing hydrogens from carbons in the central ring. The inputs are then passed to ligand preparation routines for AutoDock4 (AD4), which shares input requirements with AutoDock-GPU (AD-GPU). The docking results can then be rescored with machine learning (ML) techniques.
Fig. 2
Molecule atom and torsion count 2D histograms, plotted on a natural logarithmic scale to emphasize the tails of the distribution. Columns contain, from left to right, (atoms,tors), (atoms,score) and (tors,score). Score refers to AutoDock-GPU’s score. Atoms refers to the count of heavy atoms plus polar hydrogens present in the pdbqt file used for docking. Tors refers to the count of torsion degrees of freedom marked in that pdbqt. Marked points show our data values for all compounds listed in ref. (pooling all isomers and geometries).
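The effect of the logarithmic scale described in the Fig. 2 caption can be illustrated with a minimal sketch. The (atoms, tors) pairs below are made-up values, not data from the dataset, and the function name `log_histogram_2d` is hypothetical. The sketch counts how many molecules fall in each cell and stores the natural logarithm of the count, so sparsely populated cells in the tails stay distinguishable next to very full ones.

```python
import math
from collections import Counter

# Hypothetical (atom count, torsion count) pairs, one per molecule.
# These values are illustrative only.
pairs = [(20, 3), (20, 3), (20, 3), (25, 5), (25, 5), (31, 8)]

def log_histogram_2d(points):
    """Count points per (x, y) cell and return ln(count) for each occupied cell."""
    counts = Counter(points)
    return {cell: math.log(n) for cell, n in counts.items()}

hist = log_histogram_2d(pairs)
for cell in sorted(hist):
    print(cell, round(hist[cell], 3))
```

A cell holding a single molecule maps to 0 on this scale, while a cell holding a million molecules maps to about 13.8, which is why rare combinations remain visible in the plot.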
Fig. 3
Docking score 2D histograms, plotted on a logarithmic scale to emphasize the tails of the distribution. Columns contain, from left to right, (score,r3), (v2,r3) and (v2,score). Score refers to AutoDock-GPU’s score, r3 and v2 refer to re-scoring of each docked pose based on random-forest functions parameterized from RF3 and Virtual-Score DUD-E, respectively. Lines and boxes mark the score cutoffs used for creating top-10k lists based on single and joint-score criteria, respectively. Marked points show our data values for all compounds listed in ref. (pooling all isomers and geometries).
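The single-score and joint-score selections mentioned in the Fig. 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the record fields, values, and helper names are hypothetical; it assumes a lower AutoDock-GPU score is better and a higher rescoring value is better; and it reuses each single-score cutoff as the edge of the joint box, whereas the caption does not say how the joint cutoffs were chosen.

```python
import heapq

# Hypothetical records, one per docked pose. Field names and values are
# illustrative and are not taken from the dataset.
poses = [
    {"name": "mol_a", "score": -9.1, "r3": 6.8, "v2": 0.71},
    {"name": "mol_b", "score": -7.4, "r3": 7.2, "v2": 0.35},
    {"name": "mol_c", "score": -10.3, "r3": 5.9, "v2": 0.88},
    {"name": "mol_d", "score": -6.0, "r3": 7.5, "v2": 0.92},
    {"name": "mol_e", "score": -8.8, "r3": 7.0, "v2": 0.80},
]

def top_n_single(records, key, n, lower_is_better):
    """Return the n best records by one score, best first."""
    pick = heapq.nsmallest if lower_is_better else heapq.nlargest
    return pick(n, records, key=lambda r: r[key])

def cutoff_for_top_n(records, key, n, lower_is_better):
    """Return the score of the n-th best record, i.e. the single-score cutoff."""
    return top_n_single(records, key, n, lower_is_better)[-1][key]

def joint_box(records, score_cut, r3_cut):
    """Keep records that pass both cutoffs at once (a box in the 2D score plane)."""
    return [r for r in records if r["score"] <= score_cut and r["r3"] >= r3_cut]

# Top-3 here stands in for the top-10k lists of the caption.
best_by_score = top_n_single(poses, "score", 3, lower_is_better=True)
score_cut = cutoff_for_top_n(poses, "score", 3, lower_is_better=True)
r3_cut = cutoff_for_top_n(poses, "r3", 3, lower_is_better=False)
joint = joint_box(poses, score_cut, r3_cut)

print([r["name"] for r in best_by_score])
print([r["name"] for r in joint])
```

A single cutoff corresponds to a line in the 2D histogram and a pair of cutoffs to a box, which matches how the caption describes the markings. `heapq.nsmallest` and `heapq.nlargest` avoid sorting the full list, which matters more as the number of records grows.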
Fig. 4
Redocking RMSD distribution against the Mpro Fragalysis dataset (left) and the AutoDock-GPU set of 42 (right), measuring the deviation between a predicted docking pose and an actual crystal structure. The Fragalysis dataset (left) docks 426 ligands for which bound crystal structures are available from the Diamond dataset, and compares the displacement for only the top-scoring pose from AutoDock Vina (blue), the current version of AutoDock-GPU (red, 1.5.3) and the version of AutoDock-GPU used to generate the dataset (black, June 2020). Beyond the top poses, the distribution over all 20 predicted poses for both versions of AutoDock-GPU is given as a dashed line. The set of 42 (right) similarly compares the cumulative RMSD distribution generated after redocking.
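The two quantities behind the Fig. 4 caption, the RMSD between a predicted pose and a crystal pose and the cumulative distribution of those RMSD values, can be sketched as below. This is a simplified illustration, not the authors' evaluation code: it assumes atoms are listed in the same order in both poses, that both poses are already in the receptor's coordinate frame (so no superposition is applied), and it ignores molecular symmetry. The function names and all numbers are hypothetical.

```python
import bisect
import math

def rmsd(pose, reference):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if len(pose) != len(reference):
        raise ValueError("pose and reference must have the same number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(pose, reference)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose))

def cumulative_fraction(values, thresholds):
    """Fraction of values less than or equal to each threshold."""
    ordered = sorted(values)
    return [bisect.bisect_right(ordered, t) / len(ordered) for t in thresholds]

# Two-atom toy ligand: the predicted pose is shifted by 1 unit along z.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, crystal))

# Made-up RMSD values for four redocked ligands.
rmsds = [0.5, 1.5, 2.5, 4.0]
print(cumulative_fraction(rmsds, [1.0, 2.0, 3.0]))
```

Reading the cumulative curve at a given threshold gives the share of ligands redocked to within that distance of the crystal pose, which is how the curves for different programs and versions can be compared.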
Fig. 5
Static image of molecular viewer. Each point on the scatter-plot of AD-GPU, RF3, and VS-DUDE-v2 scores represents one molecule. Selecting the scored point shows its three docked poses in the 3D structure at left, along with information boxes showing the molecule name and numerical data below.
Fig. 6
Use of pandas and pybel to select hit molecules containing dicarboximide functional groups. After loading a top-N dataset, molecules are parsed by openbabel and then a SMARTS pattern search is applied to each row. The last line of the program shows an example of changing the index of the dataset to separate molecules and enumerated rotamers.

References

    1. Singh S, Bani Baker Q, Singh DB. Molecular docking and molecular dynamics simulation. In: Singh DB, Pathak RK, editors. Bioinformatics. Academic Press; 2022. chap. 18, p. 291–304. doi: 10.1016/B978-0-323-89775-4.00014-6.
    2. Vermaas JV, et al. Supercomputing pipelines search for therapeutics against COVID-19. Computing in Science & Engineering. 2021;23:7–16. doi: 10.1109/MCSE.2020.3036540.
    3. Ton A-T, Gentile F, Hsing M, Ban F, Cherkasov A. Rapid identification of potential inhibitors of SARS-CoV-2 main protease by deep docking of 1.3 billion compounds. Molecular Informatics. 2020;39:2000028. doi: 10.1002/minf.202000028.
    4. Gorgulla C, et al. A multi-pronged approach targeting SARS-CoV-2 proteins using ultra-large virtual screening. iScience. 2021;24:102021. doi: 10.1016/j.isci.2020.102021.
    5. Acharya A, et al. Supercomputer-based ensemble docking drug discovery pipeline with application to Covid-19. Journal of Chemical Information and Modeling. 2020;60:5832–5852. doi: 10.1021/acs.jcim.0c01010.