J Chem Inf Model. 2011 Sep 26;51(9):2115-31.

doi: 10.1021/ci200269q. Epub 2011 Aug 29.

CSAR benchmark exercise of 2010: combined evaluation across all submitted scoring functions

Richard D Smith¹, James B Dunbar Jr, Peter Man-Un Ung, Emilio X Esposito, Chao-Yie Yang, Shaomeng Wang, Heather A Carlson

PMID: 21809884
PMCID: PMC3186041
DOI: 10.1021/ci200269q

Free PMC article

Richard D Smith et al. J Chem Inf Model. 2011.

Affiliation

¹ Department of Medicinal Chemistry, University of Michigan, Ann Arbor, Michigan 48109-1065, United States.

Abstract

As part of the Community Structure-Activity Resource (CSAR) center, a set of 343 high-quality protein-ligand crystal structures was assembled with experimentally determined K_d or K_i information from the literature. We encouraged the community to score the crystallographic poses of the complexes by any method of their choice. The goals of the exercise were to (1) evaluate the current ability of the field to predict activity from structure and (2) investigate the properties of the complexes and methods that appear to hinder scoring. A total of 19 different methods were submitted with numerous parameter variations, giving 64 sets of scores from 16 participating groups. Linear regression and nonparametric tests were used to correlate scores to the experimental values. Correlation to experiment for the various methods ranged from R² = 0.58 to 0.12, Spearman ρ = 0.74 to 0.37, Kendall τ = 0.55 to 0.25, and median unsigned error = 1.00 to 1.68 pK_d units. All types of scoring functions (force-field-based, knowledge-based, and empirical) had examples with high and low correlation, showing no bias or advantage for any particular approach. The data across all the participants were combined to identify 63 complexes that were poorly scored across the majority of the scoring methods and 123 complexes that were scored well across the majority. The two sets were compared using a Wilcoxon rank-sum test to assess any significant difference in the distributions of >400 physicochemical properties of the ligands and the proteins. Poorly scored complexes were found to have ligands that were the same size as those in well-scored complexes, but hydrogen bonding and torsional strain were significantly different. These comparisons point to a need for CSAR to develop data sets of congeneric series with a range of hydrogen-bonding and hydrophobic characteristics and a range of rotatable bonds.
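The statistics named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function names, the toy data, the choice of Kendall tau-a, and the normal approximation without tie correction for the rank-sum statistic are all assumptions made for illustration.

```python
import math
from statistics import median


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation; its square is the R^2 of a simple linear fit."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / all pairs; ties count as neither."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def median_unsigned_error(pred, expt):
    """Median of |predicted - experimental|, in the units of the inputs."""
    return median(abs(p - e) for p, e in zip(pred, expt))


def rank_sum_z(a, b):
    """Wilcoxon rank-sum z statistic (normal approximation, no tie correction)."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean_w) / sd_w


# Toy pK_d values, invented for illustration only.
expt = [4.0, 5.0, 6.0, 7.0, 8.0]
calc = [4.5, 4.8, 6.9, 6.5, 8.3]
r_squared = pearson_r(expt, calc) ** 2
```

On real data a tested library routine (for example from SciPy) would normally be preferred; the hand-written versions here only make the definitions explicit.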

Figures

**Figure 1**
Example of comparing a set of scores, pK_d (calculated), to their corresponding experimentally determined affinities. (Top) When fitting a line (black) using least-squares linear regression, the distance in the y direction between each data point and the line is its residual. (Bottom) The residuals for all the data points have a normal distribution around zero. The characteristics are well-defined, including the definition of standard deviation (σ in red, which happens to be 1.4 pK_d in this example) and the number of data points with residuals outside ± σ (15.8% in each tail). Higher correlations lead to larger R² and smaller σ; weaker correlations lead to lower R² and larger σ, but the distributions remain Gaussian in shape.
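The quantities in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only: it assumes experimental affinity on the x axis and the calculated score on the y axis, and it uses the population form (divisor n) for σ, neither of which is stated in the caption. The helper names are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(x, y):
    """Vertical (y-direction) distance of each point from the fitted line."""
    slope, intercept = fit_line(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


def residual_sigma(res):
    """Standard deviation of residuals about zero (assumed divisor: n)."""
    return math.sqrt(sum(r * r for r in res) / len(res))


def gaussian_tail_fraction(k=1.0):
    """Fraction of a normal distribution lying beyond +k sigma (one tail)."""
    return 0.5 * math.erfc(k / math.sqrt(2))
```

For k = 1 the one-tail fraction evaluates to about 0.159, in line with the roughly 16% per tail quoted in the caption.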

**Figure 2**
Crystal structure of FXa bound with a 5 pM ligand (PDB id 2p3t). The ligand is very exposed with few hydrogen bonds to the protein.

**Figure 3**
Least-squares linear regression of the 17 core scoring functions. Black lines are the linear regression fit. Red lines indicate +σ and −σ, the standard deviation of the residuals. Blue points are UNDER complexes which were underscored in ≥12 of the 17 functions. The red points are OVER complexes which were overscored in ≥12 of the 17 functions.
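The ≥12-of-17 consensus rule in the Figure 3 caption might be expressed as the sketch below. It assumes that "underscored" means a residual below −σ and "overscored" a residual above +σ for a given function; the caption shows the ±σ lines but does not spell out that threshold, so the convention, the input layout, and the function name are all hypothetical.

```python
def consensus_outliers(residuals_by_function, sigmas, min_votes=12):
    """Return (under, over) lists of complex indices.

    residuals_by_function: one list of per-complex residuals per scoring
        function, all in the same complex order (assumed layout).
    sigmas: the residual standard deviation of each scoring function.
    min_votes: how many functions must agree (12 of 17 in the caption).
    """
    n_complexes = len(residuals_by_function[0])
    under, over = [], []
    for c in range(n_complexes):
        # Count functions that place this complex below -sigma or above +sigma.
        n_under = sum(res[c] < -s for res, s in zip(residuals_by_function, sigmas))
        n_over = sum(res[c] > s for res, s in zip(residuals_by_function, sigmas))
        if n_under >= min_votes:
            under.append(c)
        if n_over >= min_votes:
            over.append(c)
    return under, over
```

Because each function carries its own σ, scores on different scales can be combined without first converting them to common units.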

**Figure 4**
Comparison of experimental and calculated values from the nine functions which predicted absolute binding affinity, listed roughly in order of increasing Med |Err| and RMSE. Black lines represent perfect agreement. The red lines indicate +Med |Err| and −Med |Err| from the black line. The blue circles denote complexes for which ≥7 of the 9 methods have consistently underestimated the affinity by at least Med |Err|, while the red circles are those where the affinity was overestimated.

**Figure 5**
Distributions of binding affinities in the GOOD and BAD complexes (left) are compared to those of the NULL case (right). The NULL case is generated from the sets of all complexes with affinities ≤50 nM (high), 50 nM–50 μM (middle), and ≥50 μM (low). The midrange of affinities is highlighted with a wide, gray bar in both panels.
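The three affinity ranges in the Figure 5 caption can be written as a small binning helper. This is a sketch under assumptions: affinities are supplied in molar units, and the boundary values of exactly 50 nM and 50 μM (which the caption's ranges share) are assigned to the high and low bins respectively. The names are hypothetical.

```python
import math
from collections import Counter

HIGH_MAX_M = 50e-9  # 50 nM: at or below this is "high" affinity
LOW_MIN_M = 50e-6   # 50 uM: at or above this is "low" affinity


def pkd(kd_molar):
    """Convert a dissociation constant in mol/L to pK_d = -log10(K_d)."""
    return -math.log10(kd_molar)


def affinity_bin(kd_molar):
    """Assign a K_d (mol/L) to the high, middle, or low affinity range."""
    if kd_molar <= HIGH_MAX_M:
        return "high"
    if kd_molar >= LOW_MIN_M:
        return "low"
    return "middle"


def bin_fractions(kd_values):
    """Fraction of complexes falling in each affinity range."""
    counts = Counter(affinity_bin(kd) for kd in kd_values)
    return {name: counts[name] / len(kd_values) for name in ("high", "middle", "low")}
```

In pK_d terms the two cutoffs fall near 7.3 and 4.3, so a 5 pM ligand such as the FXa example in Figure 2 lands well inside the high bin.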

**Figure 6**
Distributions of amino acids in the binding sites of the GOOD and BAD complexes meeting the ≥12 of 17 definition (left) are compared to those of the NULL case (right). The graph in the lower left provides the distribution of all amino acids in the full protein sequences to show that the important trends do not result from inherent differences in composition of the proteins (the same is true of the NULLs, data not shown). Metals and modified residues are denoted as other, “OTH”. Averages and error bars for the amino acid content were determined by bootstrapping.
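One way bootstrapped averages and error bars of this kind could be obtained is sketched below. The caption does not describe the procedure, so the resampling unit (whole complexes, drawn with replacement), the number of resamples, and the input layout are assumptions, and the function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev


def composition(residues):
    """Fraction of each residue type in a list of three-letter codes."""
    counts = Counter(residues)
    total = len(residues)
    return {aa: n / total for aa, n in counts.items()}


def bootstrap_fraction(binding_sites, residue, n_resamples=1000, seed=0):
    """Bootstrap mean and standard deviation of one residue's fraction.

    binding_sites: list of residue lists, one per complex (assumed layout).
    Each resample draws len(binding_sites) complexes with replacement and
    pools their binding-site residues before computing the fraction.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_resamples):
        sample = [rng.choice(binding_sites) for _ in binding_sites]
        pooled = [aa for site in sample for aa in site]
        estimates.append(pooled.count(residue) / len(pooled))
    return mean(estimates), stdev(estimates)
```

The returned standard deviation plays the role of the error bar; residues that are neither standard amino acids nor listed separately would be passed in under the "OTH" label used in the figure.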

Grants and funding

U01 GM086873/GM/NIGMS NIH HHS/United States
