Influence relevance voting: an accurate and interpretable virtual high throughput screening method

S Joshua Swamidass et al. J Chem Inf Model. 2009 Apr;49(4):756-66. doi: 10.1021/ci8004379.

Abstract

Given activity training data from high-throughput screening (HTS) experiments, virtual high-throughput screening (vHTS) methods aim to predict in silico the activity of untested chemicals. We present a novel method, the Influence Relevance Voter (IRV), specifically tailored for the vHTS task. The IRV is a low-parameter neural network which refines a k-nearest neighbor classifier by nonlinearly combining the influences of a chemical's neighbors in the training set. Influences are decomposed, also nonlinearly, into a relevance component and a vote component. The IRV is benchmarked using the data and rules of two large, open competitions, and its performance is compared to that of other participating methods, as well as to an in-house support vector machine (SVM) method. On these benchmark data sets, the IRV achieves state-of-the-art results, comparable to the SVM in one case and significantly better than the SVM in the other, retrieving three times as many actives in the top 1% of its prediction-sorted list. The IRV presents several other important advantages over SVMs and other methods: (1) the output predictions have a probabilistic semantics; (2) the underlying inferences are interpretable; (3) the training time is very short, on the order of minutes even for very large data sets; (4) the risk of overfitting is minimal, due to the small number of free parameters; and (5) additional information can easily be incorporated into the IRV architecture. Combined with its performance, these qualities make the IRV particularly well suited for vHTS.
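For concreteness, the prediction rule implied by the abstract and by Figure 1 can be written as a small set of equations. The decomposition into a relevance term and a vote term follows the text; the specific functional forms chosen here, a logistic relevance in the neighbor's similarity s_i and rank r_i and a class-dependent vote weight, are illustrative assumptions rather than a quotation from the paper.

z(\mathcal{X}) = \sigma\!\left(w_z + \sum_{i=1}^{k} I_i\right), \qquad I_i = R_i \, V_i,

R_i = \sigma\!\left(w_y + w_s\, s_i + w_r\, r_i\right) \quad \text{(relevance of neighbor } \mathcal{N}_i\text{)},

V_i = \begin{cases} w_1 & \text{if } \mathcal{N}_i \text{ is active} \\ w_0 & \text{if } \mathcal{N}_i \text{ is inactive} \end{cases} \quad \text{(vote)}, \qquad \sigma(u) = \frac{1}{1 + e^{-u}}.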


Figures

Figure 1
The neural network architecture of the IRV, using a notation similar to the plate notation of graphical models. The structure and weights depicted inside the plate are replicated for, and shared with, each neighbor i. Thick lines depict elements of the graph learned from the data. Dotted lines depict portions of the network computed during the preprocessing step, during which the test chemical, 𝒳, is used to retrieve a list of neighbors, {𝒩1,…,𝒩k}, from all the chemicals in the training data. The influences Ii of each neighbor are summed into a logistic output node to produce the final prediction, z(𝒳).
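To make the architecture described in this caption concrete, the following is a minimal NumPy sketch of a single IRV forward pass. The Tanimoto fingerprint similarity, the 20-neighbor default, and the particular relevance/vote parameterization are assumptions made for illustration; only the overall structure (neighbor retrieval during preprocessing, per-neighbor influence, summation into a logistic output node) is taken from the caption and abstract.

import numpy as np

def sigmoid(u):
    """Logistic function used by both the relevance and output nodes."""
    return 1.0 / (1.0 + np.exp(-u))

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprint vectors."""
    both = np.logical_and(fp_a, fp_b).sum()
    either = np.logical_or(fp_a, fp_b).sum()
    return both / either if either else 0.0

def irv_predict(x_fp, train_fps, train_labels, w, k=20):
    """One IRV forward pass for a test chemical X, following Figure 1.

    `w` holds the handful of shared weights (w_z, w_y, w_s, w_r, w_0, w_1);
    the relevance/vote parameterization below is an illustrative assumption,
    not a quotation from the paper.
    """
    # Preprocessing step (dotted lines in Figure 1): retrieve the k most
    # similar training chemicals as neighbors N_1, ..., N_k.
    sims = np.array([tanimoto(x_fp, fp) for fp in train_fps])
    neighbors = np.argsort(-sims)[:k]

    total_influence = 0.0
    for rank, idx in enumerate(neighbors, start=1):
        # Relevance: how strongly this neighbor should be listened to.
        relevance = sigmoid(w["w_y"] + w["w_s"] * sims[idx] + w["w_r"] * rank)
        # Vote: the direction the neighbor pulls, depending on its class.
        vote = w["w_1"] if train_labels[idx] == 1 else w["w_0"]
        total_influence += relevance * vote

    # Influences are summed into a logistic output node to give z(X).
    return sigmoid(w["w_z"] + total_influence)

Because the per-neighbor weights sit inside the plate and are shared across all k neighbors, the number of free parameters stays small regardless of training-set size, which is consistent with the short training times and low overfitting risk claimed in the abstract.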
Figure 2
The influences on an accurately predicted hit from the HIV dataset, obtained in the cross-validation experiment, are displayed as a bar graph. The IRV used to compute these influences was trained using the 20 nearest neighbors. The experimental data both supporting and countering the prediction are readily apparent. The structures of four neighbors, two actives and two inactives, are shown: the compounds on the right are structurally similar and active, while the compounds on the left are structurally similar and inactive.
Figure 3
The 10-fold cross-validated performances of various algorithms on the full HIV data are displayed using (1) a standard ROC curve and (2) a pROC curve. The pROC curve plots the False Positive Rate on a logarithmic axis to emphasize the crucial initial portion of the ROC curve.
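For readers who want to reproduce this style of plot, the sketch below draws a standard ROC curve and its pROC counterpart side by side from synthetic scores; the data and the scikit-learn/matplotlib choices are illustrative, the only substantive point being the logarithmic False Positive Rate axis.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Synthetic labels and scores standing in for prediction-sorted vHTS output.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = y_true * 0.5 + rng.normal(size=1000)

fpr, tpr, _ = roc_curve(y_true, y_score)

fig, (ax_roc, ax_proc) = plt.subplots(1, 2, figsize=(9, 4))
ax_roc.plot(fpr, tpr)
ax_roc.set(title="ROC", xlabel="False Positive Rate", ylabel="True Positive Rate")

# pROC: the same curve, but the FPR axis is logarithmic, which magnifies the
# early-retrieval region that matters most in virtual screening.
ax_proc.semilogx(fpr, tpr)
ax_proc.set(title="pROC", xlabel="False Positive Rate (log scale)",
            ylabel="True Positive Rate")
plt.show()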
Figure 4
The top twelve accurately predicted hits from an IRV cross-validated on the McMaster dataset. Prediction-sorted rankings are in bold and the IRV outputs are in parentheses. Strikingly, the top hits are ranked at the very top of the prediction-sorted list. Although they exhibit a high degree of similarity, the top hits span several different scaffolds.
Figure 5
The 10-fold cross-validated performances on the DHFR data are displayed using (a) a standard ROC curve and (b) a pROC curve. The pROC curve plots the False Positive Rate (FPR) on a logarithmic axis to emphasize the crucial initial portion of the ROC curve.
Figure 6
Examples of IRV architectural extensions. The variable d denotes the docking score derived by fitting each test chemical, 𝒳, into the binding pocket of a known protein target, depicted as a cloud in the figure. The grayed-out rectangle delimits a portion of the network that could be trivially replaced with an arbitrarily complex neural network to generate an IRV with increased modeling power; additional replaceable portions can be imagined. The variable ai denotes the real-valued activity of each neighbor, 𝒩i, in the HTS screen.
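As a purely hypothetical illustration of the extensions described in this caption, the sketch below builds on the earlier forward-pass sketch (reusing its numpy import and the sigmoid and tanimoto helpers), feeds a docking-score term into the output node, and replaces the binary vote with a graded vote proportional to each neighbor's real-valued activity a_i. The weights w_d and w_a and the specific wiring are assumptions, not the paper's design.

def irv_predict_extended(x_fp, docking_score, train_fps, train_activities, w, k=20):
    """Hypothetical IRV extension: a docking score d enters the output node and
    each neighbor's real-valued HTS activity a_i replaces the binary vote."""
    sims = np.array([tanimoto(x_fp, fp) for fp in train_fps])
    neighbors = np.argsort(-sims)[:k]

    total_influence = 0.0
    for rank, idx in enumerate(neighbors, start=1):
        relevance = sigmoid(w["w_y"] + w["w_s"] * sims[idx] + w["w_r"] * rank)
        # Graded vote: scale by the neighbor's measured activity a_i.
        vote = w["w_a"] * train_activities[idx]
        total_influence += relevance * vote

    # Docking score d enters the logistic output node alongside the influences.
    return sigmoid(w["w_z"] + w["w_d"] * docking_score + total_influence)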

References

    1. Karakoc E, Cherkasov A, Sahinalp SC. Distance Based Algorithms For Small Biomolecule Classification And Structural Similarity Search. Bioinformatics. 2006;22:243. - PubMed
    1. Zheng W, Tropsha A. Novel Variable Selection Quantitative Structure-Property Relationship Approach Based on the k-Nearest-Neighbor Principle. J. Chem. Inf. Comput. Sci. 2000;40:185–194. - PubMed
    1. Cannon EO, Bender A, Palmer DS, Mitchell JBO. Chemoinformatics-Based Classification of Prohibited Substances Employed for Doping in Sport. J. Chem. Inf. Model. 2006;46:2369–2380. - PubMed
    1. Geppert H, Horvath T, Gartner T, Wrobel S, Bajorath J. Support-Vector-Machine-Based Ranking Significantly Improves the Effectiveness of Similarity Searching Using 2D Fingerprints and Multiple Reference Compounds. J. Chem. Inf. Model. 2008;48:742–746. - PubMed
    1. Plewczynski D, Spieser SAH, Koch U. Assessing Different Classification Methods for Virtual Screening. J. Chem. Inf. Model. 2006;46:1098–1106. - PubMed

Publication types

Substances