Influence relevance voting: an accurate and interpretable virtual high throughput screening method

S Joshua Swamidass et al. J Chem Inf Model. 2009 Apr;49(4):756-66. doi: 10.1021/ci8004379.

Abstract

Given activity training data from high-throughput screening (HTS) experiments, virtual high-throughput screening (vHTS) methods aim to predict in silico the activity of untested chemicals. We present a novel method, the Influence Relevance Voter (IRV), specifically tailored for the vHTS task. The IRV is a low-parameter neural network which refines a k-nearest neighbor classifier by nonlinearly combining the influences of a chemical's neighbors in the training set. Influences are decomposed, also nonlinearly, into a relevance component and a vote component. The IRV is benchmarked using the data and rules of two large, open competitions, and its performance is compared to that of other participating methods, as well as to an in-house support vector machine (SVM) method. On these benchmark data sets, the IRV achieves state-of-the-art results, comparable to the SVM in one case and significantly better than the SVM in the other, retrieving three times as many actives in the top 1% of its prediction-sorted list. The IRV presents several other important advantages over SVMs and other methods: (1) the output predictions have a probabilistic semantics; (2) the underlying inferences are interpretable; (3) the training time is very short, on the order of minutes even for very large data sets; (4) the risk of overfitting is minimal, due to the small number of free parameters; and (5) additional information can easily be incorporated into the IRV architecture. Combined with its performance, these qualities make the IRV particularly well suited for vHTS.
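For concreteness, the prediction rule implied by the abstract and by Figure 1 can be written as a small set of equations. The decomposition into a relevance term and a vote term follows the text; the specific functional forms chosen here, a logistic relevance in the neighbor's similarity s_i and rank r_i and a class-dependent vote weight, are illustrative assumptions rather than a quotation from the paper.

z(\mathcal{X}) = \sigma\!\left(w_z + \sum_{i=1}^{k} I_i\right), \qquad I_i = R_i \, V_i,

R_i = \sigma\!\left(w_y + w_s\, s_i + w_r\, r_i\right) \quad \text{(relevance of neighbor } \mathcal{N}_i\text{)},

V_i = \begin{cases} w_1 & \text{if } \mathcal{N}_i \text{ is active} \\ w_0 & \text{if } \mathcal{N}_i \text{ is inactive} \end{cases} \quad \text{(vote)}, \qquad \sigma(u) = \frac{1}{1 + e^{-u}}.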


Figures

Figure 1
The neural network architecture of the IRV, using a notation similar to the plate notation of graphical models. The structure and weights depicted inside the plate are replicated for, and shared with, each neighbor i. Thick lines depict elements of the graph learned from the data. Dotted lines depict portions of the network computed during the preprocessing step, during which the test chemical, 𝒳, is used to retrieve a list of neighbors, {𝒩1,…,𝒩k}, from all the chemicals in the training data. The influences Ii of each neighbor are summed into a logistic output node to produce the final prediction, z(𝒳).
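To make the architecture described in this caption concrete, the following is a minimal NumPy sketch of a single IRV forward pass. The Tanimoto fingerprint similarity, the 20-neighbor default, and the particular relevance/vote parameterization are assumptions made for illustration; only the overall structure (neighbor retrieval during preprocessing, per-neighbor influence, summation into a logistic output node) is taken from the caption and abstract.

import numpy as np

def sigmoid(u):
    """Logistic function used by both the relevance and output nodes."""
    return 1.0 / (1.0 + np.exp(-u))

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprint vectors."""
    both = np.logical_and(fp_a, fp_b).sum()
    either = np.logical_or(fp_a, fp_b).sum()
    return both / either if either else 0.0

def irv_predict(x_fp, train_fps, train_labels, w, k=20):
    """One IRV forward pass for a test chemical X, following Figure 1.

    `w` holds the handful of shared weights (w_z, w_y, w_s, w_r, w_0, w_1);
    the relevance/vote parameterization below is an illustrative assumption,
    not a quotation from the paper.
    """
    # Preprocessing step (dotted lines in Figure 1): retrieve the k most
    # similar training chemicals as neighbors N_1, ..., N_k.
    sims = np.array([tanimoto(x_fp, fp) for fp in train_fps])
    neighbors = np.argsort(-sims)[:k]

    total_influence = 0.0
    for rank, idx in enumerate(neighbors, start=1):
        # Relevance: how strongly this neighbor should be listened to.
        relevance = sigmoid(w["w_y"] + w["w_s"] * sims[idx] + w["w_r"] * rank)
        # Vote: the direction the neighbor pulls, depending on its class.
        vote = w["w_1"] if train_labels[idx] == 1 else w["w_0"]
        total_influence += relevance * vote

    # Influences are summed into a logistic output node to give z(X).
    return sigmoid(w["w_z"] + total_influence)

Because the per-neighbor weights sit inside the plate and are shared across all k neighbors, the number of free parameters stays small regardless of training-set size, which is consistent with the short training times and low overfitting risk claimed in the abstract.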
Figure 2
The influences on an accurately predicted hit from the HIV dataset, obtained in the cross-validation experiment, are displayed as a bar graph. The IRV used to compute these influences was trained using the 20 nearest neighbors. The experimental data both supporting and countering the prediction are readily apparent. The structures of four neighbors, two actives and two inactives, are shown: the compounds on the right are structurally similar and active, while the compounds on the left are structurally similar and inactive.
Figure 3
The 10-fold cross-validated performances of various algorithms on the full HIV data are displayed using (1) a standard ROC curve and (2) a pROC curve. The pROC curve plots the False Positive Rate on a logarithmic axis to emphasize the crucial initial portion of the ROC curve.
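For readers who want to reproduce this style of plot, the sketch below draws a standard ROC curve and its pROC counterpart side by side from synthetic scores; the data and the scikit-learn/matplotlib choices are illustrative, the only substantive point being the logarithmic False Positive Rate axis.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Synthetic labels and scores standing in for prediction-sorted vHTS output.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = y_true * 0.5 + rng.normal(size=1000)

fpr, tpr, _ = roc_curve(y_true, y_score)

fig, (ax_roc, ax_proc) = plt.subplots(1, 2, figsize=(9, 4))
ax_roc.plot(fpr, tpr)
ax_roc.set(title="ROC", xlabel="False Positive Rate", ylabel="True Positive Rate")

# pROC: the same curve, but the FPR axis is logarithmic, which magnifies the
# early-retrieval region that matters most in virtual screening.
ax_proc.semilogx(fpr, tpr)
ax_proc.set(title="pROC", xlabel="False Positive Rate (log scale)",
            ylabel="True Positive Rate")
plt.show()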
Figure 4
The top twelve accurately predicted hits from an IRV cross-validated on the McMaster dataset. Prediction-sorted rankings are in bold and the IRV outputs are in parentheses. Strikingly, the top hits are ranked at the very top of the prediction-sorted list. Although they exhibit a high degree of similarity, the top hits span several different scaffolds.
Figure 5
The 10-fold cross-validated performances on the DHFR data are displayed using (a) a standard ROC curve and (b) a pROC curve. The pROC curve plots the False Positive Rate (FPR) on a logarithmic axis to emphasize the crucial initial portion of the ROC curve.
Figure 6
Examples of IRV architectural extensions. The variable d denotes the docking score derived by fitting each test chemical, 𝒳, into the binding pocket of a known protein target, depicted as a cloud in the figure. The grayed-out rectangle delimits a portion of the network that could be trivially replaced with an arbitrarily complex neural network to generate an IRV with increased modeling power; additional replaceable portions can be imagined. The variable ai denotes the real-valued activity of each neighbor, 𝒩i, in the HTS screen.
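As a purely hypothetical illustration of the extensions described in this caption, the sketch below builds on the earlier forward-pass sketch (reusing its numpy import and the sigmoid and tanimoto helpers), feeds a docking-score term into the output node, and replaces the binary vote with a graded vote proportional to each neighbor's real-valued activity a_i. The weights w_d and w_a and the specific wiring are assumptions, not the paper's design.

def irv_predict_extended(x_fp, docking_score, train_fps, train_activities, w, k=20):
    """Hypothetical IRV extension: a docking score d enters the output node and
    each neighbor's real-valued HTS activity a_i replaces the binary vote."""
    sims = np.array([tanimoto(x_fp, fp) for fp in train_fps])
    neighbors = np.argsort(-sims)[:k]

    total_influence = 0.0
    for rank, idx in enumerate(neighbors, start=1):
        relevance = sigmoid(w["w_y"] + w["w_s"] * sims[idx] + w["w_r"] * rank)
        # Graded vote: scale by the neighbor's measured activity a_i.
        vote = w["w_a"] * train_activities[idx]
        total_influence += relevance * vote

    # Docking score d enters the logistic output node alongside the influences.
    return sigmoid(w["w_z"] + w["w_d"] * docking_score + total_influence)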

References

    1. Karakoc E, Cherkasov A, Sahinalp SC. Distance Based Algorithms For Small Biomolecule Classification And Structural Similarity Search. Bioinformatics. 2006;22:243. - PubMed
    1. Zheng W, Tropsha A. Novel Variable Selection Quantitative Structure-Property Relationship Approach Based on the k-Nearest-Neighbor Principle. J. Chem. Inf. Comput. Sci. 2000;40:185–194. - PubMed
    1. Cannon EO, Bender A, Palmer DS, Mitchell JBO. Chemoinformatics-Based Classification of Prohibited Substances Employed for Doping in Sport. J. Chem. Inf. Model. 2006;46:2369–2380. - PubMed
    1. Geppert H, Horvath T, Gartner T, Wrobel S, Bajorath J. Support-Vector-Machine-Based Ranking Significantly Improves the Effectiveness of Similarity Searching Using 2D Fingerprints and Multiple Reference Compounds. J. Chem. Inf. Model. 2008;48:742–746. - PubMed
    1. Plewczynski D, Spieser SAH, Koch U. Assessing Different Classification Methods for Virtual Screening. J. Chem. Inf. Model. 2006;46:1098–1106. - PubMed

Publication types

Substances