BMC Bioinformatics. 2005 Mar 10;6:50.
doi: 10.1186/1471-2105-6-50.

Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling

Eva Freyhult et al. BMC Bioinformatics. 2005.

Abstract

Background: Proteochemometrics is a new methodology that allows prediction of protein function directly from real interaction measurement data without the need for 3D structure information. Several reported proteochemometric models of ligand-receptor interactions have already yielded significant insights into various forms of bio-molecular interactions. The proteochemometric models are multivariate regression models that predict binding affinity for a particular combination of features of the ligand and protein. Although proteochemometric models have already offered interesting results in various studies, no detailed statistical evaluation of their average predictive power has been performed. In particular, variable subset selection performed to date has always relied on using all available examples, a situation also encountered in microarray gene expression data analysis.

Results: A methodology for an unbiased evaluation of the predictive power of proteochemometric models was implemented and results from applying it to two of the largest proteochemometric data sets yet reported are presented. A double cross-validation loop procedure is used to estimate the expected performance of a given design method. The unbiased performance estimates (P2) obtained for the data sets that we consider confirm that properly designed single proteochemometric models have useful predictive power, but that a standard design based on cross validation may yield models with quite limited performance. The results also show that different commercial software packages employed for the design of proteochemometric models may yield very different and therefore misleading performance estimates. In addition, the differences in the models obtained in the double CV loop indicate that detailed chemical interpretation of a single proteochemometric model is uncertain when data sets are small.

Conclusion: The double CV loop employed offers unbiased performance estimates of a given proteochemometric modelling procedure, making it possible to identify cases where the proteochemometric design does not result in useful predictive models. Chemical interpretations of single proteochemometric models are uncertain and should instead be based on all the models selected in the double CV loop employed here.

Figures

Figure 1
Overview of double loop. An overview of the double loop CV procedure used to obtain the desired unbiased performance estimate P2. In the embedded inner CV loop, the most promising model is selected, namely the one that yields the largest unbiased performance estimate, Q2. In the outer CV loop, external test examples are kept outside the inner loop and are only used to test the most promising model found in the inner loop. Note that the estimate P2 reflects the average performance of the modelling procedure employed in the inner loop and that the estimate is based on many different models designed in the inner loop.
Figure 2
External predictions for amine data set. External predictions for the amine data set, sorted by increasing pKi. The figure shows the experimental pKi values (dashed line) and the mean value of the predicted pKi values (solid line) with a 95% confidence interval (dotted lines). The predictions shown in the figure are from the PLS modelling after variable selection using PLSfilter. The results indicate that the high and low pKi values are hard to predict.
Figure 3
External predictions for alpha data set. External predictions for the alpha data set, sorted by increasing pKi. The figure shows the experimental pKi values (dashed line) and the mean value of the predicted pKi values (solid line) with a 95% confidence interval (dotted lines). The predictions shown in the figure are from the PLS modelling after variable selection using PLSfilter.
Figure 4
Comparison of software. Q2 values obtained using different software for the prediction of affinities based on PLS models, without variable selection, for the amine data set. Between one and ten latent variables were used and SIMCA (dashed line), UNSCRAMBLER (dash-dotted line), GOLPE (dotted line) and MATLAB (solid line) were used to both build the models and evaluate them by computing Q2 values. The SIMCA Q2 values are much higher than the other Q2 values.
Figure 5
Hit rates. A The hit rates for the receptor blocks in the amine data set. The figure shows, for each transmembrane region, the hit rates for the original receptor descriptors, the cross term descriptors and the absolute valued cross term descriptors involving that transmembrane region. B The hit rates for the ligand blocks in the amine data set. The figure shows, for each ligand descriptor block, the hit rates for the original receptor descriptors, the cross term descriptors and the absolute valued cross term descriptors involving that ligand descriptor block. C and D The corresponding hit rates for the alpha data set. The blue bars show the hit rates computed for the PLS models using PLSfilter, the cyan bars show the hit rates computed for the PLS models using corrfilter, the red bars show the hit rates computed for the RR models using PLSfilter, and the yellow bars show the hit rates computed for the RR models using corrfilter.
Figure 6
Detailed contributions to affinity. A Contributions of TM regions in amine GPCRs to the ligand affinity according to the proteochemometric models created using PLS in combination with the variable selection method PLSfilter. The contributions are shown for all 21 receptors; for each receptor, 23 bars corresponding to the 23 ligands are shown in alphabetical order (Amperozide, Clozapine, Fluparoxan, Fluspirilene, GGR218231, Haloperidol, L741626, MDL100,907, ORG5222, Ocaperidone, Olanzapine, Pipamperone, Raclopride, Risperidone, S16924, S18327, S33084, Seroquel, Sertindole, Tiospirone, Yohimbine, Ziprasidone, Zotepine). The blue bars show the average contribution and the height of the green bars shows one standard deviation. The average value and standard deviation were computed using all 500 models designed (100 repeats and five blocks for each repeat). B Contributions of TM regions in α1-adrenoreceptors to the ligand affinity according to the proteochemometric models created using RR in combination with the variable selection method PLSfilter. The contributions are shown for all 18 receptors; for each receptor, 12 bars corresponding to the 12 ligands are shown in numerical order (1–12). The blue bars show the average contribution and the height of the green bars shows one standard deviation. The average value and standard deviation were computed using all 500 models (100 repeats and five blocks for each repeat).
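
The captions above refer to original receptor and ligand descriptors and to cross term descriptors that involve both. As a rough illustration of how such a descriptor vector might be assembled for one ligand-receptor pair, the sketch below forms cross terms as pairwise products of ligand and receptor descriptors. The product form, the function name `pcm_descriptors`, and the block ordering are assumptions made for the example. The absolute valued cross terms are left out because their exact definition is not given here.

```python
import numpy as np

def pcm_descriptors(ligand, receptor):
    """Concatenate ligand, receptor, and cross-term blocks for one pair.

    Cross terms are taken as all pairwise products ligand[i] * receptor[j]
    (an assumption for this sketch).
    """
    ligand = np.asarray(ligand, dtype=float)
    receptor = np.asarray(receptor, dtype=float)
    cross = np.outer(ligand, receptor).ravel()
    return np.concatenate([ligand, receptor, cross])

# Two ligand descriptors and three receptor descriptors give 2 + 3 + 6 = 11 columns.
row = pcm_descriptors([1.0, 2.0], [3.0, 4.0, 5.0])
```

Stacking one such row per measured ligand-receptor pair gives the matrix that a regression model such as PLS or RR would be fitted to, with the measured affinity (for example pKi) as the response.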
