BMC Bioinformatics. 2005 Mar 10;6:50.

doi: 10.1186/1471-2105-6-50.

Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling

Eva Freyhult¹, Peteris Prusis, Maris Lapinsh, Jarl E S Wikberg, Vincent Moulton, Mats G Gustafsson

PMID: 15760465
PMCID: PMC555743
DOI: 10.1186/1471-2105-6-50

Eva Freyhult et al. BMC Bioinformatics. 2005.

Affiliation

¹ The Linnaeus Centre for Bioinformatics, Uppsala University, Box 598, S-751 24 Uppsala, Sweden. Eva.Freyhult@lcb.uu.se

Abstract

Background: Proteochemometrics is a new methodology that allows prediction of protein function directly from real interaction measurement data without the need for 3D structure information. Several reported proteochemometric models of ligand-receptor interactions have already yielded significant insights into various forms of biomolecular interactions. Proteochemometric models are multivariate regression models that predict binding affinity for a particular combination of ligand and protein features. Although such models have offered interesting results in various studies, no detailed statistical evaluation of their average predictive power has been performed. In particular, variable subset selection performed to date has always relied on using all available examples, a situation also encountered in microarray gene expression data analysis.

Results: A methodology for an unbiased evaluation of the predictive power of proteochemometric models was implemented, and results from applying it to two of the largest proteochemometric data sets yet reported are presented. A double cross-validation (CV) loop procedure is used to estimate the expected performance of a given design method. The unbiased performance estimates (P²) obtained for the data sets that we consider confirm that properly designed single proteochemometric models have useful predictive power, but that a standard design based on cross-validation may yield models with quite limited performance. The results also show that different commercial software packages employed for the design of proteochemometric models may yield very different and therefore misleading performance estimates. In addition, the differences in the models obtained in the double CV loop indicate that detailed chemical interpretation of a single proteochemometric model is uncertain when data sets are small.

Conclusion: The double CV loop employed offers unbiased performance estimates of a given proteochemometric modelling procedure, making it possible to identify cases where the proteochemometric design does not result in useful predictive models. Chemical interpretations of single proteochemometric models are uncertain and should instead be based on all the models selected in the double CV loop employed here.
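The double CV loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes ridge regression as a stand-in model, a single tuning parameter in place of descriptor selection, and the conventional definition of Q² and P² as one minus the ratio of the prediction error sum of squares to the total sum of squares. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge regression on centred data; returns (weights, intercept).
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def predict(model, X):
    w, b = model
    return X @ w + b

def kfold_indices(n, k, rng):
    # Shuffle the n example indices and split them into k disjoint blocks.
    return np.array_split(rng.permutation(n), k)

def explained(y_true, y_pred):
    # 1 - (prediction error sum of squares) / (total sum of squares).
    press = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - press / total

def cv_predictions(X, y, lam, folds):
    # Predict each block from a model trained on the remaining blocks.
    pred = np.empty_like(y)
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        pred[test] = predict(ridge_fit(X[train], y[train], lam), X[test])
    return pred

def double_cv(X, y, lambdas, k_outer=5, k_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    pred = np.empty_like(y)
    chosen = []
    for test in kfold_indices(n, k_outer, rng):
        train = np.setdiff1d(np.arange(n), test)
        # Inner loop: model selection sees only the outer training block.
        inner_folds = kfold_indices(len(train), k_inner, rng)
        q2 = [explained(y[train], cv_predictions(X[train], y[train], lam, inner_folds))
              for lam in lambdas]
        best = lambdas[int(np.argmax(q2))]
        chosen.append(best)
        # Outer loop: the held-out block is used once, only for testing.
        pred[test] = predict(ridge_fit(X[train], y[train], best), X[test])
    return explained(y, pred), chosen

# Hypothetical synthetic data: 60 examples, 8 descriptors, 2 of them informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)
lambdas = [0.01, 1.0, 100.0]
p2, chosen = double_cv(X, y, lambdas)
print("P2 estimate:", p2)
print("parameter chosen in each outer fold:", chosen)
```

Because the external test block never enters the inner loop, the returned estimate reflects the whole modelling procedure rather than any single model. The list of chosen parameters gives a rough sense of how much the selected model varies between outer folds.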

Figures

**Figure 1**
**Overview of double loop.** An overview of the double loop CV procedure used to obtain the desired unbiased performance estimate P². In the embedded inner CV loop, the most promising model, the one yielding the largest unbiased performance estimate Q², is selected. In the outer CV loop, external test examples are kept outside the inner loop and are only used to test the most promising model found in the inner loop. Note that the estimate P² reflects the average performance of the modelling procedure employed in the inner loop and that the estimate is based on many different models designed in the inner loop.

**Figure 3**
**External prediction for alpha data set.** External predictions for alpha data set sorted according to growing values of pK_i. The figure shows the experimental pK_i values (dashed line) and the mean value of the predicted pK_i values (solid line) with a 95% confidence interval (dotted lines). The predictions shown in the figure are from the PLS modelling after variable selection using PLSfilter.

**Figure 4**
**Comparison of software.** Q² values obtained using different software for the prediction of affinities based on PLS models, without variable selection, for the amine data set. Between one and ten latent variables were used and SIMCA (dashed line), UNSCRAMBLER (dash-dotted line), GOLPE (dotted line) and MATLAB (solid line) were used to both build the models and evaluate them by computing Q² values. The SIMCA Q² values are much higher than the other Q² values.

**Figure 5**
**Hit rates.** (A) The hit rates for the receptor blocks in the amine data set. The figure shows for each transmembrane region the hit rates for the original receptor descriptors, the cross term descriptors and the absolute valued cross term descriptors involving that transmembrane region. (B) The hit rates for the ligand blocks in the amine data set. The figure shows for each ligand descriptor block the hit rates for the original receptor descriptors, the cross term descriptors and the absolute valued cross term descriptors involving that ligand descriptor block. (C) and (D) The corresponding hit rates for the alpha data set. The blue bars show the hit rates computed for the PLS models using PLSfilter, the cyan bars show the hit rates computed for the PLS models using corrfilter, the red bars show the hit rates computed for the RR models using PLSfilter, and the yellow bars show the hit rates computed for the RR models using corrfilter.

**Figure 6**
**Detailed contributions to affinity.** (A) Contributions of TM regions in amine GPCRs to the ligand affinity according to the proteochemometrics models created using PLS in combination with the variable selection method PLSfilter. The contributions are shown for all 21 receptors; for each receptor, 23 bars corresponding to the 23 ligands are shown (in alphabetical order, i.e., Amperozide, Clozapine, Fluparoxan, Fluspirilene, GGR218231, Haloperidol, L741626, MDL100,907, ORG5222, Ocaperidone, Olanzapine, Pipamperone, Raclopride, Risperidone, S16924, S18327, S33084, Seroquel, Sertindole, Tiospirone, Yohimbine, Ziprasidone, Zotepine). The blue bars show the average contribution and the height of the green bars shows one standard deviation. The average value and standard deviation were computed using all the 500 models designed (100 repeats and five blocks for each repeat). (B) Contributions of TM regions in α₁-adrenoreceptors to the ligand affinity according to the proteochemometrics models created using RR in combination with the variable selection method PLSfilter. The contributions are shown for all 18 receptors; for each receptor, 12 bars corresponding to the 12 ligands are shown (in numerical order 1–12). The blue bars show the average contribution and the height of the green bars shows one standard deviation. The average value and standard deviation were computed using all the 500 models (100 repeats and five blocks for each repeat).

Cited by

Kinome-wide interaction modelling using alignment-based and alignment-independent approaches for kinase description and linear and non-linear data analysis techniques.
Lapins M, Wikberg JE. BMC Bioinformatics. 2010 Jun 22;11:339. doi: 10.1186/1471-2105-11-339. PMID: 20569422. Free PMC article.
Structural and conformational determinants of macrocycle cell permeability.
Over B, Matsson P, Tyrchan C, Artursson P, Doak BC, Foley MA, Hilgendorf C, Johnston SE, Lee MD 4th, Lewis RJ, McCarren P, Muncipinto G, Norinder U, Perry MW, Duvall JR, Kihlberg J. Nat Chem Biol. 2016 Dec;12(12):1065-1074. doi: 10.1038/nchembio.2203. Epub 2016 Oct 17. PMID: 27748751.
Chagas Disease: Perspectives on the Past and Present and Challenges in Drug Discovery.
Mansoldo FRP, Carta F, Angeli A, Cardoso VDS, Supuran CT, Vermelho AB. Molecules. 2020 Nov 23;25(22):5483. doi: 10.3390/molecules25225483. PMID: 33238613. Free PMC article. Review.
The C1C2: a framework for simultaneous model selection and assessment.
Eklund M, Spjuth O, Wikberg JE. BMC Bioinformatics. 2008 Sep 2;9:360. doi: 10.1186/1471-2105-9-360. PMID: 18761753. Free PMC article.
Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation.
Baumann D, Baumann K. J Cheminform. 2014 Nov 26;6(1):47. doi: 10.1186/s13321-014-0047-1. eCollection 2014. PMID: 25506400. Free PMC article.
