Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition
- PMID: 23282103
- PMCID: PMC3521471
- DOI: 10.1186/1471-2105-13-S17-S3
Abstract
Background: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers, such as two-stage support vector machine (SVM) based classifiers, and a number of feature types, such as physicochemical properties and amino acid and dipeptide composition, accompanied by feature selection. A simple and easily interpretable method for predicting protein solubility is desirable as an alternative to these complex SVM-based methods.
Results: This study proposes a novel scoring card method (SCM) that uses dipeptide composition alone to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of the 400 individual dipeptides to be soluble using statistical discrimination between the soluble and insoluble proteins of a training data set. The propensity scores of all dipeptides are then further optimized using an intelligent genetic algorithm. The solubility score of a sequence is the sum of the dipeptide propensity scores weighted by the sequence's dipeptide composition. To evaluate SCM by performance comparisons, four data sets of different sizes and with different degrees of variation in experimental conditions were used. The results show that the simple SCM, with its interpretable dipeptide propensities, has promising performance compared with existing SVM-based ensemble methods that use a number of feature types. Furthermore, the propensities of dipeptides and the solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows that α-helix structures and thermophilic proteins have a high propensity to be soluble.
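The scoring step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the initial propensity is assumed here to be the difference between the mean dipeptide composition of soluble and insoluble training sequences, and the genetic-algorithm optimization of the scores is omitted entirely.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Fraction of each dipeptide among the overlapping residue pairs of seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}


def mean_composition(seqs):
    """Average dipeptide composition over a list of sequences."""
    totals = dict.fromkeys(DIPEPTIDES, 0.0)
    for s in seqs:
        for d, f in dipeptide_composition(s).items():
            totals[d] += f
    n = max(len(seqs), 1)
    return {d: t / n for d, t in totals.items()}


def initial_propensities(soluble, insoluble):
    """Assumed statistic: mean composition in soluble minus insoluble proteins."""
    sol = mean_composition(soluble)
    ins = mean_composition(insoluble)
    return {d: sol[d] - ins[d] for d in DIPEPTIDES}


def solubility_score(seq, propensities):
    """Sum of dipeptide propensities weighted by the sequence's composition."""
    comp = dipeptide_composition(seq)
    return sum(propensities[d] * comp[d] for d in DIPEPTIDES)


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    scores = initial_propensities(["AAAA"], ["CCCC"])
    print(solubility_score("AAAC", scores))
```

A sequence would then be called soluble when its score exceeds a threshold chosen on the training set; how that threshold is set is not stated in the abstract and is left out of the sketch.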
Conclusions: The propensities of individual dipeptides to be soluble vary for proteins under different experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role.
Availability: The datasets used, the source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.