BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S3.
doi: 10.1186/1471-2105-13-S17-S3. Epub 2012 Dec 13.

Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition

Hui-Ling Huang et al. BMC Bioinformatics. 2012.

Abstract

Background: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers, such as two-stage support vector machine (SVM) based classifiers, together with several feature types, such as physicochemical properties and amino acid and dipeptide composition, combined with feature selection. A simple and easily interpretable method for predicting protein solubility is desirable as an alternative to these complex SVM-based methods.

Results: This study proposes a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of the 400 individual dipeptides to be soluble using statistical discrimination between the soluble and insoluble proteins of a training data set. The propensity scores of all dipeptides are then further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets of different sizes and with different degrees of variation in experimental conditions were used. The results show that the simple SCM, with its interpretable dipeptide propensities, has promising performance compared with existing SVM-based ensemble methods that use several feature types. Furthermore, the propensities of dipeptides and the solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows a high propensity of α-helix structure and thermophilic proteins to be soluble.
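
The scoring steps described above can be illustrated with a minimal Python sketch. It is one plausible reading of the abstract, not the authors' implementation: the initial propensity of each dipeptide is assumed here to be the difference between its mean composition in soluble and in insoluble training sequences, rescaled to the range 0 to 1000 (the scale is an assumption), and the sequence score is the composition-weighted sum of the propensities. The genetic algorithm optimization step is omitted, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Fraction of each of the 400 dipeptides among the overlapping pairs of seq."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DIPEPTIDES)  # ignores non-standard residues
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: counts[d] / total for d in DIPEPTIDES}


def mean_composition(seqs):
    """Average dipeptide composition over a list of sequences."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}


def initial_scoring_card(soluble, insoluble, scale=1000.0):
    """Assumed statistical step: soluble-minus-insoluble composition, min-max scaled."""
    sol = mean_composition(soluble)
    ins = mean_composition(insoluble)
    diff = {d: sol[d] - ins[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    if hi == lo:  # no discrimination at all: give every dipeptide the midpoint
        return {d: scale / 2 for d in DIPEPTIDES}
    return {d: scale * (diff[d] - lo) / (hi - lo) for d in DIPEPTIDES}


def solubility_score(seq, card):
    """Weighted sum of propensity scores, weighted by dipeptide composition."""
    comp = dipeptide_composition(seq)
    return sum(comp[d] * card[d] for d in DIPEPTIDES)


def predict_soluble(seq, card, threshold):
    """Classify as soluble when the score exceeds a threshold chosen on training data."""
    return solubility_score(seq, card) > threshold


# Toy training data, invented purely for illustration.
soluble = ["AKAKAKEE", "KEKEAKAK"]
insoluble = ["LLVVLLII", "VLVLILIL"]
card = initial_scoring_card(soluble, insoluble)
```

Because the score is a composition-weighted average of per-dipeptide propensities, it stays on the same scale as the card itself, which is what makes individual dipeptide scores directly interpretable. In practice the threshold passed to predict_soluble would be tuned on the training set rather than fixed in advance.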

Conclusions: The propensities of individual dipeptides to be soluble vary for proteins under different experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the scoring card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role.

Availability: The datasets used, the source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.

Figures

Figure 1. The system flowchart of the proposed scoring matrix method.
Figure 2. Heat map of the optimized solubility scoring matrix of dipeptides.
Figure 3. The histogram of sequence solubility scores in the test data set. (a) Statistical SSM without optimization; (b) optimized SSM.
Figure 4. The test accuracies for various sizes of uncertainty regions.
Figure 5. The correlation coefficient R = 0.51 between the optimized SSM of amino acids and the α-helical propensity.
Figure 6. The correlation coefficient R = 0.76 between the optimized SSM of amino acids and the property KUMS000103, the distribution of residues in the α-helices in thermophilic proteins.
Figure 7. Distribution of dipeptide scores on the positions of two typical sequences. The protein 1FSZ_A, with length 372, has a solubility score of 499.92 and is predicted as a soluble protein; Q5FZH9, with length 352, has a score of 383.73 and is predicted as an insoluble protein. The threshold value is 463.79.
