Mol Biosyst. 2013 Apr 5;9(4):806-11.
doi: 10.1039/c3mb70033j. Epub 2013 Feb 25.

Discrimination of soluble and aggregation-prone proteins based on sequence information

Yaping Fang et al. Mol Biosyst.

Abstract

Understanding the factors governing protein solubility is key to grasping its underlying mechanisms and may provide insight into protein aggregation and misfolding-related diseases such as Alzheimer's disease. In this work, we attempt to identify factors important to protein solubility using feature selection. First, we calculate 1438 features, including physicochemical properties and statistics, for each protein. The Random Forest algorithm is then used to select the most informative minimal subset of features based on predictive performance. A predictive model is built on the 17 selected features. Our model outperforms previous models, with a sensitivity of 0.82, specificity of 0.85, accuracy (ACC) of 0.84, area under the curve (AUC) of 0.91, and Matthews correlation coefficient (MCC) of 0.67. Furthermore, a model trained on a redundancy-reduced dataset (sequence identity ≤ 30%) achieves the same performance as the model without redundancy reduction. Our results provide not only a reliable model for predicting protein solubility but also a list of features important to protein solubility. The predictive model is implemented as a freely available web application at .
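For readers less familiar with the reported metrics, the following is a minimal sketch of how sensitivity, specificity, ACC, and MCC follow from a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical: the counts are not taken from the paper and were chosen only so that sensitivity and specificity equal the reported 0.82 and 0.85 on an assumed 100 positives and 100 negatives. Which class the authors treat as positive is not stated in the abstract. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Matthews correlation coefficient; defined as 0 when any marginal is empty
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (assumption, not data from the paper)
metrics = binary_metrics(tp=82, fn=18, tn=85, fp=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when class sizes may differ.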

Figures

Figure 1. Variable importance of F17 features
The prefix x denotes the normalized absolute count values and the prefix c denotes the absolute count values for each amino acid. The prefix num denotes the count of a specific atom. The remaining features are physicochemical properties from the AAindex database.
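As an illustration of the count-based features named in the caption, here is a minimal sketch of computing c-prefixed (absolute) and x-prefixed (normalized) amino acid counts from a sequence. The caption does not define the normalization, so dividing by the number of standard residues is an assumption, and the function name `composition_features` and the exact key format (`cK`, `xK`) are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return absolute (c*) and length-normalized (x*) counts per amino acid."""
    counts = Counter(sequence.upper())
    # Assumption: normalize by the number of standard residues in the sequence
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    features = {}
    for aa in AMINO_ACIDS:
        features["c" + aa] = counts[aa]
        features["x" + aa] = counts[aa] / total if total else 0.0
    return features


# Toy sequence for illustration only
feats = composition_features("MKKLLE")
print(feats["cK"], feats["xK"])
```

Non-standard residue codes are ignored in both the counts and the normalization total under this sketch.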
