BMC Biol. 2023 Jan 24;21(1):12.
doi: 10.1186/s12915-023-01510-8.

Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE


Chao Wang et al. BMC Biol.

Abstract

Background: Protein solubility is a precondition for efficient heterologous protein expression, which underlies most industrial applications, and for functional interpretation in basic research. However, the recurrent formation of inclusion bodies remains a major roadblock in protein science and industry: only about a quarter of proteins can be successfully expressed in soluble form. Although numerous solubility prediction models have been developed over time, their performance remains unsatisfactory given the rapid growth in available protein sequences. Hence, it is imperative to develop novel and highly accurate predictors that enable the prioritization of highly soluble proteins and reduce the cost of experimental work.

Results: In this study, we developed a novel tool, DeepSoluE, which predicts protein solubility using a long short-term memory (LSTM) network with hybrid features composed of physicochemical patterns and a distributed representation of amino acids. Comparisons showed that the proposed model achieved more accurate and more balanced performance than existing tools. Furthermore, we explored the specific features that have a dominant impact on model performance, as well as their interaction effects.

Conclusions: DeepSoluE is suitable for the prediction of protein solubility in E. coli; it serves as a bioinformatics tool for prescreening potentially soluble targets to reduce the cost of wet-lab experiments. The webserver is freely accessible at http://lab.malab.cn/~wangchao/softs/DeepSoluE/.

Keywords: Feature embedding; Interpretation; Machine learning; Protein solubility.
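
To make the idea of sequence-derived physicochemical features concrete, the following is a minimal sketch, not the DeepSoluE feature set. It assumes two generic descriptors that the abstract does not name: amino acid composition and the mean Kyte-Doolittle hydropathy (GRAVY). The function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors align.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (an illustrative choice of descriptor,
# not taken from the abstract).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(seq):
    """Return the fraction of each standard residue (20 values summing to 1)."""
    counts = Counter(c for c in seq.upper() if c in KYTE_DOOLITTLE)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def gravy(seq):
    """Return the mean hydropathy over the standard residues of the sequence."""
    values = [KYTE_DOOLITTLE[c] for c in seq.upper() if c in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


def physicochemical_features(seq):
    """Concatenate composition (20 values) and GRAVY (1 value) into one vector."""
    return amino_acid_composition(seq) + [gravy(seq)]


if __name__ == "__main__":
    features = physicochemical_features("MKKLLVAGTE")
    print(len(features), round(features[-1], 3))
```

A vector like this could be fed to any downstream classifier; in a hybrid setup it would be combined with a learned embedding of the sequence, which this sketch does not attempt.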


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Comparison of different feature selection methods. A–E Metric values and feature dimensions for five feature selection strategies. F Feature dimensions of the optimal feature subsets, selected by AUC, for the five feature optimization methods. GA: genetic algorithm.
Fig. 2
Performance comparison of DeepSoluE and 11 conventional machine learning methods
Fig. 3
Feature contribution and dependency analysis. A The 20 most important features. B Summary plot of SHAP values. For each feature, one point corresponds to a single sample. The SHAP value along the x-axis represents the impact that the feature had on the model's output for that specific sample. Features positioned higher in the plot are more important to the model. C–J SHAP dependence plots. These plots show the effect that a single feature has on the model predictions and the interaction effects across features. Each point corresponds to an individual sample, the value along the x-axis corresponds to the feature value, and the color represents the value of the interacting feature
Fig. 4
The DeepSoluE workflow. A Physicochemical feature encoding, feature optimization, and distributed representation of protein sequences. B Neural network architectures of DeepSoluE; FC, fully connected layer
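
The SHAP values described in the Fig. 3 caption are per-sample feature attributions based on Shapley values. As a rough illustration of what such a value measures, the sketch below computes exact Shapley values for a tiny toy function by enumerating feature subsets, replacing "absent" features with a single baseline value. This is an assumption-laden simplification: it does not use the SHAP library or the DeepSoluE model, and `toy_model`, `shapley_values`, and the baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    A feature that is "absent" from a coalition takes its baseline value.
    Cost grows as 2^n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(subset):
        # Evaluate the model with only the features in `subset` set to x.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical score with one interaction term between features 0 and 2.
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


if __name__ == "__main__":
    x = [1.0, 1.0, 1.0]
    baseline = [0.0, 0.0, 0.0]
    phi = shapley_values(toy_model, x, baseline)
    # The attributions sum to model(x) - model(baseline).
    print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

In this toy case the interaction term is split evenly between the two features involved, which is the kind of cross-feature effect that dependence plots are meant to expose.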
