Bioinformatics. 2020 Sep 15;36(18):4691-4698.
doi: 10.1093/bioinformatics/btaa578.

Solubility-Weighted Index: fast and accurate prediction of protein solubility

Bikash K Bhandari et al. Bioinformatics. 2020.

Abstract

Motivation: Recombinant protein production is a widely used technique in the biotechnology and biomedical industries, yet only a quarter of target proteins are soluble and can therefore be purified.

Results: We have discovered that global structural flexibility, which can be modeled by normalized B-factors, accurately predicts the solubility of 12 216 recombinant proteins expressed in Escherichia coli. We have optimized these B-factors, and derived a new set of values for solubility scoring that further improves prediction accuracy. We call this new predictor the 'Solubility-Weighted Index' (SWI). Importantly, SWI outperforms many existing protein solubility prediction tools. Furthermore, we have developed 'SoDoPE' (Soluble Domain for Protein Expression), a web interface that allows users to choose a protein region of interest for predicting and maximizing both protein expression and solubility.
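
The scoring idea behind SWI can be illustrated with a minimal sketch. It assumes that a sequence composition score is the arithmetic mean of per-residue weights over the sequence. The weights below are hypothetical placeholders, not the published SWI values, and `composition_score` is an illustrative name rather than a function from the SoDoPE code base.

```python
# Hypothetical per-residue weights for illustration only.
# These are NOT the published SWI weights.
PLACEHOLDER_WEIGHTS = {aa: 0.5 for aa in "ACDEFGHIKLMNPQRSTVWY"}
PLACEHOLDER_WEIGHTS.update({"E": 0.9, "K": 0.8, "W": 0.2})


def composition_score(sequence, weights=PLACEHOLDER_WEIGHTS):
    """Return the mean per-residue weight of a protein sequence.

    Residues without a weight (for example 'X') are skipped.
    """
    values = [weights[aa] for aa in sequence.upper() if aa in weights]
    if not values:
        raise ValueError("sequence contains no scorable residues")
    return sum(values) / len(values)
```

Under this assumption a higher score would indicate a sequence enriched in residues with higher weights. Scoring takes a single pass over the sequence, which is consistent with the speed claimed for a composition-based predictor.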

Availability and implementation: The SoDoPE web server and source code are freely available at https://tisigner.com/sodope and https://github.com/Gardner-BinfLab/TISIGNER-ReactJS, respectively. The code and data for reproducing our analysis can be found at https://github.com/Gardner-BinfLab/SoDoPE_paper_2020.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Global structural flexibility outperforms other standard protein sequence properties in protein solubility prediction. ROC analysis of the standard protein sequence features for predicting the solubility of 12 216 recombinant proteins expressed in E. coli (the PSI: Biology dataset). The ROC curves are shown in two separate panels for clarity. AUC scores (perfect = 1.00, random = 0.50) are shown in parentheses. Dashed lines denote the performance of random classifiers. See also Supplementary Figure S2 and Table S2. AUC, Area Under the ROC Curve; GRAVY, Grand Average of Hydropathy; PSI: Biology, Protein Structure Initiative: Biology; ROC, Receiver Operating Characteristic.
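
The AUC scale used in this caption (perfect = 1.00, random = 0.50) can be made concrete with a small sketch. It uses the rank interpretation of AUC: the probability that a randomly chosen soluble protein receives a higher score than a randomly chosen insoluble one, with ties counted as one half. The function name and the label coding (1 = soluble, 0 = insoluble) are assumptions for illustration, and the pairwise loop is written for clarity rather than for speed on a dataset of this size.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (soluble, insoluble) pairs ranked correctly.

    labels: iterable of 1 (soluble) or 0 (insoluble); assumed coding.
    scores: iterable of numeric prediction scores, same length as labels.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A feature whose scores carry no information about the labels would land near 0.50 under this definition, which is the dashed-line baseline in the figure.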
Fig. 2.
Derivation of the SWI. (A) Flow chart shows an iterative refinement of the weights of amino acid residues for solubility prediction. Each cross-validation step used separate sequence similarity clusters for training and testing. Furthermore, bootstrapping was used to resample each training set, avoiding training and testing on similar sequences. The solubility scores of protein sequences were calculated using a sequence composition scoring approach. These scores were used to compute the AUC scores for training and test datasets. (B) Training and test performance of solubility prediction using optimized weights for 20 amino acid residues in a 10-fold cross-validation (mean AUC ± standard deviation). Related data and figures are available as Supplementary Table S3 and Figures S4 and S5. (C) Comparison between the 20 initial and final weights for amino acid residues. The final weights W = V_i, 1 ≤ i ≤ 10, were used to calculate the solubility score of a protein sequence (SWI) in the four subsequent analyses. Filled circles, which represent amino acid residues, are colored by hydrophobicity (Kyte and Doolittle, 1982). Solid black circles denote the aromatic residues phenylalanine (F), tyrosine (Y) and tryptophan (W). The dotted diagonal line represents no change in weight. See also Supplementary Table S4. AUC, Area Under the ROC Curve; ROC, Receiver Operating Characteristic. (Color version of this figure is available at Bioinformatics online.)
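
The cluster-separated cross-validation described in panel A can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes each sequence already carries a similarity-cluster identifier, it assigns whole clusters to folds with a greedy balancing rule chosen here for simplicity, and it leaves out the bootstrapping and weight-refinement steps. `cluster_kfold` and its parameters are hypothetical names.

```python
from collections import defaultdict


def cluster_kfold(cluster_ids, n_folds=10):
    """Yield (train_idx, test_idx) pairs with no cluster shared between them.

    cluster_ids: one similarity-cluster identifier per sequence (assumed input).
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Place whole clusters, largest first, into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        min(folds, key=len).extend(members[cid])

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test
```

Keeping each cluster on one side of the split is what prevents near-identical sequences from appearing in both training and test data. In a scikit-learn workflow, `sklearn.model_selection.GroupKFold` provides the same kind of group-aware splitting.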
Fig. 3.
SWI strongly correlates with protein solubility. (A) Correlation matrix plot of the solubility of recombinant proteins expressed in E. coli and their standard protein sequence properties and SWI. These recombinant proteins are the PSI: Biology targets (N = 12 216) with a binary solubility status of ‘Protein_Soluble’ or ‘Tested_Not_Soluble’. Related data are available as Supplementary Table S5. (B) Correlation matrix plot of the solubility percentages of E. coli proteins and their standard protein sequence properties and SWI. The solubility percentages were previously determined using an E. coli cell-free system (eSOL, N = 3198). Related data are available as Supplementary Table S6. GRAVY, Grand Average of Hydropathy; PSI: Biology, Protein Structure Initiative: Biology; Rs, Spearman’s rho; SWI, Solubility-Weighted Index.
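
For readers unfamiliar with the Rs statistic in this caption, Spearman's rho can be sketched as the Pearson correlation of rank-transformed values, with tied values sharing their average rank. The helper names below are illustrative and the code is a plain-Python sketch, not the analysis code used for the figure.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)
```

Because only ranks enter the calculation, the statistic captures any monotonic relationship, which suits a comparison between a sequence-derived score and solubility percentages.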
Fig. 4.
SWI outperforms existing protein solubility prediction tools. (A) Prediction accuracy of solubility prediction tools using the above cross-validation sets (Fig. 2A). For SWI, the test AUC scores were calculated from a 10-fold cross-validation (i.e. a boxplot representation of Fig. 2B). For other tools, no cross-validations were done as the AUC scores were calculated directly from the individual subsets used for cross-validation. CamSol and ccSOL omics are only available as web servers (no fill colors). (B) Wall time of protein solubility prediction tools per sequence (log scale). All command line tools were run three times using 10 sequences selected from the PSI: Biology and eSOL datasets. Related data are available as Supplementary Table S7. AUC, Area Under the ROC Curve; PSI: Biology, Protein Structure Initiative: Biology; ROC, Receiver Operating Characteristic; SWI, Solubility-Weighted Index; s, seconds. (Color version of this figure is available at Bioinformatics online.)
