J Struct Funct Genomics. 2010 Mar;11(1):71-80.
doi: 10.1007/s10969-010-9080-0. Epub 2010 Feb 23.

Predicting protein crystallization propensity from protein sequence

György Babnigg et al. J Struct Funct Genomics. 2010 Mar.

Abstract

The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from the primary sequence correlate with a protein's propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties was computed for over 1,300 proteins that expressed well but were insoluble, and for approximately 720 unique proteins that resulted in X-ray structures. The correlation of the protein's iso-electric point (pI) and grand average of hydropathy (GRAVY) with crystallizability was analyzed for full-length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses, we identified a set of attributes that correlate with a protein's propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on them. We have created applications to analyze query sequences and provide optimal boundary information for them, and to visualize the data. These tools are available via the web site http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor .
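
As an illustration of one of the sequence-derived properties named above, GRAVY is conventionally the mean Kyte-Doolittle hydropathy value over all residues of a sequence. The sketch below assumes that standard definition; the abstract does not state which scale or implementation the authors used, and the function name `gravy` is hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
# Assumption: this is the scale behind GRAVY here; the abstract does not name it.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    # Reject residues outside the 20 standard amino acids instead of guessing.
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

A positive value indicates a more hydrophobic sequence on average and a negative value a more hydrophilic one. The pI calculation is omitted because it depends on a choice of pKa values that the abstract does not specify.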

Figures

Figure 1
Iso-electric point and GRAVY distribution of MCSG targets. The pI (a) and hydrophobicity (b) of MCSG-INSOLUBLE (dark) and MCSG-PDB (light) targets were calculated and binned according to the OB-score matrix bin values [12]
Figure 2
Comparison of the Z-score matrix derived from MCSG targets and the OB-score matrix. A web application was built for binning two-dimensional data (a) and for the display of two-dimensional matrices (b). The OB-score distribution is shown for pI 3–13 and GRAVY −1.4 to 1.4 as reported earlier by Overton and Barton [12]. The selected insoluble (dark) and ‘In PDB’ targets (light) were binned according to the MCSG Z-score (c) and the OB-score (d)
Figure 3
Subregion design. A web application was built for calculating the pI, GRAVY, and the corresponding MCSG derived Z-score for all possible subregions of an input sequence. The resulting matrix is displayed for pI (a), GRAVY (b) and the MCSG Z-score (c) using a predefined color scale
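
The subregion scan described in this caption can be sketched as follows. This is a minimal illustration, not the authors' web application: it covers GRAVY only (pI and the MCSG Z-score are not reproduced), assumes the Kyte-Doolittle scale, and uses the hypothetical names `subregion_gravy` and `min_len`. Prefix sums let each subregion be scored in constant time.

```python
from itertools import accumulate

# Kyte-Doolittle hydropathy values (assumed scale for GRAVY).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def subregion_gravy(sequence: str, min_len: int = 1) -> dict:
    """Map every subregion (start, end) to its GRAVY value.

    Coordinates are 0-based with an exclusive end, so (0, len(sequence))
    is the full-length construct.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    seq = sequence.upper()
    # prefix[i] holds the hydropathy sum of the first i residues.
    prefix = [0.0] + list(accumulate(KD[aa] for aa in seq))
    n = len(seq)
    return {
        (start, end): (prefix[end] - prefix[start]) / (end - start)
        for start in range(n)
        for end in range(start + min_len, n + 1)
    }
```

The resulting dictionary corresponds to the upper triangle of a start-by-end matrix, which is the shape a display like panel (b) would need.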
Figure 4
Constructs of a two-component sensor histidine kinase. Several constructs of a two-component sensor histidine kinase from Bacillus subtilis subsp. subtilis str. 168 (gi|221310851) were designed and tested in the pipeline. The expression and solubility data of 3 selected subregions are shown. The bottom schematic depicts the topology of the protein as predicted by Phobius [31]
Figure 5
The Support Vector Machine approach. The amino acid attributes selected above were used to calculate protein sequence properties for a balanced set of insoluble targets and targets deposited into the PDB. A repeated random sub-sampling validation was performed using a 60/40 split of the non-redundant data: the SVM was trained with 60% of the dataset and the remaining 40% was used for testing. The true positive and false positive rates are shown in (a), with an AUC-ROC of 68%. The MCC was calculated and displayed along with the accuracy and average accuracy in relation to the true positive rate (b)
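
The validation scheme in this caption can be sketched with the standard library alone. The code below shows only the repeated 60/40 random split and the Matthews correlation coefficient (MCC); it does not train an SVM, and the names `random_splits`, `mcc`, `n_repeats`, and `seed` are hypothetical. Returning 0 when the MCC denominator is zero is a common convention and an assumption here, since the caption does not say how that case was handled.

```python
import math
import random


def random_splits(n_items: int, n_repeats: int,
                  train_fraction: float = 0.6, seed: int = 0):
    """Yield (train_indices, test_indices) for repeated random sub-sampling."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = round(n_items * train_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)  # a fresh, independent shuffle for every repeat
        yield indices[:n_train], indices[n_train:]


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denominator
```

In use, each split would train a classifier on the training indices, count true and false positives and negatives on the test indices, and average the resulting MCC and accuracy over the repeats.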

References

    1. Gao X, et al. High-throughput limited proteolysis/mass spectrometry for protein domain elucidation. J Struct Funct Genomics. 2005;6(2–3):129–134.
    2. Koth CM, et al. Use of limited proteolysis to identify protein domains suitable for structural analysis. Methods Enzymol. 2003;368:77–84.
    3. Dong A, et al. In situ proteolysis for protein crystallization and structure determination. Nat Methods. 2007;4(12):1019–1021.
    4. Goldschmidt L, et al. Toward rational protein crystallization: a web server for the design of crystallizable protein variants. Protein Sci. 2007;16(8):1569–1576.
    5. Kim Y, et al. Large-scale evaluation of protein reductive methylation for improving protein crystallization. Nat Methods. 2008;5(10):853–854.