SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics

P Bertone¹, Y Kluger, N Lan, D Zheng, D Christendat, A Yee, A M Edwards, C H Arrowsmith, G T Montelione, M Gerstein

PMID: 11433035
PMCID: PMC55760
DOI: 10.1093/nar/29.13.2884

P Bertone et al. Nucleic Acids Res. 2001.

Nucleic Acids Res. 2001 Jul 1;29(13):2884-98.

Affiliation

¹ Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA.

Abstract

High-throughput structural proteomics is expected to generate considerable amounts of data on the progress of structure determination for many proteins. For each protein this includes information about cloning, expression, purification, biophysical characterization and structure determination via NMR spectroscopy or X-ray crystallography. It will be essential to develop specifications and ontologies for standardizing this information to make it amenable to retrospective analysis. To this end we created the SPINE database and analysis system for the Northeast Structural Genomics Consortium. SPINE, which is available at bioinfo.mbb.yale.edu/nesg or nesg.org, is specifically designed to enable distributed scientific collaboration via the Internet. It was designed not just as an information repository but as an active vehicle to standardize proteomics data in a form that would enable systematic data mining. The system features an intuitive user interface for interactive retrieval and modification of expression construct data, query forms designed to track global project progress and external links to many other resources. Currently the database contains experimental data on 985 constructs, of which 740 are drawn from Methanobacterium thermoautotrophicum, 123 from Saccharomyces cerevisiae, 93 from Caenorhabditis elegans and the remainder from other organisms. We developed a comprehensive set of data mining features for each protein, including several related to experimental progress (e.g. expression level, solubility and crystallization) and 42 based on the underlying protein sequence (e.g. amino acid composition, secondary structure and occurrence of low complexity regions). We demonstrate in detail the application of a particular machine learning approach, decision trees, to the tasks of predicting a protein's solubility and propensity to crystallize based on sequence features. 
We are able to extract a number of key rules from our trees, in particular that soluble proteins tend to have significantly more acidic residues and fewer hydrophobic stretches than insoluble ones. One of the characteristics of proteomics data sets, currently and in the foreseeable future, is their intermediate size (approximately 500-5000 data points). This creates a number of issues in relation to error estimation. Initially we estimate the overall error in our trees based on standard cross-validation. However, this leaves out a significant fraction of the data in model construction and does not give error estimates on individual rules. Therefore, we present alternative methods to estimate the error in particular rules.
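The standard cross-validation estimate mentioned in the abstract can be sketched as follows. This is an illustration only, not the authors' code: the single-threshold "stump" classifier stands in for a full decision tree, and the function names `fit_stump` and `cross_val_error` are assumptions made for this example.

```python
import random

def fit_stump(xs, ys):
    """Choose the threshold on one feature that minimizes training errors.

    Hypothetical toy classifier: predicts class 1 when x > threshold.
    """
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def cross_val_error(xs, ys, k=5, seed=0):
    """Estimate the error rate by k-fold cross-validation.

    Each fold is held out once; the model is fitted on the other folds
    and scored on the held-out fold.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        t = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        errors += sum((xs[i] > t) != bool(ys[i]) for i in held_out)
    return errors / len(xs)
```

Each model here is fitted on only (k-1)/k of the data, and the result is a single overall error rate rather than a per-rule estimate, which are the two limitations the abstract points out.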

Figures

**Figure 1**
Global project summary (A), statistics display (B) and database home page (C). The summary table can be dynamically reconfigured to present subsets of database entries, selected based on a number of simple parameters such as the target genome the protein originates from or the institution submitting the entries. An additional parameter, labeled ‘Attribute’, is used to narrow the search to entries whose experimental progress corresponds to a particular chronological stage in the table. For example, entries can be selected with an attribute of ‘secondary structure’, which will retrieve all constructs having secondary structure data derived through various biophysical characterization methods.

**Figure 2**
(A) Relationships between database system components. (B) Software module dependencies. The system was developed using the mySQL database engine for the Linux platform, in conjunction with two programming languages to facilitate low level database interaction and development of the user interface software: Perl 5.005 with the Perl Database Interface (DBI) module and the PHP 3.0 hypertext preprocessor. While syntactically similar, each language features distinct capabilities. Because the PHP interpreter is integrated as an Apache web server module, execution of PHP programs is generally faster than that of Perl-based CGI programs. This makes PHP well suited to interactive systems where timely server responses are a priority. While syntactically straightforward, the PHP language does not offer the extensive programming flexibility of Perl5. The core of the user interface system was therefore developed in PHP, while auxiliary components requiring more sophisticated functionality were implemented in Perl.

**Figure 3**
Core schema for the expanded database. Relational tables capture data for target proteins, their related expression constructs and separate sets of experimental parameters for expression, purification, X-ray crystallography, NMR and biophysical characterization. Additionally, a number of features have been developed to record laboratory management and transaction information (tables not shown).

**Figure 4**
Overwrite protection during the creation of new database records. The first step in creating a database record is assigning an identifier to the new entry. The identifier consists of three parts: a character to represent the target organism, a second character to indicate the institution from which the entry originates and a unique alphanumeric character string. When the entry identifier is selected the character string component may be chosen by the investigator if a proprietary nomenclature scheme is preferred; otherwise it can be automatically assigned by the system. In the latter case the unique identifier is the next available integer following the combination of target organism and institution codes. Whether the character string component is selected by the user or generated by the system, new construct identifiers are examined by the software and guaranteed not to conflict with those of existing entries, protecting against the accidental overwriting of data. Once a valid identifier has been assigned to the new database record the user may input relevant experimental parameter values using the construct entry form. Database records may be recalled and updated in two ways: by pressing the edit button available on its associated display page or by entering an expression construct identifier directly into a form accessible from the main database web interface. Once a record has been selected all of its existing field values are displayed in the construct editor, which shares a layout similar to the entry form. Users are then able to enter additional data and/or edit the current values associated with the construct and store the updated record in the database.
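The identifier scheme and collision check described in this caption might look like the sketch below. It is an assumption-laden illustration, not SPINE's implementation: the function name `assign_identifier`, the example codes, and the use of an in-memory set in place of a database lookup are all hypothetical.

```python
def assign_identifier(existing, organism, institution, suffix=None):
    """Return a new construct identifier that does not collide with `existing`.

    organism, institution: single-character codes (example values are made up).
    suffix: optional investigator-chosen string; when None, the next free
    integer for this organism/institution prefix is assigned.
    """
    prefix = organism + institution
    if suffix is not None:
        candidate = prefix + suffix
        if candidate in existing:
            # Refuse rather than overwrite an existing record.
            raise ValueError(f"identifier {candidate!r} already exists")
        return candidate
    # Collect the integer suffixes already used under this prefix.
    used = [
        int(ident[len(prefix):])
        for ident in existing
        if ident.startswith(prefix) and ident[len(prefix):].isdecimal()
    ]
    return prefix + str(max(used, default=0) + 1)
```

For example, with existing identifiers `{"MT1", "MT2"}`, calling `assign_identifier(existing, "M", "T")` yields `"MT3"`, while requesting the suffix `"2"` raises an error instead of overwriting the stored record.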

**Figure 5**
Database searching and record retrieval. Users can construct complex Boolean searches on a number of database key fields with an intuitive form (A); the form elements are then parsed internally and an SQL query is created based on the values of the form elements and executed against the database. The search results are then summarized in a table, displaying a user-selectable number of entries per page (B). The query terms also appear above the table in a pseudo-English format, to assist in performing effective searches. Selecting an entry from the table displays the expression construct record in a separate web page (C), which contains all the database fields associated with the record, in addition to a number of links to external resources (D).
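The translation of form elements into an SQL query, with the pseudo-English echo of the query terms, can be sketched as below. The table name `construct`, the field names, and the sample rows are hypothetical and chosen only for illustration; SQLite stands in for the MySQL engine so that the example is self-contained.

```python
import sqlite3

# Hypothetical key fields and connectives; the real form fields are not listed in the caption.
ALLOWED_FIELDS = {"organism", "institution", "soluble"}
ALLOWED_JOINS = {"AND", "OR"}

def build_query(terms):
    """terms: list of (connective, field, value); the first connective is ignored.

    Returns (sql, params). Field names are checked against a whitelist and
    values are bound as parameters rather than pasted into the SQL string.
    """
    clauses, params = [], []
    for i, (join, field, value) in enumerate(terms):
        if field not in ALLOWED_FIELDS or join not in ALLOWED_JOINS:
            raise ValueError(f"unsupported term: {join} {field}")
        clauses.append(f"{field} = ?" if i == 0 else f"{join} {field} = ?")
        params.append(value)
    where = " ".join(clauses) if clauses else "1"
    return f"SELECT id FROM construct WHERE {where} ORDER BY id", params

def describe(terms):
    """Render the query terms in a pseudo-English form for display."""
    parts = []
    for i, (join, field, value) in enumerate(terms):
        text = f"{field} is {value!r}"
        parts.append(text if i == 0 else f"{join.lower()} {text}")
    return " ".join(parts)

def demo_db():
    """Build a small in-memory table of made-up construct records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE construct "
        "(id TEXT PRIMARY KEY, organism TEXT, institution TEXT, soluble INTEGER)"
    )
    conn.executemany(
        "INSERT INTO construct VALUES (?, ?, ?, ?)",
        [
            ("MT1", "M. thermoautotrophicum", "T", 1),
            ("MT2", "M. thermoautotrophicum", "T", 0),
            ("YT1", "S. cerevisiae", "T", 1),
            ("CY1", "C. elegans", "Y", 0),
        ],
    )
    return conn

def search(conn, terms):
    """Run the generated query and return the matching identifiers."""
    sql, params = build_query(terms)
    return [row[0] for row in conn.execute(sql, params)]
```

Note that this sketch emits a flat clause list, so SQL's usual precedence applies (AND binds more tightly than OR); a form that supports grouping would need to emit parentheses as well.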

**Figure 6**
Conceptual structure of the decision tree model used for classification problems. Instances are sorted from root to leaf nodes, based on a number of properties defined at each node by splitting variables. Pictured is a decision tree built to predict the tendency for protein crystallization based on sequence features such as amino acid content, hydrophobicity and homology to other sequences. The nodes of the tree are represented by ellipses; the values to the left of each node indicate the number of proteins which are unable to crystallize, while those to the right denote the crystallized examples. The splitting threshold for each node appears directly under its associated variable. The decision tree algorithm calculates all possible splitting thresholds for each variable, selecting each variable and its threshold to optimize the homogeneity of the two subsequent nodes. When a variable v is split, the left branch is assigned to v < threshold and the right branch corresponds to v > threshold.
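The exhaustive threshold search described in this caption can be sketched as follows. The caption does not say which homogeneity measure is used, so Gini impurity is an assumption here, as are the function names and the choice of midpoints between consecutive distinct values as candidate thresholds.

```python
def gini(neg, pos):
    """Gini impurity of a node holding `neg` negative and `pos` positive cases."""
    n = neg + pos
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split of one variable.

    Cases with v < threshold go to one child and the rest to the other.
    labels are 0/1. Returns (None, inf) if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    best = (None, float("inf"))
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate equal values
        left_n, right_n = i, n - i
        right_pos = total_pos - left_pos
        score = (
            left_n * gini(left_n - left_pos, left_pos)
            + right_n * gini(right_n - right_pos, right_pos)
        ) / n
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, score)
    return best

def best_variable(features, labels):
    """features: dict of name -> values. Returns (name, threshold, impurity)."""
    results = ((name, *best_split(vals, labels)) for name, vals in features.items())
    return min(results, key=lambda r: r[2])
```

Applying `best_variable` recursively to the two resulting subsets would grow the tree one level at a time; stopping criteria and pruning are left out of this sketch.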

**Figure 7**
Decision trees built for solubility prediction. Tree pruning methods are designed to reduce the number of nodes and arrive at the smallest tree whose error rate performance is closest to the minimal error rate of the entire tree. (A and B) Uppermost levels of two decision trees, highlighting paths for classification rules. The original trees from which these subsets of nodes were derived are inset to the right. Decision tree (A) was built using the entire set of 562 proteins, while (B) was trained and tested on discrete randomized subsets of the proteomics data: 375 proteins were used for training and the remaining 187 for testing. Soluble and insoluble proteins are indicated by the numbers to the right and left of each node, respectively. In the case of decision tree (B) two values are used for each class, corresponding to the training (left) and testing (right) phases. Decision pathways which terminate in highly homogeneous nodes (mostly dark, soluble; mostly white, insoluble) and are not distant from the root define more robust rules which can generalize against unseen examples. Heterogeneous nodes could be further split by extending the tree downward, improving the error rate but overfitting the training set. The pathways indicated in each decision tree represent sets of rules. For instance, the right branching path of example (A) (indicated in green) selects mostly soluble proteins, based on the condition that the combined compositions of acidic residues [C(DE)] in their sequences exceed 18%. The left branching path of the same tree (in red) outlines the following set of conditions and classifies proteins which are likely to be insoluble: C(DE) < 18%; presence of a stretch of amino acids with average hydrophobicity < –0.78 kcal/mol (labeled Hphobe); fewer than 16% acidic amino acids and their amides [C(DENQ)]. (C) Thresholds at which each node partitions the input vectors in the upper levels of the two decision trees. 
At each level the nodes are listed sequentially from left to right [e.g. at level 2 in tree (A) the left-most node represents the splitting variable Hphobe having a threshold of –0.78 on the GES hydrophobicity scale, followed by a node in the right-most branch of the tree corresponding to the splitting variable Length with a threshold of 95 amino acids].
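The two highlighted pathways of tree (A) can be written out as explicit rules, as in the sketch below. The thresholds are the ones quoted in the caption; the function names are hypothetical, and the Hphobe value is taken as a precomputed input because the caption does not specify how the hydrophobic stretch is defined.

```python
def composition(seq, residues):
    """Percentage of positions in `seq` occupied by any of the given residues."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(seq.count(r) for r in residues) / len(seq)

def classify_solubility(seq, hphobe):
    """Apply the green and red pathways of tree (A).

    hphobe: lowest average hydrophobicity over a stretch of the sequence
    (kcal/mol, GES scale), assumed to be computed elsewhere.
    Returns 'likely soluble', 'likely insoluble' or 'no rule'.
    """
    c_de = composition(seq, "DE")
    if c_de > 18.0:
        # Right-branching path: acidic residue content exceeds 18%.
        return "likely soluble"
    if c_de < 18.0 and hphobe < -0.78 and composition(seq, "DENQ") < 16.0:
        # Left-branching path: all three conditions must hold.
        return "likely insoluble"
    return "no rule"
```

Proteins that satisfy neither pathway fall into other branches of the tree, which is why the sketch returns `'no rule'` rather than forcing a class.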
