BMC Bioinformatics. 2009 Dec 13;10:414.
doi: 10.1186/1471-2105-10-414.

Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences

Marcin J Mizianty et al. BMC Bioinformatics. 2009.

Abstract

Background: Knowledge of structural class is used by numerous methods for identifying structural and functional characteristics of proteins and could be used to detect remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and are designed for high-identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences.

Results: The proposed MODular Approach to Structural class prediction (MODAS) method is unique in that it allows selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes, including conservation of residues and the arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy, which ranges between 80% and 96.7% (83.5% for the twilight-zone datasets) depending on the dataset. This translates into 19% and 8% reductions in error rate compared with the best-performing competing method on the two largest datasets. The proposed predictor achieves 58% accuracy for the membrane proteins class, which most existing methods do not consider, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.
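
To make the idea of features based on the "arrangement and number of helix/strand segments" more concrete, the following is a minimal sketch and not the MODAS feature set itself. It assumes the predicted secondary structure is available as a three-state string (H for helix, E for strand, C for coil); the function name `segment_features` and the specific features it returns are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def segment_features(ss):
    """Summarize helix/strand segments in a 3-state string (H/E/C assumed).

    Hypothetical illustration only; not the feature set used by MODAS.
    """
    # Collapse runs of identical states into (state, run_length) segments.
    segments = [(state, len(list(run))) for state, run in groupby(ss)]
    helix_lengths = [n for state, n in segments if state == "H"]
    strand_lengths = [n for state, n in segments if state == "E"]
    # Order of helix and strand segments along the chain, ignoring coils.
    order = [state for state, _ in segments if state in ("H", "E")]
    # Number of times consecutive segments switch between helix and strand.
    alternations = sum(1 for a, b in zip(order, order[1:]) if a != b)
    return {
        "helix_segments": len(helix_lengths),
        "strand_segments": len(strand_lengths),
        "longest_helix": max(helix_lengths, default=0),
        "longest_strand": max(strand_lengths, default=0),
        "helix_strand_alternations": alternations,
    }


# Made-up predicted secondary structure string for demonstration.
example = "CCHHHHCCEEECCHHHHHCEEC"
features = segment_features(example)
```

In this toy representation, a chain whose helices and strands alternate frequently would score high on `helix_strand_alternations`, whereas a chain with segregated helix and strand regions would score low. This is one plausible way to encode arrangement; the paper's actual features may differ.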

Conclusions: The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.


Figures

Figure 1
Cartoon structures of proteins that cover the seven structural classes defined in the SCOP database. The panels show the structures of the proteins with PDB identifiers 1mty (a), 1a8d (b), 2f62 (c), 2bf5 (d), 1vqq (e), 1u7g (f), and 4hir (g). Helices are shown in light gray, coils in dark gray, and strands in black.
Figure 2
Distribution of sequences with respect to their maximal pairwise sequence identity in the D498 dataset.
Figure 3
Diagram of the proposed MODAS method.
Figure 4
Scatter plots for two representative features for each structural class (left column) and helix and strand contents (right column) for a) all-α; b) all-β; c) α/β; d) α+β; e) multi-domain; f) membrane and cell surface proteins; and g) small proteins classes. The plots were computed on the ASTRAL training dataset, and they use markers whose colors and shapes indicate the class and the number of protein chains, respectively, for a given combination of the values of the two features. The larger the marker, the more chains are found for the corresponding values of the two features. The darker the shading of the marker, the larger the fraction of those chains that belong to the target class.
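
As a rough illustration of the quantities plotted in the right column of Figure 4, the sketch below computes helix and strand contents and counts how many chains fall on each combination of values, which is what marker size represents. It assumes a three-state secondary structure string per chain (H, E, C) and rounds contents to whole percentages. The function names `ss_content` and `marker_counts`, and the rounding scheme, are hypothetical and not taken from the paper.

```python
from collections import Counter


def ss_content(ss):
    """Return (helix_content, strand_content) as fractions of chain length.

    Assumes a 3-state string with H = helix and E = strand.
    """
    if not ss:
        raise ValueError("secondary structure string must not be empty")
    n = len(ss)
    return ss.count("H") / n, ss.count("E") / n


def marker_counts(chains):
    """Count chains per (helix %, strand %) pair, rounded to whole percent."""
    counts = Counter()
    for ss in chains:
        helix, strand = ss_content(ss)
        counts[(round(helix * 100), round(strand * 100))] += 1
    return counts


# Made-up secondary structure strings for demonstration.
chains = ["HHHHCCEEEE", "HHHHCCEEEE", "CCCCC"]
counts = marker_counts(chains)
```

Each key of `counts` corresponds to one marker position, and its value would drive the marker size. Shading by the fraction of chains from a target class would need class labels, which this sketch omits.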
