Comparison of sequence and structure-based datasets for nonredundant structural data mining

Carmen K Chu¹, Lina L Feng, Merridee A Wouters

PMID: 16001417
DOI: 10.1002/prot.20505

Comparative Study

Proteins. 2005 Sep 1;60(4):577-83.

Affiliation

¹ Computational Biology and Bioinformatics Program, Victor Chang Cardiac Research Institute, Sydney, NSW, Australia.

Abstract

Structural data mining studies attempt to deduce general principles of protein structure from solved structures deposited in the Protein Data Bank (PDB). The entire database is unsuitable for such studies because it is not representative of the ensemble of protein folds. Given that novel folds continue to be unearthed, some folds are currently unrepresented in the PDB while other folds are overrepresented. Overrepresentation can easily be avoided by filtering the dataset. PDB_SELECT is a widely used representative subset of the PDB that has been deduced by sequence comparison. Specifically, structures with sequences that exhibit a pairwise sequence identity above a threshold value are removed from the dataset. Although length criteria for pairwise alignments have a structural basis, this automated method of pruning is essentially sequence-based and runs into problems in the twilight zone, possibly resulting in some folds being overrepresented. The value-added structure databases SCOP (Structural Classification of Proteins) and CATH (Class-Architecture-Topology-Homology) are also a potential source of a nonredundant dataset. Here we compare the sequence-derived dataset PDB_SELECT with the structural databases SCOP and CATH. We show that some folds remain overrepresented in the PDB_SELECT dataset while other folds are not represented at all. However, SCOP and CATH also have their own problems, such as the labor-intensiveness of the update process and the problem of determining whether all folds are equally or sufficiently distant. We discuss areas where further work is required.
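The pruning idea described in the abstract, and the way it can leave a fold overrepresented, can be illustrated with a minimal sketch. This is not the PDB_SELECT procedure: it is a simple greedy filter over toy sequences that are assumed to be already aligned to equal length. The function names, the 25% threshold, the example sequences, and the fold labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumes `a` and `b` are already aligned to the same length ('-' marks a gap).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def greedy_nonredundant(entries, threshold=25.0):
    """Keep an entry only if its identity to every kept entry is <= threshold.

    `entries` is a list of (name, aligned_sequence) tuples. The result depends
    on input order, as with any greedy selection.
    """
    kept = []
    for name, seq in entries:
        if all(percent_identity(seq, kseq) <= threshold for _, kseq in kept):
            kept.append((name, seq))
    return kept


def fold_counts(names, fold_of):
    """Count how many retained entries fall into each (hypothetical) fold label."""
    return Counter(fold_of[n] for n in names)


# Toy data: 'b' is a near-duplicate of 'a'; 'd' is a gapped copy of 'c';
# 'e' shares only 25% identity with 'a' but is labeled with the same fold.
entries = [
    ("a", "ACDEFGHI"),
    ("b", "ACDEFGHK"),
    ("c", "LMNPQRST"),
    ("d", "LMNPQ-ST"),
    ("e", "ACWWWWWW"),
]
fold_of = {"a": "fold1", "b": "fold1", "c": "fold2", "d": "fold2", "e": "fold1"}

kept_names = [name for name, _ in greedy_nonredundant(entries)]
print(kept_names)
print(fold_counts(kept_names, fold_of))
```

In this toy case the sequence filter removes the two near-duplicates but retains both `a` and `e`, which carry the same fold label despite low sequence identity. That is the kind of residual overrepresentation the abstract attributes to purely sequence-based pruning in the twilight zone.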
