Clustering of proximal sequence space for the identification of protein families
- PMID: 12117788
- DOI: 10.1093/bioinformatics/18.7.908
Abstract
Motivation: The study of sequence space, and the deciphering of the structure of protein families and subfamilies, has until now been essential for work in comparative genomics and for the prediction of protein function. With the emergence of structural proteomics projects, it is becoming increasingly important to select protein targets for structural studies that appropriately cover the space of protein sequences, functions and genomic distribution. These needs motivate the development of methods for clustering protein sequences and building families of potentially orthologous sequences, such as those proposed here.
Results: First, we developed a clustering strategy (the Ncut algorithm) capable of forming groups of related sequences by assessing their pairwise relationships. The results presented for the ras super-family of proteins are similar to those produced by other clustering methods, but are obtained without the need to cluster the full sequence space. The Ncut clusters are then used as the input to a process that reconstructs groups of closely related sequences with equilibrated genomic composition. The results of applying this technique to the data set used in the construction of the COG database are very similar to those derived by the human experts responsible for this database.
Availability: The analyses of different systems, including the 21 genomes equivalent to the COG data set, are available at http://www.pdg.cnb.uam.es/GenoClustering.html.
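The Results paragraph describes grouping sequences from their pairwise relationships with an Ncut strategy, but the abstract does not give the procedure itself. As an illustration only, the following is a minimal sketch of the generic normalized-cut idea, using the standard spectral relaxation on a symmetric similarity matrix. It is not the authors' implementation: the function names `ncut_bipartition` and `ncut_value`, the six-sequence toy matrix, and its scores are all assumptions made for this example.

```python
import numpy as np


def ncut_bipartition(similarity):
    """Split items into two groups with the spectral relaxation of the normalized cut.

    Assumes a symmetric, non-negative similarity matrix in which every item
    has at least one non-zero similarity (so all degrees are positive).
    """
    w = np.asarray(similarity, dtype=float)
    degree = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigenvectors = np.linalg.eigh(laplacian)
    # Eigenvector of the second-smallest eigenvalue, rescaled by D^{-1/2}
    indicator = d_inv_sqrt * eigenvectors[:, 1]
    # Thresholding at zero is one simple way to turn the vector into two groups
    return indicator > 0


def ncut_value(similarity, labels):
    """Normalized cut: cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)."""
    w = np.asarray(similarity, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    cut = w[labels][:, ~labels].sum()
    return cut / w[labels].sum() + cut / w[~labels].sum()


# Toy data (hypothetical): six sequences forming two tight groups of three.
# Scores are invented for illustration, not taken from the paper.
similarity = np.full((6, 6), 0.05)
similarity[:3, :3] = 0.9
similarity[3:, 3:] = 0.9
np.fill_diagonal(similarity, 0.0)

labels = ncut_bipartition(similarity)
print("group assignment:", labels.astype(int))
print("normalized cut:", round(ncut_value(similarity, labels), 4))
```

A low normalized cut value indicates that the two groups share little similarity relative to their total similarity. In a recursive scheme, each group could be split again until the cut value exceeds a chosen threshold; that stopping rule is likewise an assumption here, not something stated in the abstract.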