Protein family clustering for structural genomics
- PMID: 16185712
- DOI: 10.1016/j.jmb.2005.08.058
Abstract
A major goal of structural genomics is the provision of a structural template for a large fraction of protein domains. The magnitude of this task depends on the number and nature of protein sequence families. With a large number of bacterial genomes now fully sequenced, it is possible to obtain improved estimates of the number and diversity of families in that kingdom. We have used an automated clustering procedure to group all sequences in a set of genomes into protein families. Benchmarking shows that the clustering method is sensitive at detecting remote family members and has a low level of false positives. This comprehensive protein family set has been used to address the following questions. (1) What is the structure coverage for currently known families? (2) How will the number of known apparent families grow as more genomes are sequenced? (3) What is a practical strategy for maximizing structure coverage in the future? Our study indicates that approximately 20% of known families with three or more members currently have a representative structure. The study also indicates that the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes have been sequenced. However, the vast majority of these families will be small, and it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families.
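The closing point of the abstract, that sampling the larger families first covers most domains with relatively few structures, can be illustrated with a short sketch. This is not the authors' procedure. It assumes, for illustration only, that one representative structure covers every member of its family and that family size (number of member sequences) stands in for the number of domains covered. The function name `families_needed_for_coverage` and the toy family sizes are hypothetical and are not data from the paper.

```python
from typing import List


def families_needed_for_coverage(family_sizes: List[int], target: float) -> int:
    """Count the families, taken largest first, whose members together
    make up at least `target` (a fraction in (0, 1]) of all sequences.

    Assumption (not from the paper): one solved structure per family
    covers every member of that family.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    total = sum(family_sizes)
    if total == 0:
        raise ValueError("family_sizes must contain at least one member")

    covered = 0
    # Greedy order: always pick the largest remaining family next.
    for count, size in enumerate(sorted(family_sizes, reverse=True), start=1):
        covered += size
        if covered >= target * total:
            return count
    return len(family_sizes)


# Toy, made-up distribution: a few large families and many singletons.
toy_sizes = [500, 300, 100, 50, 20] + [1] * 30
needed_75 = families_needed_for_coverage(toy_sizes, 0.75)
needed_all = families_needed_for_coverage(toy_sizes, 1.0)
print(f"{needed_75} of {len(toy_sizes)} families reach 75% coverage")
print(f"{needed_all} of {len(toy_sizes)} families reach 100% coverage")
```

In this toy distribution the skew toward a few large families means a small number of structures reaches the 75% target, while full coverage requires one structure for every singleton family as well.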