Nucleic Acids Res. 2005 Jan 1;33(Database issue):D178-82.

doi: 10.1093/nar/gki060.

eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity

Qiaojuan Jane Su¹, Lin Lu, Serge Saxonov, Douglas L Brutlag

PMID: 15608172
PMCID: PMC540014
DOI: 10.1093/nar/gki060

Qiaojuan Jane Su et al. Nucleic Acids Res. 2005.

Affiliation

¹ Abgenix, Inc., 6701 Kaiser Drive, MS 11, Fremont, CA 94555, USA.

Abstract

Classifying proteins into families and superfamilies allows identification of functionally important conserved domains. The motifs and scoring matrices derived from such conserved regions provide computational tools that recognize similar patterns in novel sequences, and thus enable the prediction of protein function for genomes. The eBLOCKs database enumerates a cascade of protein blocks with varied conservation levels for each functional domain. A biologically important region is most stringently conserved among a smaller family of highly similar proteins. The same region is often found in a larger group of more remotely related proteins with a reduced stringency. Through enumeration, highly specific signatures can be generated from blocks with more columns and fewer family members, while highly sensitive signatures can be derived from blocks with fewer columns and more members as in a superfamily. By applying PSI-BLAST and a modified K-means clustering algorithm, eBLOCKs automatically groups protein sequences according to different levels of similarity. Multiple sequence alignments are made and trimmed into a series of ungapped blocks. Motifs and position-specific scoring matrices were derived from eBLOCKs and made available for sequence search and annotation. The eBLOCKs database provides a tool for high-throughput genome annotation with maximal specificity and sensitivity. The eBLOCKs database is freely available on the World Wide Web at http://motif.stanford.edu/eblocks/ to all users for online usage. Academic and not-for-profit institutions wishing copies of the program may contact Douglas L. Brutlag (brutlag@stanford.edu). Commercial firms wishing copies of the program for internal installation may contact Jacqueline Tay at the Stanford Office of Technology Licensing (jacqueline.tay@stanford.edu; http://otl.stanford.edu/).
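The trimming step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a block is a maximal run of alignment columns in which no sequence has a gap, and it uses a hypothetical `min_width` cutoff; the abstract does not state the actual trimming criteria used by eBLOCKs, and the toy alignment is invented for illustration.

```python
from typing import List, Tuple


def ungapped_blocks(
    alignment: List[str], min_width: int = 3, gap: str = "-"
) -> List[Tuple[int, List[str]]]:
    """Return (start_column, rows) for each maximal run of gap-free columns.

    `min_width` is a hypothetical cutoff, not a value taken from the paper.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    blocks = []
    start = None
    # One extra iteration flushes a run that ends at the last column.
    for col in range(length + 1):
        gap_free = col < length and all(row[col] != gap for row in alignment)
        if gap_free:
            if start is None:
                start = col
        elif start is not None:
            if col - start >= min_width:
                blocks.append((start, [row[start:col] for row in alignment]))
            start = None
    return blocks


# Toy alignment, invented for illustration only.
toy_alignment = [
    "MKV-LAGHWE--TRS",
    "MKVQLAGH-EAATRS",
    "MRV-LSGHWE-ATKS",
]
blocks = ungapped_blocks(toy_alignment, min_width=3)
for start, rows in blocks:
    print(start, rows)
```

Raising `min_width` keeps only the wider blocks, which loosely mirrors the trade-off the abstract describes between blocks with more columns and blocks with fewer.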

Figures

**Figure 1**
A typical PSI-BLAST result has multiple similarity modules. Group 1 contains sequences in Cluster 1; Group 2 contains sequences in Clusters 1 and 2; and Group 3 contains sequences in Clusters 1 and 3.

**Figure 2**
Clusters defined by K-means clustering are organized into groups. A typical conservation region is represented by multiple groups with different similarity levels, so as to maximize specificity and sensitivity. Group 8 contains sequences in Cluster 8; Group 2 contains sequences in Clusters 8 and 2; Group 9 contains sequences in Clusters 8, 2 and 9.
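The nested grouping in this caption can be sketched as follows. This is only an illustration of the Figure 2 example, where each group adds one cluster to the previous group and takes the number of the cluster it adds. The function name, the sequence identifiers, and the assumption that the cluster order is already known are all hypothetical; the caption does not say how the order is chosen, and Figure 1 shows that groups need not always be nested this way.

```python
from typing import Dict, Sequence, Set


def cumulative_groups(
    clusters: Dict[int, Set[str]], order: Sequence[int]
) -> Dict[int, Set[str]]:
    """Build nested groups by adding one cluster at a time, in `order`.

    Each group is keyed by the number of the cluster added last, as in the
    Figure 2 example. The order is assumed to be given (hypothetical input).
    """
    groups: Dict[int, Set[str]] = {}
    members: Set[str] = set()
    for cluster_id in order:
        # Union creates a new set, so earlier groups are not modified.
        members = members | clusters[cluster_id]
        groups[cluster_id] = members
    return groups


# Hypothetical sequence identifiers standing in for PSI-BLAST hits.
clusters = {8: {"seqA", "seqB"}, 2: {"seqC"}, 9: {"seqD", "seqE"}}
groups = cumulative_groups(clusters, order=[8, 2, 9])
for group_id, seqs in groups.items():
    print(group_id, sorted(seqs))
```

The smallest group corresponds to the most stringent level of similarity, and each larger group trades stringency for coverage.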

**Figure 3**
A flowchart for the eBLOCKs algorithm. Similarity groups that represent shared modules at different conservation levels are formed by the clustering and grouping of all the subject sequences returned by a PSI-BLAST search. Sequences in each group are aligned and the ungapped regions are excised to form several blocks. An eBLOCK accession number is composed of three parts: the SWISS-PROT accession number of the seed sequence, the group number as assigned by K-means clustering and the block number as the sequential number of trimmed blocks from the multiple sequence alignment for the group.

**Figure 4**
Statistics of the current eBLOCKs database. (a) The distribution of the average information content for the blocks. (b) The distribution of block width. (c) The distribution of the number of sequences contained in the blocks.
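As a rough illustration of the statistic in panel (a), the sketch below computes the average per-column information content of an ungapped block. It assumes a uniform background over the 20 amino acids and uses raw residue counts with no pseudocounts or sequence weighting; the caption does not give the definition actually used, so the paper's figure may rest on a different formula. The example block is invented.

```python
import math
from collections import Counter
from typing import List


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def average_information(block: List[str]) -> float:
    """Mean information content over all columns of an ungapped block."""
    width = len(block[0])
    columns = ("".join(row[i] for row in block) for i in range(width))
    return sum(column_information(col) for col in columns) / width


# Toy block, invented for illustration only.
block = ["LAGH", "LAGH", "LSGH"]
print(round(average_information(block), 3))
```

Under these assumptions a fully conserved column scores log2(20) bits, and any variation in a column lowers its score.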

Cited by

Choosing negative examples for the prediction of protein-protein interactions.
Ben-Hur A, Noble WS. BMC Bioinformatics. 2006 Mar 20;7 Suppl 1(Suppl 1):S2. doi: 10.1186/1471-2105-7-S1-S2. PMID: 16723005.
Analysis of the peroxiredoxin family: using active-site structure and sequence information for global classification and residue analysis.
Nelson KJ, Knutson ST, Soito L, Klomsiri C, Poole LB, Fetrow JS. Proteins. 2011 Mar;79(3):947-64. doi: 10.1002/prot.22936. Epub 2010 Dec 22. PMID: 21287625.
Protein structural modularity and robustness are associated with evolvability.
Rorick MM, Wagner GP. Genome Biol Evol. 2011;3:456-75. doi: 10.1093/gbe/evr046. Epub 2011 May 21. PMID: 21602570.
MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences.
Hsu CM, Chen CY, Liu BJ. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W356-61. doi: 10.1093/nar/gkl309. Corrected and republished in: Nucleic Acids Res. 2008 Mar;36(4):1400-6. doi: 10.1093/nar/gkm717. PMID: 16845025.
InSite: a computational method for identifying protein-protein interaction binding sites on a proteome-wide scale.
Wang H, Segal E, Ben-Hur A, Li QR, Vidal M, Koller D. Genome Biol. 2007;8(9):R192. doi: 10.1186/gb-2007-8-9-r192. PMID: 17868464.

Grants and funding

HG02235-07/HG/NHGRI NIH HHS/United States
