Comparative Study

Genome Res. 2005 Apr;15(4):537-51.

doi: 10.1101/gr.3096505.

Functional insights from the distribution and role of homopeptide repeat-containing proteins

Noel G Faux¹, Stephen P Bottomley, Arthur M Lesk, James A Irving, John R Morrison, Maria Garcia de la Banda, James C Whisstock

Affiliation

¹ Protein Crystallography Unit, Department of Biochemistry and Molecular Biology, School of Computer Science and Software Engineering, Monash University, Clayton Campus, Melbourne, VIC 3800, Australia.

PMID: 15805494
PMCID: PMC1074368
DOI: 10.1101/gr.3096505

Noel G Faux et al. Genome Res. 2005 Apr.

Abstract

Expansion of "low complex" repeats of amino acids such as glutamine (poly-Q) is associated with protein misfolding and the development of degenerative diseases such as Huntington's disease. The mechanism by which such regions promote misfolding remains controversial, the function of many repeat-containing proteins (RCPs) remains obscure, and the role (if any) of repeat regions remains to be determined. Here, a Web-accessible database of RCPs is presented. The distribution and evolution of RCPs that contain homopeptide repeat tracts are considered, and the existence of functional patterns is investigated. Generally, it is found that while polyamino acid repeats are extremely rare in prokaryotes, several eukaryote putative homologs of prokaryote RCPs (involved in important housekeeping processes) retain the repetitive region, suggesting an ancient origin for certain repeats. Within eukarya, the most common uninterrupted amino acid repeats are glutamine, asparagine, and alanine. Interestingly, while poly-Q repeats are found in vertebrates and nonvertebrates, poly-N repeats are only common in more primitive nonvertebrate organisms, such as insects and nematodes. We have assigned function to eukaryote RCPs using Online Mendelian Inheritance in Man (OMIM), the Human Protein Reference Database (HPRD), FlyBase, and Wormpep. Prokaryote RCPs were annotated using BLASTp searches and Gene Ontology. These data reveal that the majority of RCPs are involved in processes that require the assembly of large, multiprotein complexes, such as transcription and signaling.
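As an illustration of what an "uninterrupted amino acid repeat" means in practice, the following minimal Python sketch scans a protein sequence for runs of a single residue. It is not the authors' pipeline: the function name `find_homopeptide_repeats`, the minimum run length of 5, and the toy sequence are all assumptions made for this example, since the abstract does not state the threshold used.

```python
import re

# Assumption: a run of at least MIN_LEN identical residues counts as a
# homopeptide repeat. The abstract does not give the actual threshold.
MIN_LEN = 5


def find_homopeptide_repeats(sequence, min_len=MIN_LEN):
    """Return (residue, start, length) for each uninterrupted single-residue run.

    `start` is a 0-based index into `sequence`.
    """
    # One captured residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [
        (m.group(1), m.start(), len(m.group(0)))
        for m in pattern.finditer(sequence.upper())
    ]


# Hypothetical toy sequence with a poly-Q, a poly-N, and a poly-A tract.
toy_sequence = "MK" + "Q" * 7 + "LAP" + "N" * 6 + "GS" + "A" * 5 + "T"

for residue, start, length in find_homopeptide_repeats(toy_sequence):
    print(f"poly-{residue}: start={start}, length={length}")
```

Counting the tuples per residue across a sequence database would give tallies comparable in spirit to the per-residue repeat counts discussed above, under the stated length assumption.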

Figures

**Figure 1.**
Distribution of RCPs in GENPEPT. (A) The distribution of the RCPs in GENPEPT and the total number of repeats in GENPEPT. The bars represent the total number of repeats and the solid diamonds the number of RCPs. (B) Distribution of the repeats based on physicochemical class (polar, hydrophobic, acidic, and basic). Red bars represent the number of repeats for the amino acid class normalized for the amino acid frequencies in GENPEPT for that amino acid class (i.e., number of repeats in class X/[amino acid frequency for class X, not including the RCPs]). The solid blue diamonds represent frequency of the amino acid class in the RCPs.
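The normalization described for panel B (number of repeats in class X divided by the background amino acid frequency for class X) can be sketched as follows. This is a minimal illustration only: the function name `normalize_repeat_counts` and all numbers are made up for the example and are not taken from the paper.

```python
def normalize_repeat_counts(repeat_counts, background_freq):
    """Divide each class's repeat count by that class's background frequency.

    repeat_counts:   {class name: number of repeats in that class}
    background_freq: {class name: fraction of residues in that class,
                      measured outside the repeat-containing proteins}
    """
    normalized = {}
    for cls, count in repeat_counts.items():
        freq = background_freq[cls]
        if freq <= 0:
            raise ValueError(f"background frequency for {cls!r} must be positive")
        normalized[cls] = count / freq
    return normalized


# Hypothetical inputs, chosen only to show the arithmetic.
toy_counts = {"polar": 120, "hydrophobic": 40, "acidic": 30, "basic": 10}
toy_freq = {"polar": 0.5, "hydrophobic": 0.25, "acidic": 0.125, "basic": 0.125}

for cls, value in normalize_repeat_counts(toy_counts, toy_freq).items():
    print(f"{cls}: {value:.1f}")
```

Dividing by the background frequency means a class that is rare overall but rich in repeats stands out, which a raw count would hide.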

**Figure 2.**
Length of homopeptide repeats. Three-dimensional plot showing repeat length (x-axis) versus amino acid type (y-axis; also highlighted in key) versus percentage of each repeat class of a particular length limited to those repeats <51 amino acids in length. A blank square indicates that no repeat of that length and type exists in GENPEPT. There are repeats >50 amino acids in length; however, these are infrequent, with lengths up to 410 amino acids and a sporadic distribution.

**Figure 3.**
Distribution of RCPs in eukaryotes. (A) Distribution of the RCPs in vertebrate species, *Homo sapiens* (human), *Mus musculus* (mouse), *Rattus norvegicus* (rat), *Xenopus laevis* (frog), *Danio rerio* (fish), and *Gallus gallus* (chicken). (B) Distribution of RCPs in nonvertebrate species, *Drosophila melanogaster* (fly), *Oryza sativa* (rice), *Arabidopsis thaliana* (thale cress), *Saccharomyces cerevisiae* (yeast), *Plasmodium falciparum* (malaria), *Anopheles gambiae str.* PEST (mosquito), *Caenorhabditis elegans* (nematode), and *Triticum aestivum* (wheat). The genomes of the chosen species have been completed or are near completion.

**Figure 4.**
Multiple sequence alignment of DnaJ. A multiple alignment of the glycine repeat in DnaJ (Hsp40). From prokarya: *Bacillus halodurans, Clostridium thermocellum, Sinorhizobium meliloti, Shigella flexneri, Magnetospirillum magnetotacticum, Geobacter sulfurreducens* PCA, *Leptospira interrogans serovar lai str.* 56601, *Thermosynechococcus elongatus* BP-1, *Nostoc sp.* PCC 7120, *Prochlorococcus marinus, Trichodesmium erythraeum* IMS101, *Methanosarcina mazei* Goe1, *Treponema pallidum, Rubrobacter xylanophilus* DSM 9941, *Fusobacterium nucleatum subsp. polymorphum, Methanothermobacter thermautotrophicus str.* Delta H, *Pirellula sp., Chlamydia muridarum, Parachlamydia sp.* UWE25, *Bacteroides thetaiotaomicron* VPI-5482, *Acholeplasma laidlawii, Chlorobium tepidum* TLS, *Thermobifida fusca, Halobacterium sp.* NRC-1, and *Deinococcus radiodurans*. From eukarya: *H. sapiens, R. norvegicus, M. musculus, D. rerio, A. gambiae str.* PEST, *C. elegans*, and *P. falciparum*. The boxed region from positions 20–106 highlights the glycine/phenylalanine-rich region. The sequences were manually positioned with respect to the first instance of the highly conserved motif DxF (boxed, positions 58–60). The regions to the *left* and *right* of the boxed region were aligned with T-COFFEE (Notredame et al. 2000) and the final figure was generated with ALSCRIPT (Barton 1993).

**Figure 5.**
Multiple sequence alignment of the ribosomal protein L12. A multiple sequence alignment of the glutamic acid repeat in L12 from the prokaryote species *M. thermautotrophicus* str. Delta H, *Archaeoglobus fulgidus* DSM 4304, *Pyrococcus abyssi, M. kandleri* AV19, *Methanosarcina barkeri, Methanosarcina mazei* Goe1, *Halobacterium sp.* NRC-1, *Haloarcula marismortui, Haloferax volcanii, Nanoarchaeum equitans* Kin4-M, *Ferroplasma acidarmanus, Sulfolobus solfataricus, Thermoplasma acidophilum*, and the eukaryote species *M. musculus, H. sapiens, R. norvegicus, D. rerio, C. elegans* (A) gi 25141400, (B) gi 17543850, *S. cerevisiae* (A) gi 171813, (B) gi 171815, (C) gi 236358, *O. sativa, Eremothecium gossypii*, and *P. falciparum*. The boxed region from positions 98–144 highlights the two amino acid-rich regions, the N-terminal alanine-rich region and the C-terminal glutamic acid-rich region. Since no obvious alignment could be built of this region, the sequences were flushed *left*. The regions to the *left* and *right* of the boxed region were aligned with T-COFFEE (Notredame et al. 2000) and then manually adjusted. The final figure was generated with ALSCRIPT (Barton 1993).

**Figure 6.**
Multiple sequence alignment of the ribosomal protein L10. A multiple alignment of the glutamic acid-rich region in L10 from the following prokaryote species: *Halobacterium sp.* NRC-1, *Haloarcula marismortui, Haloferax volcanii, Sulfolobus solfataricus, M. vannielii, Methanosarcina acetivorans str.* C2A, *M. kandleri* AV19, *M. thermautotrophicus str.* Delta H, *Archaeoglobus fulgidus* DSM 4304, *P. abyssi*, and the eukaryote species *H. sapiens, R. norvegicus, D. rerio, A. gambiae str.* PEST, *C. elegans*, and *P. falciparum*. The boxed region from positions 52–117 highlights the acidic tail. Since no obvious alignment could be built of this region, the sequences were flushed *left*. The regions to the *left* and *right* of the boxed region were aligned with T-COFFEE (Notredame et al. 2000) and then manually adjusted. The final figure was generated with ALSCRIPT (Barton 1993).

**Figure 7.**
Function of RCPs. (A) Bar graph showing the function of human RCPs, based upon OMIM. Amino acid types are represented by different colors (see key). On the x-axis, the functional classes DNA RNA (i.e., Transcription/chromatin binding/DNA binding/RNA binding/translation), signaling, unknown, Enzyme, structural, transport protein, adhesion, channel, metabolism, and other are shown. (B) Pie chart showing the function of *D. melanogaster* RCPs, (C) pie chart showing the function of *C. elegans* RCPs, (D) pie chart showing the function of prokaryote RCPs.

**Figure 8.**
General role of repeat regions. It is suggested that RCPs (red) function from within large multiprotein and/or nucleic acid complexes (green circle). An example is shown where a two-domain protein (pink circles) functions via a flexible repeat to recruit an additional binding partner (blue circle).

References

1. Akey, C.W. and Luger, K. 2003. Histone chaperones and nucleosome assembly. Curr. Opin. Struct. Biol. 13: 6-14.
1. Alba, M.M. and Guigo, R. 2004. Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 14: 549-554.
1. Alba, M.M., Laskowski, R.A., and Hancock, J.M. 2002. Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics 18: 672-678.
1. Barton, G.J. 1993. ALSCRIPT: A tool to format multiple sequence alignments. Protein Eng. 6: 37-40.
1. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141.

Web site references

1. http://repeats.med.monash.edu.au; A database of homopeptide repeats.
1. http://www.hprd.org/; Human Protein Reference Database.
1. ftp://ftp.ncbi.nih.gov/blast/db/; NCBI ftp site of available databases.
1. http://www.ncbi.nlm.nih.gov/omim/; Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 2000.
1. http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/current/wormpep.shtml; Wormpep.
