Comparative Study
Genome Res. 2005 Apr;15(4):537-51.
doi: 10.1101/gr.3096505.

Functional insights from the distribution and role of homopeptide repeat-containing proteins

Noel G Faux et al. Genome Res. 2005 Apr.

Abstract

Expansion of "low-complexity" repeats of amino acids such as glutamine (poly-Q) is associated with protein misfolding and the development of degenerative diseases such as Huntington's disease. The mechanism by which such regions promote misfolding remains controversial, the function of many repeat-containing proteins (RCPs) remains obscure, and the role (if any) of repeat regions remains to be determined. Here, a Web-accessible database of RCPs is presented. The distribution and evolution of RCPs that contain homopeptide repeat tracts are considered, and the existence of functional patterns is investigated. Generally, it is found that while polyamino acid repeats are extremely rare in prokaryotes, several eukaryote putative homologs of prokaryote RCPs, which are involved in important housekeeping processes, retain the repetitive region, suggesting an ancient origin for certain repeats. Within eukarya, the most common uninterrupted amino acid repeats are glutamine, asparagine, and alanine. Interestingly, while poly-Q repeats are found in vertebrates and nonvertebrates, poly-N repeats are only common in more primitive nonvertebrate organisms, such as insects and nematodes. We have assigned function to eukaryote RCPs using Online Mendelian Inheritance in Man (OMIM), the Human Protein Reference Database (HPRD), FlyBase, and Wormpep. Prokaryote RCPs were annotated using BLASTp searches and Gene Ontology. These data reveal that the majority of RCPs are involved in processes that require the assembly of large, multiprotein complexes, such as transcription and signaling.
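
As an illustration of what an uninterrupted homopeptide repeat is, the following minimal sketch scans a protein sequence for runs of a single residue. It is not the procedure used to build the database described above: the function name `find_homopeptide_runs`, the minimum run length of 5, and the example sequence are all assumptions made for illustration, since the abstract does not state a length threshold.

```python
import re

# Assumed threshold; the abstract does not give the minimum repeat length.
MIN_RUN = 5


def find_homopeptide_runs(seq, min_run=MIN_RUN):
    """Return (residue, start_index, length) for each uninterrupted
    run of a single amino acid that is at least min_run residues long.

    seq is a protein sequence in one-letter code (upper case).
    start_index is 0-based.
    """
    # ([A-Z]) captures one residue; \1{n,} requires at least n more copies.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_run - 1))
    return [
        (m.group(1), m.start(), len(m.group(0)))
        for m in pattern.finditer(seq)
    ]


# Made-up sequence containing a poly-Q and a poly-N tract.
example = "MAQQQQQQQLPNNNNNGA"
runs = find_homopeptide_runs(example)
for residue, start, length in runs:
    print(residue, start, length)
```

Any protein for which the returned list is non-empty would count as an RCP under this assumed threshold; changing `min_run` changes which proteins qualify.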

Figures

Figure 1.
Distribution of RCPs in GENPEPT. (A) The distribution of the RCPs in GENPEPT and the total number of repeats in GENPEPT. The bars represent the total number of repeats and the solid diamonds the number of RCPs. (B) Distribution of the repeats based on physicochemical class (polar, hydrophobic, acidic, and basic). Red bars represent the number of repeats for the amino acid class normalized for the amino acid frequencies in GENPEPT for that amino acid class (i.e., number of repeats in class X/[amino acid frequency for class X, not including the RCPs]). The solid blue diamonds represent frequency of the amino acid class in the RCPs.
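
The normalization described for panel B (number of repeats in a class divided by the background amino acid frequency of that class) can be sketched as follows. The function name `normalized_repeat_counts` and all numbers in the example are hypothetical placeholders, not values from the paper, and the assignment of amino acids to classes is assumed to have been done beforehand.

```python
def normalized_repeat_counts(repeat_counts, background_freq):
    """Divide the number of repeats in each physicochemical class by the
    background frequency of that class (computed without the RCPs).

    repeat_counts:   dict mapping class name -> number of repeats
    background_freq: dict mapping class name -> fraction of all residues
    """
    return {
        cls: repeat_counts[cls] / background_freq[cls]
        for cls in repeat_counts
    }


# Placeholder inputs for illustration only (not data from the paper).
counts = {"polar": 30, "acidic": 10}
freqs = {"polar": 0.5, "acidic": 0.25}
print(normalized_repeat_counts(counts, freqs))
```

Dividing by the background frequency means a class that is common in the database overall is not reported as repeat-rich simply because its residues are abundant.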
Figure 2.
Length of homopeptide repeats. Three-dimensional plot showing repeat length (x-axis) versus amino acid type (y-axis; also highlighted in key) versus percentage of each repeat class of a particular length limited to those repeats <51 amino acids in length. A blank square indicates that no repeat of that length and type exists in GENPEPT. There are repeats >50 amino acids in length; however, these are infrequent, with lengths up to 410 amino acids and a sporadic distribution.
Figure 3.
Distribution of RCPs in eukaryotes. (A) Distribution of the RCPs in vertebrate species, Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Xenopus laevis (frog), Danio rerio (fish), and Gallus gallus (chicken). (B) Distribution of RCPs in nonvertebrate species, Drosophila melanogaster (fly), Oryza sativa (rice), Arabidopsis thaliana (thale cress), Saccharomyces cerevisiae (yeast), Plasmodium falciparum (malaria), Anopheles gambiae str. PEST (mosquito), Caenorhabditis elegans (nematode), and Triticum aestivum (wheat). The genomes of the chosen species have been completed or are near completion.
Figure 4.
Multiple sequence alignment of DnaJ. A multiple alignment of the glycine repeat in DnaJ (Hsp40). From prokarya: Bacillus halodurans, Clostridium thermocellum, Sinorhizobium meliloti, Shigella flexneri, Magnetospirillum magnetotacticum, Geobacter sulfurreducens PCA, Leptospira interrogans serovar lai str. 56601, Thermosynechococcus elongatus BP-1, Nostoc sp. PCC 7120, Prochlorococcus marinus, Trichodesmium erythraeum IMS101, Methanosarcina mazei Goe1, Treponema pallidum, Rubrobacter xylanophilus DSM 9941, Fusobacterium nucleatum subsp. polymorphum, Methanothermobacter thermautotrophicus str. Delta H, Pirellula sp., Chlamydia muridarum, Parachlamydia sp. UWE25, Bacteroides thetaiotaomicron VPI-5482, Acholeplasma laidlawii, Chlorobium tepidum TLS, Thermobifida fusca, Halobacterium sp. NRC-1, and Deinococcus radiodurans. From eukarya: H. sapiens, R. norvegicus, M. musculus, D. rerio, A. gambiae str. PEST, C. elegans, and P. falciparum. The boxed region from positions 20–106 highlights the glycine/phenylalanine-rich region. The sequences were manually positioned with respect to the first instance of the highly conserved motif DxF (boxed, position 58–60). The regions to the left and right of the boxed region were aligned with T-COFFEE (Notredame et al. 2000) and the final figure was generated with ALSCRIPT (Barton 1993).
Figure 5.
Multiple sequence alignment of the ribosomal protein L12. A multiple sequence alignment of the glutamic acid repeat in L12 from the prokaryote species M. thermautotrophicus str. Delta H, Archaeoglobus fulgidus DSM 4304, Pyrococcus abyssi, M. kandleri AV19, Methanosarcina barkeri, Methanosarcina mazei Goe1, Halobacterium sp. NRC-1, Haloarcula marismortui, Haloferax volcanii, Nanoarchaeum equitans Kin4-M, Ferroplasma acidarmanus, Sulfolobus solfataricus, Thermoplasma acidophilum, and the eukaryote species M. musculus, H. sapiens, R. norvegicus, D. rerio, C. elegans (A) gi 25141400, (B) gi 17543850, S. cerevisiae (A) gi 171813, (B) gi 171815, (C) gi 236358, O. sativa, Eremothecium gossypii, P. falciparum. The boxed region from positions 98–144 highlights the two amino acid-rich regions, the N-terminal alanine-rich region and the C-terminal glutamic acid-rich region. Since no obvious alignment could be built of this region, the sequences were flushed left. The regions to the left and right of the boxed region were aligned with T-COFFEE (Notredame et al. 2000) and then manually adjusted. The final figure was generated with ALSCRIPT (Barton 1993).
Figure 6.
Multiple sequence alignment of the ribosomal protein L10. A multiple alignment of the glutamic acid-rich region in L10 from the following prokaryote species: Halobacterium sp. NRC-1, Haloarcula marismortui, Haloferax volcanii, Sulfolobus solfataricus, M. vannielii, Methanosarcina acetivorans str. C2A, M. kandleri AV19, M. thermautotrophicus str. Delta H, Archaeoglobus fulgidus DSM 4304, P. abyssi, and the eukaryote species H. sapiens, R. norvegicus, D. rerio, A. gambiae str. PEST, C. elegans, P. falciparum. The boxed region from positions 52–117 highlights the acidic tail. Since no obvious alignment could be built of this region, the sequences were flushed left. The regions to the left and right of the boxed region were aligned with T-COFFEE (Notredame et al. 2000) and then manually adjusted. The final figure was generated with ALSCRIPT (Barton 1993).
Figure 7.
Function of RCPs. (A) Bar graph showing the function of human RCPs, based upon OMIM. Amino acid types are represented by different colors (see key). On the x-axis, the functional classes DNA/RNA (i.e., transcription/chromatin binding/DNA binding/RNA binding/translation), signaling, unknown, enzyme, structural, transport protein, adhesion, channel, metabolism, and other are shown. (B) Pie chart showing the function of D. melanogaster RCPs. (C) Pie chart showing the function of C. elegans RCPs. (D) Pie chart showing the function of prokaryote RCPs.
Figure 8.
General role of repeat regions. It is suggested that RCPs (red) function from within large multiprotein and/or nucleic acid complexes (green circle). An example is shown where a two-domain protein (pink circles) functions via a flexible repeat to recruit an additional binding partner (blue circle).

References

    1. Akey, C.W. and Luger, K. 2003. Histone chaperones and nucleosome assembly. Curr. Opin. Struct. Biol. 13: 6-14.
    2. Alba, M.M. and Guigo, R. 2004. Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 14: 549-554.
    3. Alba, M.M., Laskowski, R.A., and Hancock, J.M. 2002. Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics 18: 672-678.
    4. Barton, G.J. 1993. ALSCRIPT: A tool to format multiple sequence alignments. Protein Eng. 6: 37-40.
    5. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141.

Web site references

    1. http://repeats.med.monash.edu.au; A database of homopeptide repeats.
    2. http://www.hprd.org/; Human Protein Reference Database.
    3. ftp://ftp.ncbi.nih.gov/blast/db/; NCBI ftp site of available databases.
    4. http://www.ncbi.nlm.nih.gov/omim/; Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 2000.
    5. http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/current/wormpep.shtml; Wormpep.
