PLoS Comput Biol. 2006 Aug 25;2(8):e114.
doi: 10.1371/journal.pcbi.0020114. Epub 2006 Jul 14.

Expansion of protein domain repeats

Asa K Björklund et al. PLoS Comput Biol.

Abstract

Many proteins, especially in eukaryotes, contain tandem repeats of several domains from the same family. These repeats have a variety of binding properties and are involved in protein-protein interactions as well as binding to other ligands such as DNA and RNA. The rapid expansion of protein domain repeats is assumed to have evolved through internal tandem duplications. However, the exact mechanisms behind these tandem duplications are not well understood. Here, we have studied the evolution, function, protein structure, gene structure, and phylogenetic distribution of domain repeats. For this purpose we have assigned Pfam-A domain families to 24 proteomes with more sensitive domain assignments in the repeat regions. These assignments confirmed previous findings that eukaryotes, and in particular vertebrates, contain a much higher fraction of proteins with repeats than prokaryotes do. The internal sequence similarity in each protein revealed that the domain repeats are often expanded through duplications of several domains at a time, while the duplication of one domain is less common. Many of the repeats appear to have been duplicated in the middle of the repeat region. This is in strong contrast to the evolution of other proteins, which mainly works through additions of single domains at either terminus. Further, we found that some domain families show distinct duplication patterns, e.g., nebulin domains have mainly been expanded with a unit of seven domains at a time, while duplications of other domain families involve varying numbers of domains. Finally, no common mechanism for the expansion of all repeats could be detected. We found that the duplication patterns show no dependence on the size of the domains. Further, repeat expansion in some families can possibly be explained by shuffling of exons. However, exon shuffling could not have created all repeats.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Domain Assignments and Exon Structure for the Chicken Nebulin Protein ENSGALP00000020382
The initial domain assignments (D), using an E-value cutoff of 0.1, detected 51 nebulin domains. With a less strict cutoff, we were able to assign 15 additional domains. Still, there are four gaps (regions with no domain assignment), which are likely to contain domains that cannot be detected with the current HMMs. Below the domain assignments, the exon structure (E) is shown, with a box for each of the 44 exons. It is evident that a block of four exons (a long one in black, two short ones in white, and one of intermediate size in gray) corresponds to a block of seven domains, even though the exon borders are all found within the domains.
Figure 2. Fraction of Proteins That Contain a Domain Repeat in Archaea, Bacteria, Yeast, and the Eight Multicellular Eukaryotes (Sorted by Number of Proteins)
The different patterns indicate the length of the repeat, i.e., whether it contains 2, 3, 4 domains, etc. The eukaryotic species are labeled with the abbreviations of species names such as Hsa for Homo sapiens followed by the number of proteins in each proteome. For a list of all species in this study, see Materials and Methods.
Figure 3. Overview of the Methodology
(A) In a protein with five domains, a unit of three N-terminal domains has been duplicated in tandem. (B) To identify this evolutionary event, alignment of all domain pairs in the protein is performed. (C) The alignment scores between the domains are displayed in a matrix, with increasing color intensity for higher scores. The diagonal shows the alignment score of each domain to itself, while square 1,2 gives the score between the first and the second domain. A pattern where domain pairs 3–6, 4–7, and 5–8 have the highest alignment scores can be seen. (D) From the alignment scores, an ACV is calculated as the mean alignment score at each distance, normalized around zero. The distance between the domains is defined as one for neighbouring domains, two for domain pairs with one domain between them, etc. In this example a peak at distance three can be seen. Hence, we assume that this protein has evolved through the duplication of three domains.
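The ACV calculation described in panels (C) and (D) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `compute_acv` and the toy score values are hypothetical, and "normalized around zero" is interpreted here as subtracting the mean score over all distinct domain pairs in the protein.

```python
def compute_acv(scores):
    """Mean pairwise score at each domain distance, centred on zero.

    `scores` is a symmetric n x n matrix of alignment scores; the
    diagonal (self-alignments) is ignored. Centring by the overall
    pairwise mean is an assumption made for this sketch.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two domains")
    # (distance, score) for every distinct domain pair i < j
    pairs = [(j - i, scores[i][j]) for i in range(n) for j in range(i + 1, n)]
    overall_mean = sum(s for _, s in pairs) / len(pairs)
    acv = {}
    for d in range(1, n):
        at_d = [s for dist, s in pairs if dist == d]
        acv[d] = sum(at_d) / len(at_d) - overall_mean
    return acv

# Toy 8-domain protein mimicking the figure: domain pairs 3-6, 4-7 and 5-8
# (1-based) score high, all other pairs get a baseline score.
n = 8
toy = [[10.0] * n for _ in range(n)]
for i, j in [(2, 5), (3, 6), (4, 7)]:  # 0-based indices of the same pairs
    toy[i][j] = toy[j][i] = 50.0
for i in range(n):
    toy[i][i] = 100.0  # self-alignment score, not used by compute_acv

acv = compute_acv(toy)
peak = max(acv, key=acv.get)
print(peak)  # 3
```

In this toy case the only positive ACV value is at distance three, which matches the reading of the figure: a unit of three domains was duplicated.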
Figure 4. Pattern of Internal Domain Duplications in Two Human Proteins, ENSP00000319007 and ENSP00000303696, both with C2H2 Zinc Finger Repeats
(A) ENSP00000319007. (B) ENSP00000303696. The intensity of the squares reflects the alignment score with darker color for higher scores. The numbers at each axis indicate the domains in N-to-C terminal orientation within the repeat. In these two examples, patterns of duplication of six domains (A) and two domains (B) can be seen.
Figure 5. Pattern of Internal Domain Duplications in the Chicken Protein ENSGALP00000020382, with 66 Repeating Nebulin Domains (Pfam)
(A) The intensity of the squares is related to alignment scores, and the numbers on both axes indicate the domains in N-to-C terminal orientation. As there were gaps in the repeat sequence (Figure 1), these were introduced as domains at positions 6, 18, 25, and 32. (B) ACV calculated from the alignment scores in (A), giving the average similarity to domains at distance 1, 2, 3, etc. The ACVs are normalized around zero; hence the dotted line at zero is the mean score between all domains in the protein. The ACV was calculated before introducing the gaps as domains (dashed line) and after (solid line). When the regions with no domain assignments were regarded as domains, the pattern of seven repeating units became much clearer, indicating that the gaps are also domains.
Figure 6. ACVs for Proteins with Repeats of Eight Different Domain Families
The solid lines show ACVs for proteins with repeats of eight different domain families. In the bottom right diagram, the ACV for all proteins with repeats is displayed. The ACV for each family was normalized around zero; hence the dashed line at zero is the mean bit score between all domains in the family. The p-value for each data point was calculated from random shuffling of domains, and peaks with p-values below 10^-5 are indicated with an asterisk. The dotted line illustrates the fraction of repeats of the domain family with each repeat length, i.e., nonrepeated proteins have length one. The numbers of proteins and domains that go into each diagram can be found in Materials and Methods. Data for the remaining domain families can be found in Figure S2.
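The caption states only that p-values come from random shuffling of domains. One way such a test could look is sketched below, assuming that the domain order within a single protein is permuted and the mean score at a given distance is recomputed each time. The function names, the toy matrix, the number of shuffles, and the add-one correction are all choices made for this illustration, not details from the paper.

```python
import random

def mean_score_at_distance(scores, order, d):
    """Mean pairwise score between domains d positions apart in `order`."""
    vals = [scores[order[i]][order[i + d]] for i in range(len(order) - d)]
    return sum(vals) / len(vals)

def shuffle_p_value(scores, d, n_shuffles=1000, seed=0):
    """Empirical one-sided p-value for a peak at distance d.

    Counts how often a random domain order gives a mean score at
    distance d at least as high as the observed order. The overall
    pairwise mean does not change under shuffling, so comparing
    uncentred means is equivalent to comparing centred ACV values.
    """
    n = len(scores)
    if not 1 <= d < n:
        raise ValueError("distance must satisfy 1 <= d < number of domains")
    rng = random.Random(seed)
    order = list(range(n))
    observed = mean_score_at_distance(scores, order, d)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(order)
        if mean_score_at_distance(scores, order, d) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical 8-domain protein with high scores at distance three.
n = 8
toy = [[10.0] * n for _ in range(n)]
for i, j in [(2, 5), (3, 6), (4, 7)]:
    toy[i][j] = toy[j][i] = 50.0

p_at_3 = shuffle_p_value(toy, 3)
p_at_1 = shuffle_p_value(toy, 1)
```

With this estimator the smallest reportable p-value is 1/(n_shuffles + 1), so resolving values below 10^-5 would require more than 100,000 shuffles per data point.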
Figure 7. Hierarchical Clustering of the ACVs from Each Protein
(A) Dendrogram of the 20 clusters. Each cluster is indicated by a cluster number followed by the number of proteins in the cluster. (B) The average ACV for each cluster, with red color for values below the average and green for values above. (C) Distribution of the ten largest domain families, as well as nebulin, in the different clusters. The expected number of proteins from a domain family in each cluster was calculated using random shuffling, and Z-scores for overrepresentation (green) and underrepresentation (red) in the cluster were calculated. The number after each domain family name is the number of repeats of that family.
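The Z-scores in panel (C) can be illustrated with a short sketch. It assumes that the shuffling randomly reassigns cluster labels among proteins while keeping cluster sizes fixed; the caption does not specify this, and the function name `enrichment_z` and the example data are hypothetical.

```python
import random
import statistics

def enrichment_z(families, clusters, family, cluster, n_shuffles=1000, seed=0):
    """Z-score for how often `family` falls in `cluster` versus shuffled labels.

    `families[k]` and `clusters[k]` describe protein k. Positive values
    suggest overrepresentation, negative values underrepresentation.
    """
    rng = random.Random(seed)

    def count(cluster_labels):
        return sum(
            1
            for f, c in zip(families, cluster_labels)
            if f == family and c == cluster
        )

    observed = count(clusters)
    shuffled = list(clusters)
    null_counts = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # keeps cluster sizes, breaks the family link
        null_counts.append(count(shuffled))
    expected = statistics.mean(null_counts)
    spread = statistics.pstdev(null_counts)
    if spread == 0:
        return 0.0  # no variation under shuffling; Z is undefined, report 0
    return (observed - expected) / spread

# Made-up example: all ten nebulin proteins sit in cluster 1.
families = ["nebulin"] * 10 + ["other"] * 30
clusters = [1] * 10 + [2] * 30

z_over = enrichment_z(families, clusters, "nebulin", 1)
z_under = enrichment_z(families, clusters, "nebulin", 2)
```

Here `z_over` is strongly positive and `z_under` strongly negative, corresponding to the green and red cells of the panel.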
Figure 8. ACVs for All Proteins in Each of the 20 Clusters in Figure 7
The number of proteins in each cluster is indicated after the cluster number.
