Mol Biol Evol. 2016 Mar;33(3):783-99.
doi: 10.1093/molbev/msv271. Epub 2015 Nov 25.

Intein Clustering Suggests Functional Importance in Different Domains of Life

Olga Novikova et al. Mol Biol Evol. 2016 Mar.

Abstract

Inteins, also called protein introns, are self-splicing mobile elements found in all domains of life. A bioinformatic survey of genomic data highlights a biased distribution of inteins among functional categories of proteins in both bacteria and archaea, with a strong preference for a single network of functions containing replisome proteins. Many nonorthologous, functionally equivalent replicative proteins in bacteria and archaea carry inteins, suggesting a selective retention of inteins in proteins of particular functions across domains of life. Inteins cluster not only in proteins with related roles but also in specific functional units of those proteins, like ATPase domains. This peculiar bias does not fully fit the models describing inteins exclusively as parasitic elements. In such models, evolutionary dynamics of inteins is viewed primarily through their mobility with the intein homing endonuclease (HEN) as the major factor of intein acquisition and loss. Although the HEN is essential for intein invasion and spread in populations, HEN dynamics does not explain the observed biased distribution of inteins among proteins in specific functional categories. We propose that the protein splicing domain of the intein can act as an environmental sensor that adapts to a particular niche and could increase the chance of the intein becoming fixed in a population. We argue that selective retention of some inteins might be beneficial under certain environmental stresses, to act as panic buttons that reversibly inhibit specific networks, consistent with the observed intein distribution.

Keywords: ATPases; Clusters of Orthologous Groups; evolution; replicative helicase; replisome.

Figures

Fig. 1.
Distribution of intein-containing proteins is sporadic. (A) Summary of intein mining. Total number of genomes analyzed, number and fraction of genomes with inteins, and total number of inteins found are indicated. In the present study, the species and a reference genome for all the strains of a species were defined following the NCBI RefSeq microbial genome collection procedure (Tatusova et al. 2015). (B) Schematic evolutionary tree for some bacterial and archaeal clades, and list of eukaryal clades. The three domains of life are indicated on the left. Results are from intein mining of genomic sequences. Horizontal bars represent the number of genomes either with (red, blue, or green) or without (black) inteins. The pie charts next to the taxon names indicate the fraction of genomes with inteins for groups with large numbers of species. The bacterial and archaeal evolutionary tree was reproduced, with modifications, from “The All-Species Living Tree” Project (Yarza et al. 2008). Not all intein-containing bacterial clades are shown. The full list of species and their taxonomy is available in supplementary table S1 (bacteria) and table S2 (archaea), Supplementary Material online.
Fig. 2.
Distribution of inteins does not correlate with genome size. (A) Distribution of the genome sizes and number of inteins in three bacterial clades. The clades are Proteobacteria (1016 species), Actinobacteria (527 species), and Cyanobacteria (130 species). (B) Distribution of the genome sizes and number of inteins in Euryarchaeota (190 species). For (A) and (B), vertical axis of plots represents distribution of the genome sizes (gray) and number of inteins (red and blue) in corresponding species on the horizontal axis. A representative species, genome size, and number of inteins are indicated for each group. No strong correlation between genome size and number of inteins was found, as indicated by correlation coefficient (r) for each group. Distribution of the coding sequences and frequencies of inteins (number of inteins per 1,000 coding sequences) is available in supplementary figure S2, Supplementary Material online. Correlation coefficients are provided in supplementary table S4, Supplementary Material online.
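The two quantities named in this caption, the correlation coefficient (r) and the number of inteins per 1,000 coding sequences, can be illustrated with a minimal sketch. It assumes that r is the Pearson coefficient, which the caption does not state, and the function names are hypothetical rather than taken from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Assumption: the caption's "correlation coefficient (r)" is Pearson's r.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def inteins_per_1000_cds(n_inteins, n_cds):
    """Intein frequency normalized to 1,000 coding sequences."""
    if n_cds <= 0:
        raise ValueError("number of coding sequences must be positive")
    return 1000.0 * n_inteins / n_cds
```

With per-species lists of genome sizes and intein counts as input, a value of r near zero would correspond to the absence of a strong correlation described above.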
Fig. 3.
Functional genomics of intein-containing proteins. (A) Dominant functional categories of proteins with inteins based on Clusters of Orthologous Groups (COGs). COG annotation is for bacteria (red bars) and archaea (blue bars). The frequency of proteins with inteins/COGs is shown above and the frequency of proteins/COGs for each functional category within randomized data sets of proteins from bacteria and archaea is shown below. Frequency of intein-containing proteins in the top functional categories is indicated next to arrows. Functional categories L (replication, recombination and repair) and F (nucleotide transport and metabolism) are dominant among intein-containing proteins in both bacteria and archaea. Functional categories are designated based on conventional classification (Tatusov et al. ; Tatusov et al. ; Galperin et al. 2015) and are as follows: J, translation, ribosomal structure and biogenesis; K, transcription; D, cell cycle control, cell division, chromosome partitioning; M, cell wall/membrane/envelope biogenesis; N, cell motility; O, post-translational modification, protein turnover, chaperones; P, inorganic ion transport and metabolism; T, signal transduction mechanisms; C, energy production and conversion; E, amino acid transport and metabolism; G, carbohydrate transport and metabolism; H, coenzyme transport and metabolism; I, lipid transport and metabolism; Q, secondary metabolites biosynthesis, transport and catabolism; R, general function prediction only; S, function unknown; U, intracellular trafficking, secretion, and vesicular transport; V, defense mechanisms; W, extracellular structures; X, mobilome: prophage, transposons. (B) GO enrichment analysis for bacterial and archaeal intein-containing proteins. GO enrichment of 1,047 bacterial (red) and 502 archaeal (blue) intein-containing proteins was performed using WEGO (Ye et al. 2006). Enriched GO terms in binding and molecular function are shown. DNA and ATP binding as well as ATPase activities are the dominant GO terms among the intein-containing proteins from both bacteria and archaea. The percentage of the associated proteins is indicated on the top for dominant categories. CoF, cofactor; Me, metal clusters; Pr, protein; Ox/Red, oxidoreductase; Trans, transferase; Iso, isomerase; Lig, ligase; DA, deaminase.
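The comparison in panel (A), category frequencies among intein-containing proteins against frequencies in a randomized background set, can be sketched as below. This illustrates the idea only and is not the authors' pipeline: the function names are hypothetical, and expressing the comparison as a simple frequency ratio is an assumption.

```python
from collections import Counter


def category_frequencies(categories):
    """Map each COG category letter to its fraction of the input list."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no category labels supplied")
    return {cat: n / total for cat, n in counts.items()}


def fold_enrichment(observed, background):
    """Ratio of observed to background frequency for each category.

    Categories missing from the background (or with zero background
    frequency) are skipped because the ratio is undefined for them.
    """
    return {
        cat: freq / background[cat]
        for cat, freq in observed.items()
        if background.get(cat, 0) > 0
    }
```

Given one COG letter per intein-containing protein and one per background protein, categories with a ratio well above 1 would be the over-represented ones, as the caption reports for L and F.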
Fig. 4.
Intein-containing proteins are often members of the same complexes and networks. (A) Top 15 bacterial intein-containing proteins, their interactions, and intein distribution. Proteins of the DNA replication fork are boxed at the center of the network. The network was reconstructed using the STRING database of known and predicted protein interactions (http://string-db.org/; Szklarczyk et al. 2015). The list of the proteins, their full names, and description is available in table 1. Critical proteins with no inteins are shown in gray as follows: PolIIIβ, DNA polymerase III beta subunit; RimO, 2-methylthioadenine synthetase. The heatmap reflects distribution of the inteins among listed proteins (top) of four bacterial clades (side): Actinobacteria (Actino), Bacteroidetes (Bacter), Cyanobacteria (Cyano), and Proteobacteria (Proteo). For full network and list of the proteins, see supplementary figure S3 and table S5, Supplementary Material online. (B) Top 15 archaeal intein-containing proteins, their interactions, and intein distribution. Proteins of the DNA replication fork are boxed at the center of the network. The network was reconstructed as in panel (A) and the proteins are listed in table 2. Critical proteins with no inteins are shown in gray as follows: PCNA, proliferating cell nuclear antigen or DNA clamp; Pri, primase. The heatmap is displayed as in (A) with three groups indicated on the side: Euryarchaeota (Eury), Crenarchaeota (Cren), and other archaea (Other). For full network and list of the proteins, see supplementary figure S4 and table S6, Supplementary Material online.
Fig. 5.
Replicative helicases DnaB and MCM are the most common intein-containing proteins. (A) Distribution of inteins in DnaB and MCM. Phylogenetic trees for bacterial replicative helicase DnaB and archaeal helicase MCM were reconstructed based on the extein amino acid sequences (ATPase domain) using the ML algorithm with the WAG (Whelan and Goldman)+G+I models; 50 representatives covering major bacterial and archaeal diversity were chosen among both DnaB and MCM proteins for tree reconstructions. Statistical support for the tree was evaluated by the nonparametric version of approximate likelihood-ratio test (SH-aLRT); however, only values for critical nodes, which were higher than 85%, are shown. The intein insertion point(s) a–e and abbreviated species names are shown next to branches. Letters for insertion points in DnaB and MCM do not correspond to each other (see B). The trees with full-length species names are available in supplementary figures S5 and S7, Supplementary Material online; the trees reconstructed based on the extended data sets including intein-containing and intein-less proteins are available in supplementary figures S6 and S8, Supplementary Material online. Although DnaB and MCM are functionally equivalent counterparts in bacteria and archaea, these proteins are only distantly related. Bacterial clades are as follows: Cyano, Cyanobacteria; Chlorofl, Chloroflexi; Firmi, Firmicutes; Deino, Deinococcus–Thermus; Aquif, Aquificae; Bacter, Bacteroidetes; Ignavi, Ignavibacteriae; Gemmati, Gemmatimonadetes; Proteo, Proteobacteria; Actino, Actinobacteria. Archaeal clades are as follows: Halo, Halobacteria; Methm, Methanomicrobia; Aglob, Archaeoglobi; Methb, Methanobacteria; Thermo, Thermococci; Cren, Crenarchaeota. (B) Intein insertion points. Intein locations are shown along DnaB and MCM relative to structural and functional domains. The ATPase domain has multiple intein insertion points in both DnaB and MCM. The important conserved structural motifs within the ATPase domain are shown on the bottom for each protein. The insertion point b in DnaB (red) and insertion point a in MCM (blue) are in functionally equivalent motifs which correspond to P-loops in ATPase. Three other structurally and functionally important motifs are shown for DnaB (H2–H4). Other motifs shown for MCM are: WB, Walker B motif; H2I, β–α–β insert; PS1BH, presensor 1 β-hairpin. MCM protein also carries a nucleic acid-binding domain at the N-terminus (NA-binding domain). (C) Structure models of ATPase domain of DnaB and MCM. Phyre2 models of the ATPase domain are shown (Dte DnaB residues 176–456; Nmo MCM residues 260–696). Dte DnaB (red) has inteins at insertion points a and b, with the T + 1 residues at the intein insertion sites highlighted as gray spheres.
Fig. 6.
Inteins in bacterial and archaeal clamp loaders and DNA polymerases cluster in functional domains. (A) Distribution of inteins in PolIIIγ and RFC-S. The phylogenetic tree for the ATPase domain of the clamp loader proteins from both bacteria and archaea was reconstructed based on the amino acid sequences using the ML algorithm with the WAG model. Statistical support for the tree was evaluated with SH-aLRT; however, only values for critical nodes, which were higher than 85%, are shown. The intein insertion point(s) a–d and abbreviated species names are shown next to branches. Letters for insertion points in PolIIIγ and RFC-S do not correspond to each other (see B and C). The tree with full-length species names is available in supplementary figure S9, Supplementary Material online; the trees reconstructed based on the extended data sets including intein-containing and intein-less proteins are available in supplementary figures S10 and S11, Supplementary Material online. PolIIIγ inteins were found only in Cyanobacteria (Cyano). Archaeal clades are as follows: Thermo, Thermococci; Methc, Methanococci; Methpyr, Methanopyri; Nanoh, Nanohaloarchaeota; Halo, Halobacteria; Aglob, Archaeoglobi. (B) Intein insertion points. Intein locations are shown along PolIIIγ and RFC-S relative to structural and functional domains. The ATPase domain (AAA+ ATPase, black) has a single intein insertion in PolIIIγ (site a shown in red) and multiple intein insertion points in RFC-S (sites a–d). The insertion point a in PolIIIγ is located in the highly conserved Walker B motif (WB). The most common insertion point a in RFC-S (blue) is located in the P-loop. Other motifs shown for RFC-S are: Glu-S, glutamine switch; S1, sensor one. PolIIIγ and RFC-S proteins have additional domains specific for respective proteins: DNA_pol3_gamma3 domain (pink) is found only in PolIIIγ, whereas Rep_fac_C domain (light blue) is present only in RFC-S proteins. (C) Phylogenetic analysis of the C1 inteins from PolIIIγ and RFC-S. The phylogenetic tree was reconstructed based on the intein splicing domain amino acid sequences using the ML algorithm with the WAG model. Statistical support for the tree was evaluated with SH-aLRT; however, only values for critical nodes, which were higher than 85%, are shown. Only inteins with cysteine as the first amino acid residue (C1 inteins) were used, which included all inteins identified in PolIIIγ (insertion point a, red), and inteins from insertion points a (blue), c, and d from RFC-S. The intein insertion point(s) a, c, d and abbreviated species names are shown next to branches. The intein insertion point(s) are also indicated in the nodes. The tree with full-length species names is available in supplementary figure S12, Supplementary Material online. (D) Distribution of inteins in PolIIIα and PolB. Phylogenetic trees for bacterial replicative DNA polymerase PolIIIα and archaeal PolB were reconstructed based on the extein amino acid sequences using the ML algorithm with the WAG model. Statistical support was evaluated with SH-aLRT; however, only values for critical nodes, which were higher than 85%, are shown. The intein insertion point(s) a–f and abbreviated species names are shown next to branches. Letters for insertion points in PolIIIα and PolB do not correspond to each other (see E). The full-length trees with full-length species names are available in supplementary figures S13 and S14, Supplementary Material online. Although PolIIIα and PolB are functionally equivalent counterparts in bacteria and archaea, these proteins are not related. Bacterial clades are as follows: Cyano, Cyanobacteria; Actino, Actinobacteria; Bacter, Bacteroidetes; Deino, Deinococcus–Thermus; Acido, Acidobacteria; Plancto, Planctomycetes; Proteo, Proteobacteria; Aquif, Aquificae; and Firmi, Firmicutes. Archaeal clades are as follows: Halo, Halobacteria; Nanoh, Nanohaloarchaeota; Methc, Methanococci; Thermo, Thermococci. (E) Intein insertion points. Intein locations are shown along PolIIIα and PolB relative to structural and functional domains. The critical catalytic domains have multiple intein insertion points in both PolIIIα (pol3_alpha) and PolB (POLBc). Additional insertion points were found in bacterial PHP (polymerase and histidinol phosphatase domain) for PolIIIα and in archaeal 3′–5′ exo (3′–5′ exonuclease domain of archaeal family-B DNA polymerases) for PolB. Polymerase structural domains are shown on the bottom. PolIIIα inteins from insertion point a are split. Additional abbreviations: HhH, helix-hairpin-helix DNA-binding domain; OBF, (oligonucleotide/oligosaccharide binding)-fold.
