Nature. 2016 Oct 27;538(7626):533-536.

doi: 10.1038/nature20110. Epub 2016 Oct 19.

Mechanism for DNA transposons to generate introns on genomic scales

Jason T Huff¹ ², Daniel Zilberman¹, Scott W Roy³

Affiliations

¹ Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA.
² California Institute for Quantitative Biosciences, University of California, Berkeley, California 94720, USA.
³ Department of Biology, San Francisco State University, San Francisco, California 94132, USA.

PMID: 27760113
PMCID: PMC5684705
DOI: 10.1038/nature20110

Jason T Huff et al. Nature. 2016.

Abstract

The discovery of introns four decades ago was one of the most unexpected findings in molecular biology. Introns are sequences interrupting genes that must be removed as part of messenger RNA production. Genome sequencing projects have shown that most eukaryotic genes contain at least one intron, and frequently many. Comparison of these genomes reveals a history of long evolutionary periods during which few introns were gained, punctuated by episodes of rapid, extensive gain. However, although several detailed mechanisms for such episodic intron generation have been proposed, none has been empirically supported on a genomic scale. Here we show how short, non-autonomous DNA transposons independently generated hundreds to thousands of introns in the prasinophyte Micromonas pusilla and the pelagophyte Aureococcus anophagefferens. Each transposon carries one splice site. The other splice site is co-opted from the gene sequence that is duplicated upon transposon insertion, allowing perfect splicing out of the RNA. The distributions of sequences that can be co-opted are biased with respect to codons, and phasing of transposon-generated introns is similarly biased. These transposons insert between pre-existing nucleosomes, so that multiple nearby insertions generate nucleosome-sized intervening segments. Thus, transposon insertion and sequence co-option may explain the intron phase biases and prevalence of nucleosome-sized exons observed in eukaryotes. Overall, the two independent examples of proliferating elements illustrate a general DNA transposon mechanism that can plausibly account for episodes of rapid, extensive intron gain during eukaryotic evolution.

Figures

**Extended Data Figure 1. *M. pusilla* IEs are in phase with nucleosome linker DNA, even without methylation**
Unmethylated regions (indicated by the line with arrowheads) are defined as containing no base positions with fractional methylation 0.5 or greater in a window starting from 50 bp upstream of the 5′ end of the IE intron and continuing 234 bp downstream, which is 50 bp beyond the predominant *M. pusilla* IE intron size of 184 bp (Fig. 2a). Mean values at each base position are shown for chromatin maps aligned to the subset (7%) of IE introns residing in unmethylated regions (dark gray and dark blue for nucleosome centers and DNA methylation, respectively), compared with alignment to all IE introns (light gray and light blue; same data as in Fig. 1b for IE introns). Conversely, to assess whether IEs could be in phase with methylated regions that are not also nucleosome linkers, we looked for IEs that had both ends in methylated DNA regions but not in nucleosome linkers, which gave 35 potential candidates (1% of IEs). Manual inspection revealed that 34 of the 35 nonetheless appear to have ends in nucleosome linkers that were simply missed by the filtering criteria we used for calling linkers. This leaves 1 candidate, indicating little evidence that methylated DNA regions that are not also nucleosome linkers are found at IE ends. Taken together, unmethylated nucleosome linkers could be the primary determinant of IE insertion in at least some cases, whereas we find virtually no evidence that methylated regions could be the primary determinant of IE insertion without also being nucleosome linkers.
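
The window rule in this caption can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name, the per-base list of fractional methylation, the forward-strand orientation, and the half-open window bounds are all assumptions made for the example.

```python
from typing import Sequence


def is_unmethylated_region(
    frac_meth: Sequence[float],
    intron_5p: int,
    upstream: int = 50,
    downstream: int = 234,
    threshold: float = 0.5,
) -> bool:
    """Return True if no base in the window reaches the methylation threshold.

    frac_meth  -- hypothetical per-base fractional methylation (0.0 to 1.0)
    intron_5p  -- 0-based index of the 5' end of the IE intron (forward strand assumed)
    The window runs from `upstream` bp before the 5' end to `downstream` bp after it;
    treating it as half-open [start, end) is an assumption of this sketch.
    """
    start = max(0, intron_5p - upstream)
    end = min(len(frac_meth), intron_5p + downstream)
    return all(value < threshold for value in frac_meth[start:end])
```

With the defaults, the window extends 50 bp past a 184 bp intron, matching the 234 bp figure quoted in the caption.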

**Extended Data Figure 2. *A. anophagefferens* IEs insert into preexisting nucleosome linkers**
a, IE introns are generally in phase with nucleosome positions, whereas other introns are not. DNA methylation was aligned to the 5′ ends of IE introns (dark blue) or other introns (light blue). We did not generate nucleosome data previously for *A. anophagefferens* but DNA methylation is a reliable indicator of linker locations. b, IEs are in phase with the starts of genes, indicating insertion between preexisting nucleosomes. The 5′ ends of IE introns and DNA methylation were aligned to gene starts. A kernel density estimate of IE ends is displayed with peaks marked by vertical broken lines.

**Extended Data Figure 3. Target site duplications (TSDs) at IE introns**
a and c, Intron sequences contain directly repeated sequences at their ends. Each *A. anophagefferens* (a) and *M. pusilla* (c) intron 5′ and 3′ end is directly aligned in each possible offset from -10 to 10 bp apart. Positions relative to the 5′ splice site from 10 bp upstream to 10 bp downstream are shown. IE introns are shown at left and other regular non-IE introns are in center, and the differences of subtracting the identity percentages of other introns from those of IE introns are at right. Each panel is separated by a vertical black line and a diagonally stepped black line to delineate different regions: the upper left region represents alignment of upstream exon versus 3′ intron end sequence; the upper right represents 5′ intron end versus 3′ intron end; the lower right represents 5′ intron end versus downstream exon; and the lower left represents upstream exon versus downstream exon. The red arrowheads at right indicate the offset with maximum average identity (0 in both cases). The red boxes in the right panels highlight the identified TSD length and position (see Supplementary Discussion). b and d, An example of an aligned 5′ (above) and 3′ (below) intron end of an IE for the offset with maximum identity is shown in (b) for *A. anophagefferens* and (d) for *M. pusilla*. Exonic sequence is uppercase and boxed; intronic is lowercase. Vertical lines show identities that are part of at least an identical 2-mer with the red lines corresponding to the boxed regions in panels a and c.
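
A simplified version of the offset scan described in panels a and c is sketched below for a single intron. It reports one average identity per offset rather than the per-position matrices shown in the figure, and the function name, coordinate convention, and inputs are assumptions for illustration only.

```python
def end_identity_by_offset(seq, intron_start, intron_end, flank=10, max_offset=10):
    """Compare the region around the 5' intron end with the region around the 3' end.

    seq          -- genomic sequence containing the intron
    intron_start -- 0-based index of the first intron base
    intron_end   -- 0-based index of the first base after the intron
    Returns {offset: fraction of compared positions that are identical}.
    """
    result = {}
    for offset in range(-max_offset, max_offset + 1):
        matches = 0
        compared = 0
        for rel in range(-flank, flank):
            i = intron_start + rel          # position near the 5' end
            j = intron_end + rel + offset   # shifted position near the 3' end
            if 0 <= i < len(seq) and 0 <= j < len(seq):
                compared += 1
                if seq[i].upper() == seq[j].upper():
                    matches += 1
        result[offset] = matches / compared if compared else 0.0
    return result
```

For an element flanked by a direct duplication, the identity is expected to peak at the offset that brings the two copies into register, which the caption reports as 0 for both organisms.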

**Extended Data Figure 4. Terminal inverted repeats (TIRs) in IE introns**
a and c, Intron end sequences contain inverted repeats. Each *A. anophagefferens* (a) and *M. pusilla* (c) intron 5′ and reverse of the 3′ end is aligned in each possible offset from -30 to 30 bp apart. Positions relative to the 5′ splice site from 30 bp upstream to 30 bp downstream are shown. IE introns are shown at left and other regular non-IE introns are at right. In each panel the upper left region represents upstream exon versus downstream exon sequence, the upper right represents 5′ intron end versus downstream exon, the lower right represents 5′ intron end versus 3′ intron end, and the lower left represents upstream exon versus 3′ intron end. The red arrowheads at right indicate the offset with maximum average complementarity. b and d, An example of an aligned 5′ (top) and 3′ (bottom, reversed so that it is 3′ to 5′) end of an IE intron for the offset with maximum complementarity is shown in (b) for *A. anophagefferens* (offset of +8) and (d) for *M. pusilla* (offset of -5). Exonic sequence is uppercase and boxed; intronic is lowercase. Vertical lines show complementarities that are part of at least an identical 2-mer.
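
The idea of checking for inverted repeats can be sketched as follows. This is a minimal example that works on the element sequence alone and ignores the offset scan relative to the splice site used in the figure; the function names and the choice of `k` are assumptions.

```python
# Translation table for complementing DNA bases (case preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_matches(element, k):
    """Count how many of the first k bases pair with the mirrored bases at the 3' end.

    A perfect terminal inverted repeat of length k gives k matches, because the
    reverse complement of the last k bases then equals the first k bases.
    """
    left = element[:k].upper()
    right_rc = reverse_complement(element[-k:]).upper()
    return sum(a == b for a, b in zip(left, right_rc))
```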

**Extended Data Figure 5. Intron gain templated by nucleosomes and co-opted sequences**
Model for intron generation by IEs acting as short non-autonomous DNA transposons that carry a splice site and insert between nucleosomes with co-option of the other splice site sequence.

**Extended Data Figure 6. Diploid genomic sequence variation in a more recent isolate of *A. anophagefferens***
a, Calling of sequence variation from genomic sequencing reads without an assumption of ploidy reveals a peak at alternate allele fraction of approximately 0.5. The most likely scenario is that this *A. anophagefferens* isolate has a diploid genome. It is not physically plausible for it to have higher ploidy because that amount of chromatin could not fit into its extremely compact nucleus. b, An example reference IE is present within one allele and absent within the alternate allele. The locus is displayed as in Fig. 3a. The reference IE is located in an annotated protein-coding gene with a 200 bp RNA sequencing-validated intron in the reference isolate. The alternate allele is likely exonic without an intron (broken lines), so that it encodes the same amino acid sequence. The TSD within the reference allele is 8 bp, immediately flanking the IE TIRs. c, An example IE not found within the reference allele is present within the alternate allele. The locus is displayed as in Fig. 3a. The alternate IE is within an annotated protein-coding gene with a predicted 200 bp intron (broken lines). If the predicted intron is indeed spliced out of the RNA, then the alternate allele encodes the same amino acid sequence. The TSD within the alternate allele is 8 bp, immediately flanking the IE TIRs.
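
The reasoning in panel a rests on the alternate allele fraction at variable sites. A small sketch, assuming hypothetical per-site read counts for the reference and alternate alleles, shows why a peak near 0.5 points to a diploid genome: heterozygous sites in an organism of ploidy n are expected at fractions k/n.

```python
def alt_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads at a site that support the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def expected_het_fractions(ploidy):
    """Alternate allele fractions expected at heterozygous sites for a given ploidy."""
    return [k / ploidy for k in range(1, ploidy)]
```

For example, `expected_het_fractions(2)` gives a single value of 0.5, whereas a tetraploid would also show sites near 0.25 and 0.75.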

**Extended Data Figure 7. Splice site sequences**
Logos for the 10 bp upstream and downstream of 5′ and 3′ splice sites for IE and other introns are shown for each organism. The rectangles show exonic positions. The core splice sites are GY (Y is C or T) and AG, respectively. IEs combined with co-opted exonic sequence that is duplicated (Fig. 3) to generate particular sequences that extend beyond the core sites (bracketed). Specifically, this results in a predominance of AG|GY sequences (“|” denotes the position of splicing that ultimately occurs) at 5′ splice sites in *M. pusilla* IE introns and 3′ splice sites in *A. anophagefferens* IE introns. Similar respective sequences are observed in other introns in each organism: G|GT for *M. pusilla* 5′ splice sites and AG|G for *A. anophagefferens* 3′ splice sites. In non-IE introns, these sequences have been under selection for long periods of time to promote RNA splicing, revealing the sequences extending beyond core sites that probably contribute to optimal splicing in each organism. The similarity of IE intron splice sites to other intron splice sites thus suggests that IEs in each organism generate new introns that are spliced reasonably well.

**Extended Data Figure 8. Most IEs are located in genes expressing low to average RNA levels**
Distributions of detectable RNA levels of all transcripts (black) and only those containing at least 1 IE (green) are shown as measured by RNA sequencing. Box plots indicate the median and the 1st and 3rd quartiles, with whiskers extending up to data 1.5 times the interquartile range away from the box. For *M. pusilla*, IE-containing gene expression does not significantly differ from that of all genes, P=0.59. For *A. anophagefferens*, IE-containing gene expression is slightly lower than that of all genes, P=0.041.
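
The box plot convention described here can be sketched as below. The quartile interpolation method is not stated in the caption, so the use of Python's `inclusive` method is an assumption, as is the function name.

```python
import statistics


def box_plot_summary(values):
    """Median, quartiles, and whisker ends using the 1.5 x IQR convention."""
    data = sorted(values)
    # Quartile method is an assumption; plotting libraries differ slightly here.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme data points that still lie within the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
    }
```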

**Figure 1. *M. pusilla* IEs insert between preexisting nucleosomes**
a, Each IE contains a nucleosome with ends in linker DNA, which is specifically marked by methylation in this organism. Validated introns and chromatin data are displayed. *HEME1* contains 2 IEs (green). b, IE introns are generally in phase with nucleosome positions, whereas other introns are not. Chromatin maps are aligned to 5′ IE intron ends (dark lines) or other intron ends (light lines). c, IEs are in phase with the starts of genes, indicating insertion between preexisting nucleosomes. Chromatin maps and 5′ IE ends are aligned to gene starts. A kernel density estimate of IE ends is shown with peaks marked.

**Figure 2. Identification of IEs in *A. anophagefferens***
a, Validated lengths for IE (blue) and other (gray) introns. b, *A. anophagefferens* IEs share sequence similarity in intronic, not in neighboring exonic sequence. Six example IEs contain regions with maximal pairwise identities from 96 to 100%. Base positions identical in at least 5 of the 6 sequences are green. c, Most *A. anophagefferens* IEs can be aligned to form one or more related groups. Nodes present in >50% of 1,000 bootstraps are indicated with black dots on the ML tree. IEs are found in either orientation with respect to the intron (orange and blue). Many elements carry 3′ splice sites in both orientations (black lines at right).

**Figure 3. IEs are DNA transposons that carry a splice site and co-opt the other**
a, IEs (green) exhibit hallmarks of DNA transposons. Direct duplications (bold; target site duplications, TSDs) of 8 bp and 3 bp particular to *A. anophagefferens* and *M. pusilla* IEs, respectively, are adjacent to the ends. Inverted repeats (underlined) are at IE ends (terminal inverted repeats, TIRs). b, IEs carry one splice site and co-opt the other. Logos for the ends of the most abundant intron size classes are shown: 200 bp for *A. anophagefferens* and 184 bp for *M. pusilla*. In *A. anophagefferens* the 5′ splice site (bracketed) is constructed from a TSD (gene sequence before duplication), and the 3′ splice site (underlined) is carried in a transposon TIR. In *M. pusilla* the 5′ splice site (underlined) is carried in a transposon TIR and the 3′ splice site (bracketed) is constructed from a TSD.
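
The mechanism in this caption can be illustrated with a toy model. The sketch below assumes that insertion yields TSD + element + TSD, that in the *A. anophagefferens*-like case the intron spans the first TSD copy plus the element, and that in the *M. pusilla*-like case it spans the element plus the second TSD copy. These boundaries, the function names, and the example sequences are assumptions chosen so that one splice site lies in the duplicated gene sequence and the other in the element.

```python
def insert_element(gene, pos, tsd_len, element):
    """Insert an element at pos, duplicating the tsd_len target bases on both sides."""
    tsd = gene[pos:pos + tsd_len]
    return gene[:pos] + tsd + element + tsd + gene[pos + tsd_len:]


def splice_out(inserted, pos, tsd_len, element_len, co_opted):
    """Remove the intron and return (spliced sequence, intron).

    co_opted="5prime": intron = first TSD copy + element (5' site from the TSD).
    co_opted="3prime": intron = element + second TSD copy (3' site from the TSD).
    """
    if co_opted == "5prime":
        start = pos
    elif co_opted == "3prime":
        start = pos + tsd_len
    else:
        raise ValueError("co_opted must be '5prime' or '3prime'")
    end = start + tsd_len + element_len
    return inserted[:start] + inserted[end:], inserted[start:end]


# Toy example: 8 bp target site beginning with GT, element ending in ag.
gene = "ATGGCCGTACCGGATTTGA"
element = "cactgttttaaaacag"
inserted = insert_element(gene, 6, 8, element)
spliced, intron = splice_out(inserted, 6, 8, len(element), "5prime")
```

In this toy case the spliced sequence equals the original gene, which mirrors the "perfect splicing" described in the abstract.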

**Figure 4. IE dynamics and genomic implications**
a, Presence-absence variation in a newer isolate of *A. anophagefferens*. *Non-reference IEs identified cannot be absent/absent. b, Sequences that can be co-opted to construct splice sites are biased with respect to codon phasing. For *M. pusilla*, IE introns should be biased by availability of AG sequences that can be co-opted as 3′ splice sites (3′ss). For *A. anophagefferens*, IE introns should be biased by availability of GY (Y is C or T) sequences that can be co-opted for 5′ splice sites (5′ss). IE introns indeed have phase biases more similar to the respectively co-opted sequence (bold). c, Nearby IE insertions generate nucleosome-sized segments. Distances between neighboring IE introns (solid) and between other neighboring introns (broken) are displayed as kernel density estimates. Nucleosome repeat lengths of 206 bp for *M. pusilla* and 168 bp for *A. anophagefferens* show the expected sizes of integer numbers of nucleosomes (vertical lines).
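
Two quantities from panels b and c lend themselves to a short sketch: the codon phase at which co-optable dinucleotides occur, and the spacings expected for whole numbers of nucleosomes. The function names are hypothetical, and the choice of which side of the motif defines the intron position (`boundary`) is an assumption of this example rather than something the caption specifies.

```python
def motif_phase_counts(cds, motifs, boundary):
    """Count codon phases (0, 1, 2) of positions adjacent to dinucleotide motifs.

    cds      -- coding sequence starting at the first base of a codon
    motifs   -- set of dinucleotides, e.g. {"AG"} or {"GT", "GC"}
    boundary -- "after" if the intron would sit just 3' of the motif,
                "before" if it would sit just 5' of it (assumed convention)
    Phase is the number of coding bases preceding the boundary, modulo 3.
    """
    if boundary not in ("before", "after"):
        raise ValueError("boundary must be 'before' or 'after'")
    counts = {0: 0, 1: 0, 2: 0}
    seq = cds.upper()
    for i in range(len(seq) - 1):
        if seq[i:i + 2] in motifs:
            preceding = i + 2 if boundary == "after" else i
            counts[preceding % 3] += 1
    return counts


def expected_spacings(repeat_length, max_nucleosomes):
    """Distances expected for 1..max_nucleosomes nucleosome repeat lengths."""
    return [repeat_length * k for k in range(1, max_nucleosomes + 1)]
```

With the repeat lengths given in the caption, `expected_spacings(206, 3)` corresponds to the first three vertical lines for *M. pusilla* and `expected_spacings(168, 3)` to those for *A. anophagefferens*.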
