Comparative Study
Plant Physiol. 2013 Jan;161(1):210-24.
doi: 10.1104/pp.112.205245. Epub 2012 Nov 6.

Characteristics and significance of intergenic polyadenylated RNA transcription in Arabidopsis

Gaurav D Moghe et al. Plant Physiol. 2013 Jan.

Abstract

The Arabidopsis (Arabidopsis thaliana) genome is the most well-annotated plant genome. However, transcriptome sequencing in Arabidopsis continues to suggest the presence of polyadenylated (polyA) transcripts originating from presumed intergenic regions. It is not clear whether these transcripts represent novel noncoding or protein-coding genes. To understand the nature of intergenic polyA transcription, we first assessed its abundance using multiple messenger RNA sequencing data sets. We found 6,545 intergenic transcribed fragments (ITFs) occupying 3.6% of Arabidopsis intergenic space. In contrast to transcribed fragments that map to protein-coding and RNA genes, most ITFs are significantly shorter, are expressed at significantly lower levels, and tend to be more data set specific. A surprisingly large number of ITFs (32.1%) may be protein coding based on evidence of translation. However, our results indicate that these "translated" ITFs tend to be close to and are likely associated with known genes. To investigate if ITFs are under selection and are functional, we assessed ITF conservation through cross-species as well as within-species comparisons. Our analysis reveals that 237 ITFs, including 49 with translation evidence, are under strong selective constraint and relatively distant from annotated features. These ITFs are likely parts of novel genes. However, the selective pressure imposed on most ITFs is similar to that of randomly selected, untranscribed intergenic sequences. Our findings indicate that despite the prevalence of ITFs, apart from the possibility of genomic contamination, many may be background or noisy transcripts derived from "junk" DNA, whose production may be inherent to the process of transcription and which, on rare occasions, may act as catalysts for the creation of novel genes.

Figures

Figure 1.
Characteristics of Set 1 and Set 2 TxFrags. A and B, Length (A) and expression level distributions (B) of various genomic features, proteins (blue), RNA (green), pseudogenes (red), transposons (orange), and ITFs (black), based on Set 1 TxFrags identified across all eight RNA-seq data sets. Both axes are logarithmically scaled with base 10. To emphasize the lower peaks, curves beyond the black dashed line are truncated. C, Percentage of transposons considered expressed based on Set 1 TxFrags identified from eight data sets at various FPKM thresholds. For details of other features, see Supplemental Table S1. D and E, Length (D) and expression level distributions (E) for Set 2 TxFrags. F, Percentage of transposons considered expressed based on Set 2 TxFrags. G and H, Percentage of TxFrags defined as intergenic at different FPKM thresholds among data sets. Set 1 TxFrags (G) and Set 2 TxFrags (H) were identified as intergenic without an FPKM threshold (black) and at progressively more stringent FPKM thresholds according to transposon-based false-positive (FP) rates of 1%, 2%, 5%, 7%, and 10%. The x axis indicates the data sets used to identify TxFrags. The y axis represents the percentage of true-positive TxFrags that are intergenic at each FP threshold. Note that the percentage did not monotonically decrease because some TxFrags overlapping with annotated features also were filtered out when false-positive thresholds were applied.
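The caption does not spell out how an FPKM threshold is derived from a transposon-based false-positive rate. The sketch below shows one plausible reading, under the assumption that transposons serve as a negative set and that the FP rate is the fraction of transposons still called expressed at a given cutoff. The function name and the input list are hypothetical and are not taken from the paper.

```python
def fpkm_threshold_for_fp_rate(transposon_fpkm, fp_rate):
    """Return an FPKM cutoff t such that the fraction of transposons with
    FPKM strictly greater than t does not exceed fp_rate.

    Assumption (not from the paper): transposons are treated as a negative
    set, so any transposon above the cutoff counts as a false positive.
    """
    if not transposon_fpkm:
        raise ValueError("need at least one transposon FPKM value")
    if not 0.0 <= fp_rate <= 1.0:
        raise ValueError("fp_rate must be in [0, 1]")
    ranked = sorted(transposon_fpkm, reverse=True)
    # Number of transposons allowed above the cutoff; the small epsilon
    # guards against floating-point round-off in fp_rate * n.
    allowed = int(fp_rate * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return 0.0  # every transposon is allowed to pass
    # At most `allowed` values are strictly greater than ranked[allowed].
    return ranked[allowed]


# Toy values only: ten transposons, two with nonzero FPKM.
toy_fpkm = [0.0] * 8 + [1.0, 5.0]
cutoff = fpkm_threshold_for_fp_rate(toy_fpkm, 0.10)
kept = [v for v in toy_fpkm if v > cutoff]
```

A TxFrag would then be retained only if its FPKM exceeds the cutoff, which is why a stricter FP rate can also remove TxFrags that overlap annotated features, as the caption notes.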
Figure 2.
In vivo translation of predicted protein-coding sequences in transiently transformed tobacco leaf epidermal cells. A, Enhanced YFP (EYFP) is localized to the cytoplasm (orange arrow) and nucleus (white arrow). B and C, AT_3|+|1|14212973-14213269-EYFP (B) and AT_1|-|2|20126281-20126376-EYFP (C) have similar localization patterns. Nuclei are indicated by white arrows. In C, a series of 18 slices (1 µm each) was merged to highlight cytoplasmic strands (orange arrow). D and E, AT_1|-|2|5786755-5786853-EYFP appears to be vesicle localized (white arrow; D), similar to endoplasmic reticulum/Golgi marker ERD2-GFP (white arrow; E). F, AT_1|-|2|5786755-5786853-EYFP (recently annotated as At1g16916; red) does not colocalize with ERD2-GFP (green). G and H, AT_3|-|0|3663786-3663977-EYFP (G) and AT_3|+|2|4574607-4574900-EYFP (H) also have punctate expression patterns (white arrows). The orange arrow in G indicates potential aggregation of the AT_3|-|0|3663786-3663977-EYFP fusion protein. I, AT_1|+|1|11469497-11469754-EYFP appears to localize to the endoplasmic reticulum and nuclear envelope (orange arrow), similar to ERD2 (orange arrow in E). J to L, sORFs in a known noncoding small nucleolar RNA At1g12013 (J) and in an intron of a protein-coding gene, At1g43560 (K), are not translated based on a signal similar to a leaf infiltrated with A. tumefaciens lacking a fusion protein construct (L). Signal observed in K and L (white arrows) is likely due to cell damage. Bars = 10 µm. Names of all protein-coding sequences are as previously published (Hanada et al., 2007).
Figure 3.
Translation evidence, breadth, and distance of ITFs from neighboring genes. A, Percentage of features with overlapping translation evidence was calculated for protein-coding genes, RNA genes (excluding other RNA), pseudogenes, transposons, and ITFs obtained from the 7-d seedling and flower transcriptomes. Ribosome immunoprecipitation data are for AGAMOUS (AG), APETALA1 (AP1), AP3, flower, and 7-d seedling. Proteomics data are combined data from two studies. Only uniquely mapping R-TxFrags and proteomics tags were used as evidence. B, Breadth of expression (as indicated by the number of data sets where a feature can be found) of ITFs (black) and TxFrags mapped to protein-coding genes (blue), RNA genes (red), pseudogenes (green), and transposons (orange). CDS, Coding sequence. C, Distance distribution of ITFs to their nearest protein-coding genes. The box plots depict distance distributions between 10,000 sets of randomly sampled intergenic sequences and their nearest protein-coding genes. D, Percentage of translated ITFs over all ITFs in the same distance bin is shown as a function of distance to the nearest protein-coding gene. ITFs neighboring proteins with and without transcript evidence are represented by red and blue lines, respectively. Box plots represent the randomly expected proportions in each distance bin obtained by permuting the association between distance and presence/absence of translation evidence. The medians of random expectations are approximately 35%, because approximately 35% of ITFs have one or more pieces of translation evidence.
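The permutation described for panel D can be illustrated with a minimal sketch: translation-evidence labels are shuffled across ITFs while the distance bins stay fixed, and the per-bin translated fraction is recomputed each time. The function names, bin labels, and toy inputs are hypothetical; the number of permutations and the binning used by the authors are not stated in the caption.

```python
import random
from collections import defaultdict


def translated_fraction_by_bin(distance_bins, has_translation):
    """Fraction of ITFs with translation evidence in each distance bin."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bin_label, translated in zip(distance_bins, has_translation):
        totals[bin_label] += 1
        if translated:
            hits[bin_label] += 1
    return {b: hits[b] / totals[b] for b in totals}


def permuted_fractions(distance_bins, has_translation, n_perm=1000, seed=0):
    """Null distribution of per-bin fractions, obtained by shuffling the
    translation labels so that any link to distance is broken."""
    rng = random.Random(seed)
    labels = list(has_translation)
    null = defaultdict(list)
    for _ in range(n_perm):
        rng.shuffle(labels)
        fractions = translated_fraction_by_bin(distance_bins, labels)
        for bin_label, frac in fractions.items():
            null[bin_label].append(frac)
    return dict(null)


# Toy data only: two bins, translation evidence concentrated near genes.
bins = ["near", "near", "far", "far"]
evidence = [True, True, False, False]
observed = translated_fraction_by_bin(bins, evidence)
null = permuted_fractions(bins, evidence, n_perm=100, seed=1)
```

Because shuffling preserves the overall share of translated ITFs, the null fractions in every bin center on that share, which matches the caption's remark that the random medians sit near the overall proportion.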
Figure 4.
Evolutionary conservation of ITF sequences. A, Between-species nucleotide substitution rate distributions of different features and 4-fold degenerate sites (4x). CDS, Coding sequence. B, Substitution rates of ITFs compared with local substitution rates of 4x sites. 4x sites of up to 60 neighboring protein-coding genes were used to determine the distributions of local substitution rates. Black circles indicate medians of the distributions, gray lines define the interquartile ranges, and each orange or blue circle indicates the substitution rate of the ITF in the given region. The ITFs are arranged from low to high z scores. An orange circle indicates a significant z score at P < 0.05, while a blue circle indicates P ≥ 0.05. C to F, Heat maps indicating degree of cross-species similarity of ITFs with translation evidence (C), ITFs without translation evidence (D), 10,000 randomly selected TxFrags mapped to annotated protein-coding genes (E), and all TxFrags mapped to annotated RNAs (F). TxFrags mapping to proteins and annotated RNAs were chosen based on the size distribution of the ITFs. Each row represents a feature, and each column represents the subject species for similarity search. The expect (E) values were converted to a negative logarithmic scale and adjusted to be between 0 and 10, with 0 (blue) indicating E ≥ 1 and 10 (yellow) indicating E ≤ 1e-10.
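The heat-map scale described for panels C to F can be sketched as a clipped negative base-10 logarithm. This is a minimal illustration consistent with the stated endpoints (0 for E ≥ 1, 10 for E ≤ 1e-10); the function name is hypothetical, and treating a reported E value of exactly 0 as the maximum score is an assumption.

```python
import math


def evalue_to_score(evalue, cap=10.0):
    """Map an expect (E) value to a 0..cap similarity score.

    score = -log10(E), clipped to [0, cap]. An E value of 0 (as some
    search tools report for very strong hits) is assumed to map to cap.
    """
    if evalue < 0:
        raise ValueError("E value cannot be negative")
    if evalue == 0:
        return cap
    return min(cap, max(0.0, -math.log10(evalue)))


# Example E values spanning the scale, from no hit to a very strong hit.
scores = [evalue_to_score(e) for e in (5.0, 1.0, 1e-5, 1e-10, 1e-50, 0.0)]
```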
Figure 5.
Distribution of π values for genomic features. The π values were calculated using population genomic data of 80 Arabidopsis accessions. x-sp, Cross species. Random intergenic sequences were selected from regions without transcript support.
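For readers unfamiliar with π, the sketch below computes nucleotide diversity under its standard definition: the average number of pairwise differences per site among a set of aligned sequences. It assumes equal-length, gap-free alignments and does not handle missing data; the function name and toy sequences are hypothetical, and the authors' actual pipeline for the 80 accessions is not described in the caption.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise differences per site (pi) for aligned sequences.

    Assumes all sequences have the same nonzero length and contain no
    gaps or ambiguous bases (a simplification for illustration).
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned and non-empty")
    n_pairs = 0
    total_diffs = 0
    for a, b in combinations(sequences, 2):
        total_diffs += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    return total_diffs / (n_pairs * length)


# Toy alignment: three sequences, one variable site out of four.
pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])
```

Lower π for a feature class than for random untranscribed intergenic sequence would suggest stronger constraint within the species, which is the comparison the figure draws.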
