Comparative Study

Plant Physiol. 2013 Jan;161(1):210-24.

doi: 10.1104/pp.112.205245. Epub 2012 Nov 6.

Characteristics and significance of intergenic polyadenylated RNA transcription in Arabidopsis

Gaurav D Moghe¹, Melissa D Lehti-Shiu, Alex E Seddon, Shan Yin, Yani Chen, Piyada Juntawong, Federica Brandizzi, Julia Bailey-Serres, Shin-Han Shiu

PMID: 23132786
PMCID: PMC3532253
DOI: 10.1104/pp.112.205245

Gaurav D Moghe et al. Plant Physiol. 2013 Jan.

Affiliation

¹ Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824, USA.

Abstract

The Arabidopsis (Arabidopsis thaliana) genome is the most well-annotated plant genome. However, transcriptome sequencing in Arabidopsis continues to suggest the presence of polyadenylated (polyA) transcripts originating from presumed intergenic regions. It is not clear whether these transcripts represent novel noncoding or protein-coding genes. To understand the nature of intergenic polyA transcription, we first assessed its abundance using multiple messenger RNA sequencing data sets. We found 6,545 intergenic transcribed fragments (ITFs) occupying 3.6% of Arabidopsis intergenic space. In contrast to transcribed fragments that map to protein-coding and RNA genes, most ITFs are significantly shorter, are expressed at significantly lower levels, and tend to be more data set specific. A surprisingly large number of ITFs (32.1%) may be protein coding based on evidence of translation. However, our results indicate that these "translated" ITFs tend to be close to and are likely associated with known genes. To investigate if ITFs are under selection and are functional, we assessed ITF conservation through cross-species as well as within-species comparisons. Our analysis reveals that 237 ITFs, including 49 with translation evidence, are under strong selective constraint and relatively distant from annotated features. These ITFs are likely parts of novel genes. However, the selective pressure imposed on most ITFs is similar to that of randomly selected, untranscribed intergenic sequences. Our findings indicate that despite the prevalence of ITFs, apart from the possibility of genomic contamination, many may be background or noisy transcripts derived from "junk" DNA, whose production may be inherent to the process of transcription and which, on rare occasions, may act as catalysts for the creation of novel genes.

Figures

**Figure 1.**
Characteristics of Set 1 and Set 2 TxFrags. A and B, Length (A) and expression level distributions (B) of various genomic features, proteins (blue), RNA (green), pseudogenes (red), transposons (orange), and ITFs (black), based on Set 1 TxFrags identified across all eight RNA-seq data sets. Both axes are logarithmically scaled with base 10. To emphasize the lower peaks, curves beyond the black dashed line are truncated. C, Percentage of transposons considered expressed based on Set 1 TxFrags identified from eight data sets at various FPKM thresholds. For details of other features, see Supplemental Table S1. D and E, Length (D) and expression level distributions (E) for Set 2 TxFrags. F, Percentage of transposons considered expressed based on Set 2 TxFrags. G and H, Percentage of TxFrags defined as intergenic at different FPKM thresholds among data sets. Set 1 TxFrags (G) and Set 2 TxFrags (H) were identified as intergenic without an FPKM threshold (black) and at progressively more stringent FPKM thresholds according to transposon-based false-positive (FP) rates of 1%, 2%, 5%, 7%, and 10%. The x axis indicates the data sets used to identify TxFrags. The y axis represents the percentage of true-positive TxFrags that are intergenic at each FP threshold. Note that the percentage did not monotonically decrease because some TxFrags overlapping with annotated features also were filtered out when false-positive thresholds were applied.
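The transposon-based thresholding described for panels C and F to H can be illustrated with a minimal Python sketch. It assumes that a transposon counts as "expressed" (a false positive) when its FPKM exceeds the threshold, and that the threshold for a target FP rate is the lowest observed FPKM value at which the expressed fraction drops to that rate or below. The function name `fpkm_threshold` and the toy values are hypothetical; the authors' actual procedure may differ.

```python
def fpkm_threshold(transposon_fpkm, fp_rate):
    """Return the lowest FPKM cutoff at which the fraction of
    transposons still counted as expressed is <= fp_rate.

    Assumption: a transposon is "expressed" when its FPKM is
    strictly greater than the cutoff.
    """
    n = len(transposon_fpkm)
    if n == 0:
        raise ValueError("need at least one transposon FPKM value")
    # Candidate cutoffs are the observed FPKM values, lowest first.
    for cutoff in sorted(set(transposon_fpkm)):
        expressed = sum(1 for v in transposon_fpkm if v > cutoff)
        if expressed / n <= fp_rate:
            return cutoff
    # Unreachable: at the maximum value no transposon exceeds the cutoff.
    raise ValueError("no cutoff satisfies the requested FP rate")


# Toy data (hypothetical): 100 transposons, most with no expression.
toy_fpkm = [0.0] * 90 + [1.0] * 5 + [2.0] * 3 + [5.0, 10.0]
for rate in (0.01, 0.05, 0.10):
    print(rate, fpkm_threshold(toy_fpkm, rate))
```

A stricter FP rate yields a higher cutoff, which is why fewer TxFrags of every class survive at the 1% threshold than at the 10% threshold.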

**Figure 2.**
In vivo translation of predicted protein-coding sequences in transiently transformed tobacco leaf epidermal cells. A, Enhanced YFP (EYFP) is localized to the cytoplasm (orange arrow) and nucleus (white arrow). B and C, AT_3|+|1|14212973-14213269-EYFP (B) and AT_1|-|2|20126281-20126376-EYFP (C) have similar localization patterns. Nuclei are indicated by white arrows. In C, a series of 18 slices (1 µm each) was merged to highlight cytoplasmic strands (orange arrow). D and E, AT_1|-|2|5786755-5786853-EYFP appears to be vesicle localized (white arrow; D), similar to endoplasmic reticulum/Golgi marker ERD2-GFP (white arrow; E). F, AT_1|-|2|5786755-5786853-EYFP (recently annotated as At1g16916; red) does not colocalize with ERD2-GFP (green). G and H, AT_3|-|0|3663786-3663977-EYFP (G) and AT_3|+|2|4574607-4574900-EYFP (H) also have punctate expression patterns (white arrows). The orange arrow in G indicates potential aggregation of the AT_3|-|0|3663786-3663977-EYFP fusion protein. I, AT_1|+|1|11469497-11469754-EYFP appears to localize to the endoplasmic reticulum and nuclear envelope (orange arrow), similar to ERD2 (orange arrow in E). J to L, sORFs in a known noncoding small nucleolar RNA At1g12013 (J) and in an intron of a protein-coding gene, At1g43560 (K), are not translated based on a signal similar to a leaf infiltrated with *A. tumefaciens* lacking a fusion protein construct (L). Signal observed in K and L (white arrows) is likely due to cell damage. Bars = 10 µm. Names of all protein-coding sequences are as previously published (Hanada et al., 2007).

**Figure 3.**
Translation evidence, breadth, and distance of ITFs from neighboring genes. A, Percentage of features with overlapping translation evidence was calculated for protein-coding genes, RNA genes (excluding other RNA), pseudogenes, transposons, and ITFs obtained from the 7-d seedling and flower transcriptomes. Ribosome immunoprecipitation data are for *AGAMOUS* (AG), *APETALA1* (*AP1*), *AP3*, flower, and 7-d seedling. Proteomics data are combined data from two studies. Only uniquely mapping R-TxFrags and proteomics tags were used as evidence. B, Breadth of expression (as indicated by the number of data sets where a feature can be found) of ITFs (black) and TxFrags mapped to protein-coding genes (blue), RNA genes (red), pseudogenes (green), and transposons (orange). CDS, Coding sequence. C, Distance distribution of ITFs to their nearest protein-coding genes. The box plots depict distance distributions between 10,000 sets of randomly sampled intergenic sequences and their nearest protein-coding genes. D, Percentage of translated ITFs over all ITFs in the same distance bin is shown as a function of distance to the nearest protein-coding gene. ITFs neighboring proteins with and without transcript evidence are represented by red and blue lines, respectively. Box plots represent the randomly expected proportions in each distance bin obtained by permuting the association between distance and presence/absence of translation evidence. The medians of random expectations are approximately 35%, because approximately 35% of ITFs have one or more pieces of translation evidence.
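The random expectation in panel D, obtained by permuting the association between distance and translation evidence, can be sketched as follows. This is an illustrative outline only: the function name `permuted_bin_proportions`, the bin labels, and the toy counts are assumptions, not the authors' code.

```python
import random


def permuted_bin_proportions(bins, translated, n_perm=1000, seed=0):
    """Shuffle translation labels across ITFs and record, for each
    distance bin, the proportion of "translated" ITFs per permutation.

    bins       -- distance-bin label for each ITF
    translated -- True/False translation evidence for each ITF
    """
    rng = random.Random(seed)
    labels = list(translated)
    result = {b: [] for b in sorted(set(bins))}
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the distance/translation association
        counts = {b: [0, 0] for b in result}  # [translated, total]
        for b, lab in zip(bins, labels):
            counts[b][0] += int(lab)
            counts[b][1] += 1
        for b, (hits, total) in counts.items():
            result[b].append(hits / total)
    return result


# Toy example (hypothetical): 100 ITFs, 35% with translation evidence.
toy_bins = ["near"] * 50 + ["far"] * 50
toy_translated = [True] * 35 + [False] * 65
null = permuted_bin_proportions(toy_bins, toy_translated)
print(sum(null["near"]) / len(null["near"]))
```

Because shuffling preserves the overall fraction of translated ITFs, the permuted proportions in every bin centre on that overall fraction, which matches the caption's remark that the medians of the random expectations sit near the genome-wide proportion.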

**Figure 4.**
Evolutionary conservation of ITF sequences. A, Between-species nucleotide substitution rate distributions of different features and 4-fold degenerate sites (4x). CDS, Coding sequence. B, Substitution rates of ITFs compared with local substitution rates of 4x sites. 4x sites of up to 60 neighboring protein-coding genes were used to determine the distributions of local substitution rates. Black circles indicate medians of the distributions, gray lines define the interquartile ranges, and each orange or blue circle indicates the substitution rate of the ITF in the given region. The ITFs are arranged from low to high z scores. An orange circle indicates a significant z score at P < 0.05, while a blue circle indicates P ≥ 0.05. C to F, Heat maps indicating degree of cross-species similarity of ITFs with translation evidence (C), ITFs without translation evidence (D), 10,000 randomly selected TxFrags mapped to annotated protein-coding genes (E), and all TxFrags mapped to annotated RNAs (F). TxFrags mapping to proteins and annotated RNAs were chosen based on the size distribution of the ITFs. Each row represents a feature, and each column represents the subject species for similarity search. The expect (E) values were converted to a negative logarithmic scale and adjusted to be between 0 and 10, with 0 (blue) indicating E ≥ 1 and 10 (yellow) indicating E ≤ 1e-10.
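Two of the calculations in this caption lend themselves to a short sketch in Python. The first maps an E value onto the 0 to 10 heat-map scale as described for panels C to F. The second computes a z score for an ITF substitution rate against local 4x-site rates as in panel B; the caption does not give the exact formula, so the sketch assumes a standard z score with a one-sided normal P value. All function names and the toy numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def scaled_similarity(e_value):
    """Convert an E value to -log10(E), clamped to the range [0, 10].

    E >= 1 maps to 0 and E <= 1e-10 maps to 10.
    """
    if e_value <= 0:
        return 10.0  # an E value reported as 0 is treated as maximal similarity
    return max(0.0, min(10.0, -math.log10(e_value)))


def local_z_score(itf_rate, local_4x_rates):
    """Assumed form: z score of an ITF substitution rate relative to the
    rates at 4-fold degenerate sites of neighboring genes, with a
    one-sided P value for the ITF evolving more slowly than those sites.
    """
    z = (itf_rate - mean(local_4x_rates)) / stdev(local_4x_rates)
    p = NormalDist().cdf(z)
    return z, p


print(scaled_similarity(1e-5))
print(local_z_score(0.05, [0.10, 0.12, 0.14, 0.16, 0.18]))
```

A strongly negative z score means the ITF has accumulated fewer substitutions than nearby putatively neutral sites, which is the pattern expected under selective constraint.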

**Figure 5.**
Distribution of π values for genomic features. The π values were calculated using population genomic data of 80 Arabidopsis accessions. x-sp, Cross species. Random intergenic sequences were selected from regions without transcript support.
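For readers unfamiliar with π, nucleotide diversity is the average number of pairwise differences per site among sampled sequences. The following Python sketch shows that calculation for a small aligned region. It assumes equal-length, gap-free sequences and ignores missing data, which real population genomic pipelines must handle; the function name `nucleotide_diversity` and the toy sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site (π) across aligned sequences.

    Assumption: all sequences have the same length and contain no gaps
    or ambiguous bases.
    """
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    total_diffs = 0
    n_pairs = 0
    for a, b in combinations(aligned_seqs, 2):
        total_diffs += sum(1 for x, y in zip(a, b) if x != y)
        n_pairs += 1
    return total_diffs / (n_pairs * length)


# Toy alignment (hypothetical): three accessions, four sites.
print(nucleotide_diversity(["ACGT", "ACGA", "ACGT"]))
```

Lower π in a feature class than in random untranscribed intergenic sequence would suggest purifying selection within the species.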

Cited by

Non-Coding RNAs and their Integrated Networks.
Zhang P, Wu W, Chen Q, Chen M. J Integr Bioinform. 2019 Jul 13;16(3):20190027. doi: 10.1515/jib-2019-0027. PMID: 31301674. Free PMC article. Review.
Genome-wide characterization of intergenic polyadenylation sites redefines gene spaces in Arabidopsis thaliana.
Wu X, Zeng Y, Guan J, Ji G, Huang R, Li QQ. BMC Genomics. 2015 Jul 9;16(1):511. doi: 10.1186/s12864-015-1691-1. PMID: 26155789. Free PMC article.
Deep analysis of wild Vitis flower transcriptome reveals unexplored genome regions associated with sex specification.
Ramos MJ, Coito JL, Fino J, Cunha J, Silva H, de Almeida PG, Costa MM, Amâncio S, Paulo OS, Rocheta M. Plant Mol Biol. 2017 Jan;93(1-2):151-170. doi: 10.1007/s11103-016-0553-9. Epub 2016 Oct 24. PMID: 27778293.
Identification and functional annotation of long intergenic non-coding RNAs in Brassicaceae.
Palos K, Nelson Dittrich AC, Yu L, Brock JR, Railey CE, Wu HL, Sokolowska E, Skirycz A, Hsu PY, Gregory BD, Lyons E, Beilstein MA, Nelson ADL. Plant Cell. 2022 Aug 25;34(9):3233-3260. doi: 10.1093/plcell/koac166. PMID: 35666179. Free PMC article.
Robust predictions of specialized metabolism genes through machine learning.
Moore BM, Wang P, Fan P, Leong B, Schenck CA, Lloyd JP, Lehti-Shiu MD, Last RL, Pichersky E, Shiu SH. Proc Natl Acad Sci U S A. 2019 Feb 5;116(6):2344-2353. doi: 10.1073/pnas.1817074116. Epub 2019 Jan 23. PMID: 30674669. Free PMC article.
