Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs

Matthew J Hangauer¹, Ian W Vaughn, Michael T McManus

PMID: 23818866
PMCID: PMC3688513
DOI: 10.1371/journal.pgen.1003569

PLoS Genet. 2013 Jun;9(6):e1003569. doi: 10.1371/journal.pgen.1003569. Epub 2013 Jun 20.

Affiliation

¹ Diabetes Center, Department of Microbiology and Immunology, University of California, San Francisco, California, USA.

Abstract

Known protein coding gene exons compose less than 3% of the human genome. The remaining 97% is largely uncharted territory, with only a small fraction characterized. The recent observation of transcription in this intergenic territory has stimulated debate about the extent of intergenic transcription and whether these intergenic RNAs are functional. Here, using a large set of RNA-seq data covering a wide array of human tissue types, we directly observed that the majority of the genome is indeed transcribed, corroborating recent observations by the ENCODE project. Furthermore, using de novo transcriptome assembly of these RNA-seq data, we found that intergenic regions encode far more long intergenic noncoding RNAs (lincRNAs) than previously described, helping to resolve the discrepancy between the vast amount of observed intergenic transcription and the limited number of previously known lincRNAs. In total, we identified tens of thousands of putative lincRNAs expressed at a minimum of one copy per cell, significantly expanding upon prior lincRNA annotation sets. These lincRNAs are specifically regulated and conserved rather than being the product of transcriptional noise. In addition, lincRNAs are strongly enriched for trait-associated SNPs, suggesting a new mechanism by which intergenic trait-associated regions may function. These findings will enable the discovery and interrogation of novel intergenic functional elements.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. The human intergenic transcriptome.**
(A) 85.2% of the genome has evidence of transcription, with RNA-seq reads mapping directly to 78.9% of genomic sequence. The remaining genomic coverage comes from known genes, spliced ESTs and spliced cDNAs. The grey circle represents the portion of the genome (83.4%) that is uniquely mappable with RNA-seq reads. (B) Protein coding (NM gene) exon, intron and intergenic region expression level distribution. Regions that have high expression have a larger fraction of base calls appearing at higher read depths. Protein coding gene exons have the highest proportion of bases with high read depth, while introns and intergenic regions have relatively more bases of low read depth, although each contains many highly expressed regions. Base calls = (# of genomic positions at a specific read depth) × (read depth). (C) Most intergenic transcription is outside of annotated noncoding RNA genes. The fraction of intergenic base calls within RefSeq noncoding RNA genes (NR genes) is compared with the fraction in other intergenic regions. In (A–C), only uniquely mappable portions of the genome are considered (see Methods).
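
The base-call definition in panel (B) can be illustrated with a short sketch. This is not the authors' code: the function name `base_call_fractions` and the example depths are hypothetical, and the only thing taken from the legend is the definition base calls = (positions at a read depth) × (read depth).

```python
from collections import Counter


def base_call_fractions(depths):
    """Return {read_depth: fraction of all base calls} for per-position depths.

    Base calls at depth d = (number of positions with depth d) * d,
    as defined in the Figure 1B legend. Positions with zero depth
    contribute no base calls and are skipped.
    """
    positions_at_depth = Counter(d for d in depths if d > 0)
    base_calls = {d: n * d for d, n in positions_at_depth.items()}
    total = sum(base_calls.values())
    if total == 0:
        return {}
    return {d: calls / total for d, calls in sorted(base_calls.items())}


# Hypothetical per-base read depths for a short region (illustration only).
example_depths = [0, 1, 1, 2, 2, 2, 10]
fractions = base_call_fractions(example_depths)
for depth, fraction in fractions.items():
    print(f"depth {depth}: {fraction:.3f} of base calls")
```

Weighting each position by its depth is what makes a single deeply covered base count for more than several shallow ones, which is why highly expressed regions shift the distribution toward higher read depths.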

**Figure 2. Discovery of lincRNAs.**
(A) Discovery of lincRNAs consisted of *de novo* assembly of transcripts from RNA-seq data and compilation of annotated and putative noncoding RNAs (see Methods), followed by a series of filters designed to remove all known and novel protein coding transcripts and non-lincRNA noncoding RNAs. Only intergenic noncoding transcripts at least 200 nucleotides in length and expressed at a minimum of one copy per cell were ultimately annotated as lincRNAs. (B) Analysis of ribosomal profiling data reveals that the lincRNA catalog is composed of noncoding transcripts. The maximum 30 bp window ratio of HeLa ribosomal/RNA-seq reads is plotted for exons of lincRNAs, 3′ UTRs and coding sequences (CDS). *P<2.2E-16; whiskers extend +/−1.5 times interquartile range and dots represent outliers. (C) Computational analysis of the lincRNAs reveals a lack of protein coding capacity. The cumulative distributions of PhyloCSF scores for lincRNAs and RefSeq NM genes are plotted. Higher scores correspond to higher predicted coding capacity.
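
The final annotation criteria in panel (A) can be sketched as a simple predicate. This is a minimal illustration, not the published pipeline: the `Transcript` fields are hypothetical, and the upstream steps that would set the boolean flags (coding-potential filters, overlap with known genes) and estimate copies per cell are assumed to have run already. Only the 200-nucleotide and one-copy-per-cell thresholds come from the legend.

```python
from dataclasses import dataclass

MIN_LENGTH_NT = 200          # from the legend: at least 200 nucleotides
MIN_COPIES_PER_CELL = 1.0    # from the legend: at least one copy per cell


@dataclass
class Transcript:
    # All fields are hypothetical stand-ins for upstream pipeline output.
    name: str
    length_nt: int
    copies_per_cell: float
    is_intergenic: bool
    is_protein_coding: bool
    is_non_linc_ncrna: bool


def is_lincrna(t: Transcript) -> bool:
    """Apply the final criteria: intergenic, noncoding, long, expressed."""
    return (
        t.is_intergenic
        and not t.is_protein_coding
        and not t.is_non_linc_ncrna
        and t.length_nt >= MIN_LENGTH_NT
        and t.copies_per_cell >= MIN_COPIES_PER_CELL
    )


# Made-up candidates for illustration only.
candidates = [
    Transcript("tx_a", 850, 2.4, True, False, False),   # passes every filter
    Transcript("tx_b", 150, 5.0, True, False, False),   # too short
    Transcript("tx_c", 900, 0.3, True, False, False),   # too weakly expressed
    Transcript("tx_d", 1200, 3.0, True, True, False),   # protein coding
]
lincrnas = [t.name for t in candidates if is_lincrna(t)]
print(lincrnas)
```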

**Figure 3. LincRNAs possess features inconsistent with transcriptional noise.**
(A) ChIP-seq and RNA-seq data from IMR90 cells were analyzed for lincRNAs and RefSeq NM genes. *P = 4.01E-7, ** P = 4.52E-9, *** P = 2.43E-14, **** P<2.2E-16; P = 0.137 for lincRNAs H3K9me3; whiskers extend to +/−1.5 times interquartile range or most extreme data point. (B) LincRNA FPKM values in polyA+ specific and polyA− specific RNA-seq libraries in H9 ESCs and HeLa cells were compared. Transcripts with RNA-seq reads in all four datasets and with FPKM>1 in at least one of the two fractions for each cell type were analyzed (16,819 NM genes and 127 lincRNAs). Individual lincRNA and NM gene ratios of FPKMs in polyA+/polyA− fractions are plotted. Pearson correlation value for lincRNAs = 0.622 (P = 5.551E-15) and for NM genes = 0.702 (P<2.2E-16). (C) The maximally conserved 50 bp windows in each NM gene, lincRNA, and repetitive element (nonconserved control sequences) were determined. The maximally conserved 50 bp windows of 12 functional human lincRNAs are indicated for comparison.
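
One way to find a maximally conserved window like those in panel (C) is a sliding-window scan over per-base conservation scores. The sketch below assumes the window is ranked by its mean score, which the legend does not state; `max_window_mean` and the example scores are hypothetical.

```python
def max_window_mean(scores, window=50):
    """Return the highest mean score over any contiguous window.

    Assumes one conservation score per base. Returns None when the
    sequence is shorter than the window. Uses a running sum so the
    scan is linear in the sequence length.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(scores) < window:
        return None
    running = sum(scores[:window])
    best = running
    for i in range(window, len(scores)):
        # Slide one base to the right: add the new base, drop the oldest.
        running += scores[i] - scores[i - window]
        best = max(best, running)
    return best / window


# Made-up transcript: a fully conserved 50 bp stretch flanked by
# unconserved sequence.
example_scores = [0.0] * 60 + [1.0] * 50 + [0.0] * 10
print(max_window_mean(example_scores))
```

Scoring each transcript by its best window rather than its overall mean lets a short conserved element stand out even when the rest of the transcript is poorly conserved.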

**Figure 4. LincRNAs are enriched for trait-associated SNPs.**
The number of trait-associated SNPs within RefSeq NM gene exons, lincRNA exons, or background loci (nonexpressed intergenic sequence) per tested SNP in genome wide association studies is compared (see Methods). *P = 0.0173, **P<2.2E-16; error bars represent 95% binomial proportion confidence interval.
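
The error bars above are described as 95% binomial proportion confidence intervals. The sketch below computes one such interval with the Wilson score method. Which interval method the authors used is not stated here, so the choice of Wilson is an assumption, and the SNP counts in the example are made up.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion.

    successes: e.g. trait-associated SNPs falling in a region class
    trials:    e.g. all tested SNPs falling in that region class
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical counts for illustration only.
low, high = wilson_interval(successes=50, trials=1000)
print(f"proportion 0.050, 95% CI {low:.4f} to {high:.4f}")
```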
