PLoS Genet. 2013 Jun;9(6):e1003569.
doi: 10.1371/journal.pgen.1003569. Epub 2013 Jun 20.

Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs

Matthew J Hangauer et al. PLoS Genet. 2013 Jun.

Abstract

Known protein coding gene exons compose less than 3% of the human genome. The remaining 97% is largely uncharted territory, with only a small fraction characterized. The recent observation of transcription in this intergenic territory has stimulated debate about the extent of intergenic transcription and whether these intergenic RNAs are functional. Here, using a large set of RNA-seq data covering a wide array of human tissue types, we directly observed that the majority of the genome is indeed transcribed, corroborating recent observations by the ENCODE project. Furthermore, using de novo transcriptome assembly of these RNA-seq data, we found that intergenic regions encode far more long intergenic noncoding RNAs (lincRNAs) than previously described, helping to resolve the discrepancy between the vast amount of observed intergenic transcription and the limited number of previously known lincRNAs. In total, we identified tens of thousands of putative lincRNAs expressed at a minimum of one copy per cell, significantly expanding upon prior lincRNA annotation sets. These lincRNAs are specifically regulated and conserved rather than being the product of transcriptional noise. In addition, lincRNAs are strongly enriched for trait-associated SNPs, suggesting a new mechanism by which intergenic trait-associated regions may function. These findings will enable the discovery and interrogation of novel intergenic functional elements.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. The human intergenic transcriptome.
(A) 85.2% of the genome has evidence of transcription, with RNA-seq reads mapping directly to 78.9% of genomic sequence. The remaining genomic coverage consists of known genes, spliced ESTs and spliced cDNAs. The grey circle represents the portion of the genome (83.4%) that is uniquely mappable with RNA-seq reads. (B) Expression level distribution for protein coding (NM gene) exons, introns and intergenic regions. Regions with high expression have a larger fraction of base calls at higher read depths. Protein coding gene exons have the highest proportion of bases with high read depth, while introns and intergenic regions have relatively more bases of low read depth, though each contains many highly expressed regions. Base calls = (number of genomic positions at a specific read depth) × (read depth). (C) Most intergenic transcription is outside of annotated noncoding RNA genes. The fraction of intergenic base calls within RefSeq noncoding RNA genes (NR genes) is compared with the fraction in other intergenic regions. In (A–C), only uniquely mappable portions of the genome are considered (see Methods).
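The base-call definition in panel (B) can be illustrated with a small sketch. This is not the authors' code: the function name `base_call_fractions` and the toy depth list are hypothetical, and the sketch assumes per-position read depths are already available as a plain list of integers.

```python
from collections import Counter

def base_call_fractions(depths):
    """Return {read_depth: fraction of all base calls found at that depth}.

    Base calls at depth d = (number of positions with depth d) * d,
    following the definition in the Figure 1 caption.
    """
    # Count positions at each nonzero read depth.
    positions_at_depth = Counter(d for d in depths if d > 0)
    base_calls = {d: n * d for d, n in positions_at_depth.items()}
    total = sum(base_calls.values())
    if total == 0:
        return {}
    return {d: calls / total for d, calls in sorted(base_calls.items())}

# Hypothetical toy region: four positions at depth 1, two at depth 2, one at depth 10.
depths = [1, 1, 1, 1, 2, 2, 10]
fractions = base_call_fractions(depths)
```

In this toy region the single position at depth 10 contributes more base calls than the four positions at depth 1 combined, which is why highly expressed regions dominate the high-depth end of the distribution.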
Figure 2. Discovery of lincRNAs.
(A) Discovery of lincRNAs consisted of de novo assembly of transcripts from RNA-seq data and compilation of annotated and putative noncoding RNAs (see Methods), followed by a series of filters designed to remove all known and novel protein coding transcripts and non-lincRNA noncoding RNAs. Only intergenic noncoding transcripts at least 200 nucleotides in length and expressed at a level of at least one copy per cell were ultimately annotated as lincRNAs. (B) Analysis of ribosomal profiling data reveals that the lincRNA catalog is composed of noncoding transcripts. The maximum 30 bp window ratio of HeLa ribosomal/RNA-seq reads is plotted for exons of lincRNAs, 3′ UTRs and coding sequences (CDS). *P<2.2E-16; whiskers extend ±1.5 times the interquartile range and dots represent outliers. (C) Computational analysis of the lincRNAs reveals a lack of protein coding capacity. Cumulative distributions of PhyloCSF scores for lincRNAs and RefSeq NM genes are plotted. Higher scores correspond to higher predicted coding capacity.
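The final annotation criteria in panel (A) can be sketched as a simple predicate. This is a minimal illustration only: the `Transcript` record, its field names and the example transcripts are hypothetical, and the real pipeline applies a longer series of filters described in the Methods. Only the two thresholds (200 nucleotides, one copy per cell) come from the caption.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    length_nt: int
    is_intergenic: bool
    is_protein_coding: bool
    copies_per_cell: float

MIN_LENGTH_NT = 200          # from the caption: at least 200 nucleotides
MIN_COPIES_PER_CELL = 1.0    # from the caption: at least one copy per cell

def is_linc_rna(t: Transcript) -> bool:
    """Apply the final lincRNA criteria stated in the Figure 2 caption."""
    return (
        t.is_intergenic
        and not t.is_protein_coding
        and t.length_nt >= MIN_LENGTH_NT
        and t.copies_per_cell >= MIN_COPIES_PER_CELL
    )

# Hypothetical example transcripts.
candidates = [
    Transcript("tx_a", 850, True, False, 2.4),   # passes all criteria
    Transcript("tx_b", 150, True, False, 5.0),   # too short
    Transcript("tx_c", 1200, True, True, 3.1),   # protein coding
    Transcript("tx_d", 900, True, False, 0.3),   # expressed below one copy per cell
]
linc_rnas = [t.name for t in candidates if is_linc_rna(t)]
```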
Figure 3. LincRNAs possess features inconsistent with transcriptional noise.
(A) ChIP-seq and RNA-seq data from IMR90 cells were analyzed for lincRNAs and RefSeq NM genes. *P = 4.01E-7, ** P = 4.52E-9, *** P = 2.43E-14, **** P<2.2E-16; P = 0.137 for lincRNAs H3K9me3; whiskers extend to ±1.5 times the interquartile range or the most extreme data point. (B) LincRNA FPKM values in polyA+ specific and polyA− specific RNA-seq libraries in H9 ESCs and HeLa cells were compared. Transcripts with RNA-seq reads in all four datasets and with FPKM>1 in at least one of the two fractions for each cell type were analyzed (16,819 NM genes and 127 lincRNAs). Individual lincRNA and NM gene ratios of FPKMs in polyA+/polyA− fractions are plotted. Pearson correlation value for lincRNAs = 0.622 (P = 5.551E-15) and for NM genes = 0.702 (P<2.2E-16). (C) The maximally conserved 50 bp windows in each NM gene, lincRNA, and repetitive element (nonconserved control sequences) were determined. The maximally conserved 50 bp windows of 12 functional human lincRNAs are indicated for comparison.
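One way to find a maximally conserved 50 bp window, as in panel (C), is a sliding-window scan over per-base conservation scores. The sketch below assumes the window score is the mean of per-base scores; the caption does not say which conservation metric or aggregation the authors used, so that choice, the function name `max_window_mean` and the toy scores are all assumptions.

```python
def max_window_mean(scores, window=50):
    """Return the highest mean score over any contiguous window of the given size.

    Returns None when the sequence is shorter than the window.
    """
    if len(scores) < window:
        return None
    # Running sum: slide the window one base at a time.
    current = sum(scores[:window])
    best = current
    for i in range(window, len(scores)):
        current += scores[i] - scores[i - window]
        best = max(best, current)
    return best / window

# Hypothetical transcript: a fully conserved 50 bp block flanked by unconserved sequence.
scores = [0.0] * 100 + [1.0] * 50 + [0.0] * 100
best_window = max_window_mean(scores, window=50)
```

Scoring by the best window rather than the whole-transcript mean lets a short conserved element stand out even when the rest of the transcript is poorly conserved.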
Figure 4. LincRNAs are enriched for trait-associated SNPs.
The number of trait-associated SNPs within RefSeq NM gene exons, lincRNA exons, or background loci (nonexpressed intergenic sequence) per tested SNP in genome wide association studies is compared (see Methods). *P = 0.0173, **P<2.2E-16; error bars represent 95% binomial proportion confidence interval.
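The error bars above are 95% binomial proportion confidence intervals. The caption does not state which interval method was used, so the sketch below uses the Wilson score interval purely as an assumption; the function name `wilson_interval` and the example counts are hypothetical and are not data from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.959964 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2_over_n = z * z / trials
    denom = 1.0 + z2_over_n
    center = (p + z2_over_n / 2.0) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half_width, center + half_width

# Hypothetical counts: 50 trait-associated SNPs out of 100 tested SNPs.
low, high = wilson_interval(50, 100)
```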
