Genome Res. 2012 Sep;22(9):1646-57.
doi: 10.1101/gr.134767.111.

Long noncoding RNAs are rarely translated in two human cell lines

Balázs Bánfai et al. Genome Res. 2012 Sep.

Abstract

Data from the Encyclopedia of DNA Elements (ENCODE) project show over 9640 human genome loci classified as long noncoding RNAs (lncRNAs), yet only ~100 have been deeply characterized to determine their role in the cell. To measure the protein-coding output from these RNAs, we jointly analyzed two recent data sets produced in the ENCODE project: tandem mass spectrometry (MS/MS) data mapping expressed peptides to their encoding genomic loci, and RNA-seq data generated by ENCODE in long polyA+ and polyA- fractions in the cell lines K562 and GM12878. We used the machine-learning algorithm RuleFit3 to regress the peptide data against RNA expression data. The most important covariate for predicting translation was, surprisingly, the Cytosol polyA- fraction in both cell lines. LncRNAs are ~13-fold less likely to produce detectable peptides than similar mRNAs, indicating that ~92% of GENCODE v7 lncRNAs are not translated in these two ENCODE cell lines. Intersecting 9640 lncRNA loci with 79,333 peptides yielded 85 unique peptides matching 69 lncRNAs. Most cases were due to a coding transcript misannotated as lncRNA. Two exceptions were an unprocessed pseudogene and a bona fide lncRNA gene, both with open reading frames (ORFs) compromised by upstream stop codons. All potentially translatable lncRNA ORFs had only a single peptide match, indicating low protein abundance and/or false-positive peptide matches. We conclude that with very few exceptions, ribosomes are able to distinguish coding from noncoding transcripts and, hence, that ectopic translation and cryptic mRNAs are rare in the human lncRNAome.
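The intersection step described in the abstract (lncRNA loci against genome-mapped peptides) can be illustrated with a minimal sketch. This is not the authors' pipeline: the tuple layout, the half-open coordinate convention, the function names, and the toy records below are all assumptions made for illustration, and a real analysis would work per exon with an indexed interval structure rather than a nested loop.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two half-open intervals [start, end) share a base."""
    return a_start < b_end and b_start < a_end


def intersect_peptides_with_loci(peptides, loci):
    """Map each locus ID to the set of peptide sequences overlapping it.

    peptides: iterable of (sequence, chrom, strand, start, end)
    loci:     iterable of (locus_id, chrom, strand, start, end)
    Both layouts are hypothetical, chosen only for this sketch.
    """
    # Group loci by chromosome and strand so each peptide is only
    # compared against loci it could possibly overlap.
    loci_by_key = defaultdict(list)
    for locus_id, chrom, strand, start, end in loci:
        loci_by_key[(chrom, strand)].append((locus_id, start, end))

    hits = defaultdict(set)
    for seq, chrom, strand, start, end in peptides:
        for locus_id, l_start, l_end in loci_by_key.get((chrom, strand), []):
            if overlaps(start, end, l_start, l_end):
                hits[locus_id].add(seq)
    return dict(hits)


# Toy data (invented, not from the paper).
toy_loci = [
    ("lnc1", "chr1", "+", 100, 200),
    ("lnc2", "chr1", "-", 150, 300),
]
toy_peptides = [
    ("PEPTIDEK", "chr1", "+", 120, 144),  # inside lnc1
    ("SAMPLER", "chr1", "-", 290, 311),   # overlaps the end of lnc2
    ("NOHITK", "chr2", "+", 120, 144),    # different chromosome
]

toy_hits = intersect_peptides_with_loci(toy_peptides, toy_loci)
unique_peptides = set().union(*toy_hits.values())
print(len(unique_peptides), "unique peptides matching", len(toy_hits), "loci")
```

Counting unique peptide sequences and matched loci separately mirrors how the abstract reports its result (85 unique peptides matching 69 lncRNAs), since one locus can carry several peptides.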

Figures

Figure 1.
Expression levels are correlated with peptide detectability via MS/MS. Peptide detectability (y-axis) as a function of RNA expression levels (RPKM, x-axis) in GM12878 (A) and K562 (B) whole-cell RNA samples. We identified peptides for only 1% of genes expressed at RPKM <0.1, whereas we detected peptides for ∼40% of genes expressed above RPKM 100. In general, the likelihood of detection rises as expression level rises.
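The relationship in Figure 1 amounts to a detection rate computed within expression bins. The sketch below shows one way such a curve could be tabulated; the order-of-magnitude (log10) binning, the function name, and the toy gene list are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict


def detection_rate_by_bin(genes):
    """Fraction of genes with a detected peptide, per log10(RPKM) bin.

    genes: iterable of (rpkm, detected) pairs, where detected is a bool.
    Returns {bin: rate}, where bin b covers 10**b <= RPKM < 10**(b + 1).
    Genes with RPKM <= 0 are skipped because log10 is undefined there.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [genes, genes detected]
    for rpkm, detected in genes:
        if rpkm <= 0:
            continue
        b = math.floor(math.log10(rpkm))
        counts[b][0] += 1
        counts[b][1] += int(detected)
    return {b: d / n for b, (n, d) in sorted(counts.items())}


# Toy data (invented): detection becomes more likely at higher RPKM.
toy_genes = [
    (0.05, False), (0.05, False),
    (5.0, False), (5.0, True),
    (500.0, True), (500.0, True),
    (0.0, False),  # unexpressed gene, skipped
]

toy_rates = detection_rate_by_bin(toy_genes)
for b, rate in toy_rates.items():
    print(f"10^{b} <= RPKM < 10^{b + 1}: {rate:.2f}")
```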
Figure 2.
Visualizing some of the properties of the model of RNA-seq and MS/MS data. (A,B) Relative importance of each of the covariates (RNA fractions). (C) Relative partial dependence of the likelihood of detecting at least one uniquely mapping peptide on the polyA+ Whole Cell and polyA− Cytosol fractions from GM12878. This is known as a “partial dependence plot.” We note that detectable polyA− Cytosol expression is nearly a prerequisite to detecting uniquely mapping peptides, even when polyA+ Whole Cell expression is extremely high. (D) Partial dependence plot for the total RNA nucleolus and chromatin fractions from K562. (E,F) “Interaction strength plots.” These show the relative importance of considering the dependence between pairs of covariates (fractions) in the overall predictive model. (Red bars) Standard deviation under the null of no association.
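For readers unfamiliar with partial dependence plots, the quantity plotted is the model's prediction averaged over the data set while one covariate is held at a fixed value. The sketch below computes this for a hand-written toy model. It does not use or reproduce RuleFit3; the covariate names `wc_plus` and `cyt_minus`, the toy prediction rule, and the data rows are hypothetical stand-ins for the polyA+ Whole Cell and polyA− Cytosol fractions.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value.

    predict: function taking a dict of covariates and returning a number
    rows:    list of dicts, one per gene
    feature: name of the covariate to vary
    grid:    values at which to evaluate the partial dependence
    Returns a list of (value, mean_prediction) pairs.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows are untouched
            modified[feature] = value  # hold the covariate at the grid value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_predict(row):
    """Invented rule: no cytosolic polyA- signal means no detectable peptide."""
    if row["cyt_minus"] <= 0:
        return 0.0
    return min(1.0, 0.2 * row["wc_plus"])


# Toy expression values (invented).
toy_rows = [
    {"wc_plus": 1.0, "cyt_minus": 0.0},
    {"wc_plus": 4.0, "cyt_minus": 2.0},
]

toy_curve = partial_dependence(toy_predict, toy_rows, "cyt_minus", [0.0, 1.0])
for value, mean_pred in toy_curve:
    print(f"cyt_minus={value}: mean predicted detectability {mean_pred:.2f}")
```

In this toy model the curve is flat at zero whenever `cyt_minus` is zero, regardless of `wc_plus`, which is the kind of near-prerequisite pattern the caption describes for panel C.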
Figure 3.
Manual annotation of non-singleton peptides aligning to GENCODE lncRNA exons: case studies. (A–C) An untranslatable non-singleton peptide encoded downstream from an unprocessed pseudogene. (A) BL2SEQ TBLASTN peptide-to-lncRNA alignment. (B) NCBI ORF Finder view of the translation containing this peptide and upstream stops. (C) UCSC Genome Browser view of the peptide (red box). Direction is positive strand (data not shown). (D–F) An untranslatable non-singleton peptide encoded by a standalone lncRNA exon. (D) BL2SEQ TBLASTN peptide-to-lncRNA alignment. (E) NCBI ORF Finder view of the translation containing this peptide and upstream stops. (F) UCSC Genome Browser view of the peptide (red box) encoded on the negative strand of the genome.
Figure 4.
Manual annotation of translatable peptides aligning to GENCODE lncRNA exons: case studies. (A–C) A translatable singleton peptide encoded at a bona fide lncRNA locus. (A) BL2SEQ TBLASTN peptide-to-lncRNA alignment. (B) NCBI ORF Finder view of the translation containing this peptide (highlighted) including its furthest upstream ATG and its downstream stop (red rectangles). (C) UCSC Genome Browser view of the peptide (red box). Direction is negative strand. The singleton peptide is encoded by exon 2 of a GENCODE lncRNA that is divergently transcribed in the antisense orientation relative to the known gene CACNA1G. (D,E) A translatable non-singleton peptide traceable to a GENCODE misclassification of a protein-coding transcript of the EMG1 gene that had been assigned an lncRNA biotype. (D) Three peptides (red) that are in-frame to the EMG1 known protein (full-length shown) but are assigned to the GENCODE lncRNA ENST00000439543.2. (E) UCSC Genome Browser view of the peptide (red box). Direction is positive strand. The lncRNA is a noncoding transcript from the coding EMG1 locus. However, the peptides correspond to known coding exons of the EMG1 RefSeq, not solely to exons of the noncoding transcript. (E1) Peptide matches the common coding mRNA exon, not a unique exon of the lncRNA (this is true in all cases). (E2) GENCODE v7 lncRNA match.

