Long noncoding RNAs are rarely translated in two human cell lines

Balázs Bánfai¹, Hui Jia, Jainab Khatun, Emily Wood, Brian Risk, William E Gundling Jr, Anshul Kundaje, Harsha P Gunawardena, Yanbao Yu, Ling Xie, Krzysztof Krajewski, Brian D Strahl, Xian Chen, Peter Bickel, Morgan C Giddings, James B Brown, Leonard Lipovich

PMID: 22955977
PMCID: PMC3431482
DOI: 10.1101/gr.134767.111

Balázs Bánfai et al. Genome Res. 2012 Sep;22(9):1646-57.

Affiliation

¹ Department of Statistics, University of California, Berkeley, California 94720, USA.

Abstract

Data from the Encyclopedia of DNA Elements (ENCODE) project show over 9640 human genome loci classified as long noncoding RNAs (lncRNAs), yet only ~100 have been deeply characterized to determine their role in the cell. To measure the protein-coding output from these RNAs, we jointly analyzed two recent data sets produced in the ENCODE project: tandem mass spectrometry (MS/MS) data mapping expressed peptides to their encoding genomic loci, and RNA-seq data generated by ENCODE in long polyA+ and polyA- fractions in the cell lines K562 and GM12878. We used the machine-learning algorithm RuleFit3 to regress the peptide data against RNA expression data. The most important covariate for predicting translation was, surprisingly, the Cytosol polyA- fraction in both cell lines. LncRNAs are ~13-fold less likely to produce detectable peptides than similar mRNAs, indicating that ~92% of GENCODE v7 lncRNAs are not translated in these two ENCODE cell lines. Intersecting 9640 lncRNA loci with 79,333 peptides yielded 85 unique peptides matching 69 lncRNAs. Most cases were due to a coding transcript misannotated as lncRNA. Two exceptions were an unprocessed pseudogene and a bona fide lncRNA gene, both with open reading frames (ORFs) compromised by upstream stop codons. All potentially translatable lncRNA ORFs had only a single peptide match, indicating low protein abundance and/or false-positive peptide matches. We conclude that with very few exceptions, ribosomes are able to distinguish coding from noncoding transcripts and, hence, that ectopic translation and cryptic mRNAs are rare in the human lncRNAome.
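The ~92% figure is consistent with a simple reading of the ~13-fold reduction: if lncRNAs yield detectable peptides at about 1/13 the rate of comparable mRNAs, then roughly 1 - 1/13 of them yield none. That reading is an assumption made here for illustration, not a derivation stated in the abstract. The sketch below works through it and also computes the share of lncRNA loci with any peptide match from the counts reported above (69 of 9640). The function names are hypothetical.

```python
def untranslated_fraction(fold_reduction: float) -> float:
    """Fraction left undetected if detection is `fold_reduction` times less likely.

    Assumes the comparable mRNA detection rate is normalized to 1,
    which is an illustrative simplification.
    """
    return 1.0 - 1.0 / fold_reduction


def matched_locus_share(matched_loci: int, total_loci: int) -> float:
    """Share of annotated loci that had at least one peptide match."""
    return matched_loci / total_loci


# Numbers taken from the abstract: ~13-fold, 69 matched lncRNAs, 9640 loci.
print(f"untranslated (approx.): {untranslated_fraction(13):.1%}")
print(f"loci with a peptide match: {matched_locus_share(69, 9640):.2%}")
```

Note that the second number counts raw peptide matches before manual curation, so it is an upper bound on the share of lncRNAs with evidence of translation in this data set.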

Figures

**Figure 1.**
Expression levels are correlated with peptide detectability via MS/MS. Peptide detectability (y-axis) as a function of RNA expression levels (RPKM, x-axis) in GM12878 (A) and K562 (B) whole-cell RNA samples. We identified peptides for only 1% of genes expressed at RPKM <0.1, whereas we detected peptides for ∼40% of genes expressed above RPKM 100. In general, the likelihood of detection rises as expression level rises.
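The relationship in Figure 1 can be sketched as a binned detection rate: group genes by RPKM and compute, per bin, the fraction with at least one detected peptide. This is a minimal illustration and not the authors' pipeline. The bin edges, the function name, and the toy records are all assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def detection_rate_by_bin(genes, edges):
    """Return {bin_index: fraction of genes with a detected peptide}.

    genes: iterable of (rpkm, detected) pairs.
    edges: sorted RPKM boundaries; bin 0 is below edges[0],
           bin len(edges) is at or above edges[-1].
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [total genes, genes detected]
    for rpkm, detected in genes:
        b = bisect_right(edges, rpkm)
        counts[b][0] += 1
        counts[b][1] += int(detected)
    return {b: hits / total for b, (total, hits) in sorted(counts.items())}


# Hypothetical toy records, not data from the paper.
toy_genes = [(0.05, False), (0.05, False), (5.0, True), (5.0, False), (200.0, True)]
toy_edges = [0.1, 1.0, 10.0, 100.0]
rates = detection_rate_by_bin(toy_genes, toy_edges)
print(rates)
```

Log-spaced edges are used because RPKM values span several orders of magnitude, which matches the ranges quoted in the caption (below 0.1 and above 100).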

**Figure 2.**
Visualizing some of the properties of the model of RNA-seq and MS/MS data. (A,B) Relative importance of each of the covariates (RNA fractions). (C) Relative partial dependence of the likelihood of detecting at least one uniquely mapping peptide on the polyA+ Whole Cell and polyA− Cytosol fractions from GM12878. This is known as a “partial dependence plot.” We note that detectable polyA− Cytosol expression is nearly a prerequisite to detecting uniquely mapping peptides, even when polyA+ Whole Cell expression is extremely high. (D) Partial dependence plot for the total RNA nucleolus and chromatin fractions from K562. (E,F) “Interaction strength plots.” These show the relative importance of considering the dependence between pairs of covariates (fractions) in the overall predictive model. (Red bars) Standard deviation under the null of no association.
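For readers unfamiliar with partial dependence plots, the quantity plotted in panels C and D can be sketched as follows: fix one covariate at a grid value for every observation, average the model's prediction, and repeat across the grid. The code below is a generic illustration and does not use RuleFit3. The toy model, the covariate names `cyt_minus` and `wc_plus`, and the data are assumptions chosen to mimic the "near prerequisite" behavior described in the caption.

```python
def partial_dependence(model, rows, feature, grid):
    """Average model prediction with `feature` forced to each grid value.

    model: callable taking a dict of covariates and returning a number.
    rows:  list of dicts, one per observation.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay unchanged
            modified[feature] = value  # override only the feature of interest
            total += model(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical model: no detection without cytosolic polyA- signal."""
    if row["cyt_minus"] <= 0:
        return 0.0
    return min(1.0, 0.1 * row["wc_plus"])


# Hypothetical observations, not data from the paper.
toy_rows = [{"cyt_minus": 0, "wc_plus": 5}, {"cyt_minus": 1, "wc_plus": 5}]
pd_curve = partial_dependence(toy_model, toy_rows, "cyt_minus", [0, 1])
print(pd_curve)
```

In this toy setting the curve stays at zero whenever `cyt_minus` is zero regardless of `wc_plus`, which is the shape of dependence the caption describes for the GM12878 fractions.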

**Figure 3.**
Manual annotation of non-singleton peptides aligning to GENCODE lncRNA exons: case studies. (*A–C*) An untranslatable non-singleton peptide encoded downstream from an unprocessed pseudogene. (A) BL2SEQ TBLASTN peptide-to-lncRNA alignment. (B) NCBI ORF Finder view of the translation containing this peptide and upstream stops. (C) UCSC Genome Browser view of the peptide (red box). Direction is positive strand (data not shown). (*D–F*) An untranslatable non-singleton peptide encoded by a standalone lncRNA exon. (D) BL2SEQ TBLASTN peptide-to-lncRNA alignment. (E) NCBI ORF Finder view of the translation containing this peptide and upstream stops. (F) UCSC Genome Browser view of the peptide (red box) encoded on the negative strand of the genome.

**Figure 4.**
Manual annotation of translatable peptides aligning to GENCODE lncRNA exons: case studies. (*A–C*) A translatable singleton peptide encoded at a bona fide lncRNA locus. (A) BL2SEQ TBLASTN peptide-to-lncRNA alignment. (B) NCBI ORF Finder view of the translation containing this peptide (highlighted) including its furthest upstream ATG and its downstream stop (red rectangles). (C) UCSC Genome Browser view of the peptide (red box). Direction is negative strand. The singleton peptide is encoded by exon 2 of a GENCODE lncRNA that is divergently transcribed in the antisense orientation relative to the known gene *CACNA1G*. (D,E) A translatable non-singleton peptide traceable to a GENCODE misclassification of a protein-coding transcript of the *EMG1* gene that had been assigned an lncRNA biotype. (D) Three peptides (red) that are in-frame to the EMG1 known protein (full-length shown) but are assigned to the GENCODE lncRNA *ENST00000439543.2*. (E) UCSC Genome Browser view of the peptide (red box). Direction is positive strand. The lncRNA is a noncoding transcript from the coding *EMG1* locus. However, the peptides correspond to known coding exons of the *EMG1* RefSeq, not solely to exons of the noncoding transcript. (E1) Peptide matches the common coding mRNA exon, not a unique exon of the lncRNA (this is true in all cases). (E2) GENCODE v7 lncRNA match.

Grants and funding

K99 HG006698/HG/NHGRI NIH HHS/United States
