Mol Cell. 2015 Dec 3;60(5):816-827.
doi: 10.1016/j.molcel.2015.11.013.

A Regression-Based Analysis of Ribosome-Profiling Data Reveals a Conserved Complexity to Mammalian Translation

Alexander P Fields et al. Mol Cell. 2015.

Abstract

A fundamental goal of genomics is to identify the complete set of expressed proteins. Automated annotation strategies rely on assumptions about protein-coding sequences (CDSs), e.g., they are conserved, do not overlap, and exceed a minimum length. However, an increasing number of newly discovered proteins violate these rules. Here we present an experimental and analytical framework, based on ribosome profiling and linear regression, for systematic identification and quantification of translation. Application of this approach to lipopolysaccharide-stimulated mouse dendritic cells and HCMV-infected human fibroblasts identifies thousands of novel CDSs, including micropeptides and variants of known proteins, that bear the hallmarks of canonical translation and exhibit translation levels and dynamics comparable to those of annotated CDSs. Remarkably, many translation events are identified in both mouse and human cells even when the peptide sequence is not conserved. Our work thus reveals an unexpected complexity to mammalian translation suited to provide both conserved regulatory and protein-based functions.

Figures

Figure 1. ORF-RATER identifies translated ORFs comprehensively in mouse BMDCs
(A) Naïve BMDCs were isolated and stimulated with LPS for up to 12 hours. Ribosome profiling data sets were collected at the nine times indicated prior to or during stimulation, and mass spectrometry data sets were collected at 0, 2, 6, and 12 hours. (B) The average read density (“metagene”) profiles of the four BMDC ribosome profiling datasets near annotated start codons (left), at the center of annotated CDSs (center), and near annotated stop codons (right) reveal features of translation highlighted by each treatment (harringtonine [Harr], lactimidomycin [LTM], cycloheximide [CHX], or no-drug [ND]). The highlighted green and red regions indicate annotated start and stop codons, respectively. (C) Top, observed RPFs within the annotated CDS of chemokine ligand 17 (Ccl17). Ribosome density at an AUG codon 10 codons downstream from the canonical AUG following Harr treatment suggests that a truncated form lacking the N-terminal 10 amino acids may be translated in addition to the canonical form. Bottom, linear regression of the observed RPFs in the ND condition against the expected profiles of the two candidate ORFs suggests that both may be translated. (D) The ORF-RATER pipeline globally evaluates translation. NUG-initiated ORFs are identified from transcript sequences assembled from BMDC RNA-seq data and the Ensembl and UCSC Known Genes databases. After removing ORFs whose translation initiation sites lack ribosome density following Harr or LTM treatment, the remaining ORFs are analyzed by linear regression (C), the results of which are assayed for significance using a random forest classifier. See Figure S1 for the full distribution of scores.
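The regression described in panel (C) can be illustrated with a minimal sketch. It assumes flat (uniform) expected profiles for each candidate ORF and uses toy read counts; the function name `least_squares` and all numbers are hypothetical and are not taken from ORF-RATER itself, whose expected profiles and fitting procedure may differ.

```python
def least_squares(profiles, observed):
    """Ordinary least squares: weights w minimizing
    sum_pos (sum_k w[k] * profiles[k][pos] - observed[pos])**2."""
    k = len(profiles)
    # Build the normal equations (P P^T) w = P y as an augmented matrix.
    aug = []
    for i in range(k):
        row = [sum(a * b for a, b in zip(profiles[i], profiles[j])) for j in range(k)]
        row.append(sum(a * y for a, y in zip(profiles[i], observed)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("candidate profiles are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Toy locus of 20 codons (hypothetical data, not from the paper).
# Candidate 1: full-length ORF covering codons 0-19.
# Candidate 2: truncated ORF starting 10 codons downstream (codons 10-19).
full_length = [1.0] * 20
truncated = [0.0] * 10 + [1.0] * 10

# Observed density: 2 reads per codon upstream of the internal start,
# 5 reads per codon downstream of it.
observed = [2.0] * 10 + [5.0] * 10

weights = least_squares([full_length, truncated], observed)
print(weights)  # roughly [2.0, 3.0]: both candidates receive positive weight
```

In this toy case the extra density downstream of the internal start cannot be explained by the full-length ORF alone, so the fit assigns a positive weight to the truncated candidate as well.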
Figure 2. Previously unannotated translated CDSs in BMDCs fall into several classes, each of which displays patterns consistent with active translation
(A) ORF-RATER identifies 13,075 high-confidence translated ORFs. The majority of these are previously annotated CDSs, and the majority of the remainder are variants of canonical CDSs that share portions of the coding sequence. ORFs distinct from annotated CDSs occur primarily in 5′ UTRs, though a sizable subset are found on transcripts without previously appreciated coding potential or in alternate frames of canonical CDSs. See Figure S1A for the distribution of ORF-RATER scores for each type, and Table S1 for a complete list of all high-confidence CDSs. (B) Metagene profiles of each class of new CDS display the hallmarks of translation, including peaks of density at newly identified start codons following Harr treatment, peaks of density at stop codons under ND treatment, and greater read density in between. Translated truncations (top left) and extensions (top right) display peaks of density at both the canonical and novel translation initiation sites, suggesting that both are used on average. The average read density in all translated regions shows 3-nucleotide periodicity in the expected reading frame, with the exception of internal CDSs, for which the reading frame is on average a superposition of the canonical and alternative frames. Metagene profiles for the LTM and CHX datasets are plotted in Figure S2.
Figure 3. Novel CDSs include many short ORFs and variants of canonical proteins missed by prior annotations
(A) Compared to the distribution of ORF sizes on real or scrambled transcripts, translated CDSs are highly enriched for long ORFs, but to a lesser extent than prior annotations. (B) Nearly all short translated CDSs are distinct from canonical proteins, and nearly all long translated CDSs are canonical proteins or their variants. (C) Length of extended (left) or truncated (right) regions is plotted as a function of the length of the canonical protein. Cumulative distributions are plotted to the right or above each scatter plot. For truncated CDSs, the dashed green line indicates the position beyond which the entire CDS would be removed.
Figure 4. Novel CDSs are translated at similar levels and with similar dynamics to annotated CDSs in response to LPS stimulation
(A) Cumulative distributions of translation rates for each class of translated CDS. (B) Cumulative distributions of maximal fold-change across the time course of LPS stimulation. (C) Hierarchically clustered heat map of dynamically regulated CDSs showing translation rates at indicated intervals of LPS stimulation. Each row represents one CDS. Three highlighted clusters show CDSs whose translation is maximal at early (top), intermediate (center), or late (bottom) time points. Each cluster contains a mixture of novel and annotated CDSs, indicated by the colored lines at right. RPKM values for all CDSs are included in Table S1. (D) GO term enrichments for the annotated genes contained in the three clusters highlighted in (C).
Figure 5. Many novel translated CDSs are seen in both human and mouse cells
(A) Sites of productive translation initiation in both mouse BMDCs and HFFs encode proteins of similar length regardless of whether the protein had been previously annotated. Many of these previously unannotated proteins do not appear to be conserved at the level of protein sequence (Figure S3A). (B) Many loci encode multiple corresponding CDSs in both mice and humans. (C) RPF density at the Socs1 locus in mouse BMDCs (top) and HFFs (bottom) shows a similar organization of translated ORFs. (D) Zdhhc3 encodes four CDSs translated in both mouse BMDCs and HFFs. The longest translated uORF does not appear to be conserved for protein function despite being translated in both species; its multiple sequence alignment is included in Figure S3B. Reporter constructs indicate that the AUGs upstream of the one initiating the truncated form of Zdhhc3, including the canonical start codon, are repressive (Figure S4A).
Figure 6. A significant subset of novel CDSs display signatures of codon-level conservation
(A) For each threshold value, the number of novel CDSs of each type whose PhyloCSF score exceeds that threshold is plotted. PhyloCSF scores are calculated for only those codons non-overlapping with canonical CDSs. Scores indicate the log-likelihood that the ancestral locus was protein-coding; values of 10 or 20 correspond to 10:1 or 100:1 likelihood, respectively. The legend indicates the total number of ORFs for which a sequence alignment could be obtained, including those assigned negative PhyloCSF scores. (B) Cumulative distributions of per-codon PhyloCSF scores for translated uORFs and extensions of canonical CDSs. In both cases, PhyloCSF scores are significantly greater at translated CDSs relative to non-translated CDSs of the same type. Intergenic ORFs receive significantly lower scores and serve as negative controls. Because PhyloCSF scores vary linearly with the length of the sequence alignment, when comparing ORFs of different sizes, each score is normalized by the number of codons considered. See also Figure S3A. (C) RPF density at the mouse BC029722 (top) and human MMP24-AS1 (bottom) genes shows translation of a previously unannotated CDS that is highly conserved phylogenetically. The multiple sequence alignment is shown in Figure S6A. A C-terminal eGFP fusion of human MMP24-AS1 was found to localize to the ER and Golgi apparatus (Figure S6B). (D) The Thp5 gene encodes a previously unannotated, conserved 68-amino acid protein in both mouse (top) and human (bottom). Two peptides from the mouse protein are identified by MS; full peptide and protein MS results are listed in Tables S2 and S3, and quality metrics are plotted in Figure S5. (E) Translation initiation of the Fxr2 gene occurs at an upstream GUG codon in both mouse BMDCs (top) and HFFs (bottom). In both cases, the canonical AUG initiation site appears to be unused. The translated region upstream of the canonical AUG appears to be highly conserved, and encodes multiple peptides detected by MS (peptide sequences highlighted in orange and blue). Translation initiation of Fxr2 via a GUG codon was confirmed via transient transfection with fluorescent reporter constructs (Figure S4B).
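The score scale in panel (A) and the length normalization in panel (B) can be made concrete with a short sketch. It assumes only what the caption states: a score of 10 corresponds to a 10:1 likelihood ratio and 20 to 100:1 (that is, ten times the base-10 logarithm of the ratio), and scores are divided by the number of codons considered. The function names and example values are hypothetical.

```python
def likelihood_ratio(score):
    """Convert a score on the caption's scale (10 -> 10:1, 20 -> 100:1)
    into a coding versus non-coding likelihood ratio."""
    return 10 ** (score / 10)


def per_codon_score(score, n_codons):
    """Normalize a score by the number of codons considered, so that
    ORFs of different lengths can be compared."""
    if n_codons <= 0:
        raise ValueError("n_codons must be positive")
    return score / n_codons


# Hypothetical example: a 15-codon uORF with a total score of 30.
print(likelihood_ratio(30))     # likelihood ratio implied by the total score
print(per_codon_score(30, 15))  # score per codon
```

A long ORF can accumulate a high total score from weak per-codon evidence, which is why the per-codon value is the more appropriate quantity when comparing short uORFs with longer extensions.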
