Cell Syst. 2018 Feb 28;6(2):180-191.e4.
doi: 10.1016/j.cels.2017.12.007. Epub 2018 Jan 17.

Scikit-ribo Enables Accurate Estimation and Robust Modeling of Translation Dynamics at Codon Resolution

Han Fang et al. Cell Syst.

Abstract

Ribosome profiling (Ribo-seq) is a powerful technique for measuring protein translation; however, sampling errors and biological biases are prevalent and poorly understood. Addressing these issues, we present Scikit-ribo (https://github.com/schatzlab/scikit-ribo), an open-source analysis package for accurate genome-wide A-site prediction and translation efficiency (TE) estimation from Ribo-seq and RNA sequencing data. Scikit-ribo accurately identifies A-site locations and reproduces codon elongation rates using several digestion protocols (r = 0.99). Next, we show that the commonly used TE estimation derived from reads per kilobase of transcript per million mapped reads (RPKM) is prone to biases, especially for low-abundance genes. Scikit-ribo introduces a codon-level generalized linear model with ridge penalty that correctly estimates TE, while accommodating variable codon elongation rates and mRNA secondary structure. This corrects the TE errors for over 2,000 genes in S. cerevisiae, which we validate using mass spectrometry of protein abundances (r = 0.81), and allows us to determine the Kozak-like sequence directly from Ribo-seq. We conclude with an analysis of coverage requirements needed for robust codon-level analysis and quantify the artifacts that can occur from cycloheximide treatment.

Keywords: Ribo-seq; bioinformatics; machine learning; statistical method; translation.
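For readers unfamiliar with the RPKM-derived TE that the abstract criticizes, the following is a minimal sketch of how such an estimate is commonly computed: the ratio of Ribo-seq RPKM to RNA-seq RPKM per gene. The function names, the toy counts, and the gene lengths are illustrative assumptions and are not taken from Scikit-ribo.

```python
import numpy as np

def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    total_reads = counts.sum()
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return counts * 1e9 / (lengths_bp * total_reads)

def rpkm_derived_log2_te(ribo_counts, rna_counts, lengths_bp):
    """log2 of the per-gene ratio of Ribo-seq RPKM to RNA-seq RPKM."""
    ribo = rpkm(ribo_counts, lengths_bp)
    rna = rpkm(rna_counts, lengths_bp)
    # Genes with zero footprints yield -inf; silence the divide warning.
    with np.errstate(divide="ignore"):
        return np.log2(ribo / rna)

# Hypothetical toy data: three genes, the last one with low abundance.
lengths = np.array([1000, 2000, 500])
ribo = np.array([100, 400, 2])
rna = np.array([200, 400, 10])
log2_te = rpkm_derived_log2_te(ribo, rna, lengths)
```

Because the estimate is a ratio of two count-based quantities, a gene with only a handful of footprints (the third gene above) can shift substantially with a change of one or two reads, which is one way to see the sampling sensitivity for low-abundance genes.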

Figures

Figure 1. Sources of bias when using ribosome density per mRNA (RPKM-derived TE) as a proxy for TE
(A) Sampling biases towards low-abundance genes (left), and biological biases due to paused ribosomes (right). (B) Idealized ribosome footprint distribution without biases (left), or with downstream mRNA secondary structure and low conjugate tRNA availability for the A-site codon (right). (C) Confounding effects of translation initiation and elongation on Ribo-seq profiles; figure adapted from Quax et al. (2013). Initiation rate should be proportional to actual protein yield.
Figure 2. Overview of the analysis workflow in Scikit-ribo
The complete workflow consists of ribosome A-site classifier training, A-site codon prediction and mapping, and translation efficiency inference. (A) Ribosome A-site training and prediction; gray text boxes denote the major steps. (B) Illustration of the covariates in the codon-level generalized linear model. In the model, mRNA abundance (in TPM) is treated as an offset with a fixed coefficient of one. Codon dwell time and mRNA secondary structure are covariates shared across genes. Translation efficiencies are gene-specific covariates.
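The model structure described in this caption can be illustrated with a small ridge-penalized GLM in which log(TPM) enters as an offset (coefficient fixed at one), gene indicators carry the gene-specific TE terms, and codon identity and a secondary-structure score are shared covariates. This is only a sketch under stated assumptions: the Poisson family with log link, the damped Newton solver, the penalty strength, the choice of a reference codon, and all simulated values are illustrative and are not claimed to match Scikit-ribo's implementation.

```python
import numpy as np

def fit_ridge_poisson(X, y, offset, lam=0.1, max_iter=100, tol=1e-8):
    """Damped Newton fit of log E[y] = offset + X @ beta with an L2 penalty."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    offset = np.asarray(offset, dtype=float)
    beta = np.zeros(X.shape[1])

    def objective(b):
        # Penalized negative Poisson log-likelihood (constants dropped).
        eta = offset + X @ b
        with np.errstate(over="ignore"):
            return np.sum(np.exp(eta) - y * eta) + 0.5 * lam * (b @ b)

    for _ in range(max_iter):
        mu = np.exp(offset + X @ beta)
        grad = X.T @ (mu - y) + lam * beta
        hess = X.T @ (X * mu[:, None]) + lam * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        # Halve the step until the objective does not increase.
        t, f0 = 1.0, objective(beta)
        while not objective(beta - t * step) <= f0 and t > 1e-8:
            t *= 0.5
        beta = beta - t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return beta

# Simulated toy data (hypothetical): 3 genes, 4 codon types, 200 positions each.
rng = np.random.default_rng(0)
n_genes, n_codons, n_pos = 3, 4, 200
gene = np.repeat(np.arange(n_genes), n_pos)
codon = rng.integers(0, n_codons, size=gene.size)
structure = rng.normal(size=gene.size)      # stand-in secondary-structure score
tpm = np.array([5.0, 20.0, 50.0])           # mRNA abundance per gene

G = np.eye(n_genes)[gene]                   # gene-specific TE terms
C = np.eye(n_codons)[codon][:, 1:]          # codon terms, codon 0 as reference
X = np.column_stack([G, C, structure])

beta_true = np.array([0.5, -0.5, 0.0, 0.3, -0.3, 0.6, -0.4])
offset = np.log(tpm[gene])                  # offset: coefficient fixed at one
y = rng.poisson(np.exp(offset + X @ beta_true))

beta_hat = fit_ridge_poisson(X, y, offset, lam=0.1)
log_te_hat = beta_hat[:n_genes]             # per-gene log TE estimates
```

Passing log(TPM) as an offset rather than as a fitted covariate means the gene-level coefficients are interpreted relative to mRNA abundance, which is what makes them TE-like quantities in this sketch.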
Figure 3. Accurate inference of codon elongation rates and mRNA secondary structure
(A) Codon dwell times (DT, the inverse of elongation rate) from Weinberg et al. are reproduced almost perfectly (r = 0.99). (B) Correlation with the codon's adaptiveness value (RAV, r = 0.5). (C) Correlation with tRNA abundance (r = 0.47). In A–C, the gray dashed line denotes the diagonal line, y = x. The RAV scales from 0 to 1. A codon with a lower RAV is less optimal for translation elongation, i.e. a slower codon. (D) Metagene analysis of the log ratio of adjusted DT (ADT), divided by the mean adjusted DT. The solid line denotes the average ADT in a five-codon sliding window. A log ratio greater than zero means ribosomes at this position are faster than average. The log ratios on the left were significantly higher than the ones on the right (t-test, p-value = 5 × 10^−3). Distance is measured in codons.
Figure 4. Pair-wise comparisons of estimates between Scikit-ribo and RPKM-derived TE
(A) Scatter plot of Scikit-ribo and RPKM-derived log2(TE). Difference in log2(TE): Δ log2(TE). Δ log2(TE) > 0.5, previously underestimated (green); Δ log2(TE) < −0.5, previously overestimated (orange); and other genes in between (gray). The genes with Δ log2(TE) less than −8 are indicated by triangles. (B) Histograms of Scikit-ribo and RPKM-derived log2(TE); log2(TE) values less than −10 are adjusted to −10. (C) Histograms of ribosome TPM in all genes (blue) and region 1 (green). (D) Violin plots of Δ log2(TE) by the number of stem loops. (E) Violin plots of tAI for genes in the six regions; left: Δ log2(TE) < 0, right: Δ log2(TE) > 0. (F) The Kozak consensus sequence, AAAATGTCT, found with the TE estimates from Scikit-ribo (p-value = 1 × 10^−21). The lower panel is adapted from the original paper, Hamilton et al. (1987).
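The grouping used in panel (A) can be expressed in a few lines. The sketch below assumes that Δ log2(TE) is the model-based estimate minus the RPKM-derived estimate, which is inferred from the "previously underestimated" label for positive values; the function name and the example numbers are hypothetical.

```python
import numpy as np

def classify_delta_log2_te(log2_te_model, log2_te_rpkm, threshold=0.5):
    """Label genes by the difference between two log2(TE) estimates."""
    # Assumed sign convention: model-based estimate minus RPKM-derived estimate.
    delta = np.asarray(log2_te_model, float) - np.asarray(log2_te_rpkm, float)
    labels = np.where(
        delta > threshold, "underestimated",
        np.where(delta < -threshold, "overestimated", "similar"),
    )
    return delta, labels

# Hypothetical values for three genes.
delta, labels = classify_delta_log2_te([1.0, -2.0, 0.2], [0.0, -1.0, 0.0])
```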
Figure 5. Large-scale validation with mass spectrometry data confirmed Scikit-ribo’s accurate TE estimates, especially for low-abundance genes
(A) Scikit-ribo derived protein abundance (PA) for all genes in the validation set (r = 0.81, β = 0.83). (B) Scikit-ribo derived PA for genes with TPM less than 100 (r = 0.6, β = 0.48). (C) RPKM-derived PA for all genes in the validation set (r = 0.77, β = 0.75). (D) RPKM-derived PA for genes with TPM less than 100 (r = 0.35, β = 0.29). The black dashed line denotes the identity line; y=x.
Figure 6. Practical considerations of using Scikit-ribo for Ribo-seq analysis
(A) Pearson correlations between the down-sampled data and the original data (Weinberg et al.) for log2(TE); the gray dashed horizontal line denotes Pearson r = 0.95. (B) The same down-sampling comparison for the codon relative dwell time (DT). (C) Scatter plot of log2(TE) for Ribo-seq experiments treated with cycloheximide (CHX) versus CHX-free data. (D) The same comparison for the codon relative dwell time (DT). The CHX-free data are from Weinberg et al., and the CHX-treated Ribo-seq data are from McManus et al. Both datasets are from S. cerevisiae. The black dashed line denotes the identity line, y = x.
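The down-sampling comparison in panels (A) and (B) can be mimicked on raw counts as follows. This is a simplified sketch: it thins per-gene read counts binomially and correlates log-transformed counts, whereas the figure correlates re-estimated log2(TE) and dwell times. The simulated counts, the pseudocount, and the function names are assumptions made for illustration.

```python
import numpy as np

def downsample_counts(counts, fraction, rng):
    """Keep each read independently with probability `fraction`."""
    counts = np.asarray(counts, dtype=np.int64)
    return rng.binomial(counts, fraction)

def pearson_log2(a, b, pseudocount=1.0):
    """Pearson r between log2-transformed count vectors."""
    la = np.log2(np.asarray(a, dtype=float) + pseudocount)
    lb = np.log2(np.asarray(b, dtype=float) + pseudocount)
    return float(np.corrcoef(la, lb)[0, 1])

# Simulated per-gene footprint counts with a long-tailed abundance distribution.
rng = np.random.default_rng(1)
full = rng.poisson(lam=rng.gamma(shape=0.5, scale=200.0, size=2000))

for fraction in (0.5, 0.1, 0.01):
    sub = downsample_counts(full, fraction, rng)
    print(fraction, round(pearson_log2(full, sub), 3))
```

Tracking where the correlation falls below a chosen cutoff (the figure uses r = 0.95) gives a rough sense of how much coverage a given analysis needs.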
