Nat Biotechnol. 2019 Jul;37(7):803-809.
doi: 10.1038/s41587-019-0164-5. Epub 2019 Jul 1.

Human 5' UTR design and variant effect prediction from a massively parallel translation assay

Paul J Sample et al. Nat Biotechnol. 2019 Jul.

Abstract

The ability to predict the impact of cis-regulatory sequences on gene expression would facilitate discovery in fundamental and applied biology. Here we combine polysome profiling of a library of 280,000 randomized 5' untranslated regions (UTRs) with deep learning to build a predictive model that relates human 5' UTR sequence to translation. Together with a genetic algorithm, we use the model to engineer new 5' UTRs that accurately direct specified levels of ribosome loading, providing the ability to tune sequences for optimal protein expression. We show that the same approach can be extended to chemically modified RNA, an important feature for applications in mRNA therapeutics and synthetic biology. We test 35,212 truncated human 5' UTRs and 3,577 naturally occurring variants and show that the model predicts ribosome loading of these sequences. Finally, we provide evidence of 45 single-nucleotide variants (SNVs) associated with human diseases that substantially change ribosome loading and thus may represent a molecular basis for disease.

Conflict of interest statement

Competing interests

PJS, BW, GS, and DRM declare no competing interests. DR, VP, and IM are employees and shareholders of Moderna Therapeutics.

Figures

Figure 1. A library of 280,000 random 50-mers as 5′ UTRs for eGFP.
(a) A 5′ UTR model capable of predicting translation from sequence is used to evaluate the effect of 5′ UTR SNVs and to engineer new sequences for optimal protein expression. (b) A library of 280,000 members was built by inserting a T7 promoter followed by 25 nt of defined 5′ UTR sequence, a random 50-mer, and the eGFP coding sequence into a plasmid backbone. Library IVT mRNA was produced by in vitro transcription from a linearized DNA template obtained through PCR from the plasmid library. Cells transfected with library IVT mRNA were grown for 12 hours before polysome profiling. Read counts per fraction were used to calculate the mean ribosome load (MRL) for each UTR, and the resulting data were used to train a convolutional neural network (CNN). (c) Out-of-frame upstream AUGs (uAUGs) reduce ribosome loading (vertical lines indicate positions that are in-frame with the eGFP CDS). A similar but much weaker periodicity is observed for CUG and GUG. (d) The repressive strength of all out-of-frame variations of NNNATGNN. (e) Nucleotide frequencies were calculated for the 20 most repressive (‘strong’) and least repressive (‘weak’) TIS sequences.
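The two quantities in this caption, MRL and the reading frame of an upstream AUG, can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the caption: that each polysome fraction is assigned a ribosome number, that MRL is the read-count-weighted mean of those numbers, and that the UTR string ends immediately before the CDS start codon. The function names are hypothetical and are not the authors' code.

```python
def mean_ribosome_load(read_counts, ribosomes_per_fraction):
    """Read-count-weighted mean ribosome number (assumed MRL definition)."""
    if len(read_counts) != len(ribosomes_per_fraction):
        raise ValueError("one ribosome number is needed per fraction")
    total = sum(read_counts)
    if total == 0:
        raise ValueError("no reads observed for this UTR")
    weighted = sum(c * r for c, r in zip(read_counts, ribosomes_per_fraction))
    return weighted / total


def upstream_atgs(utr):
    """List (position, in_frame) for every ATG in a 5' UTR.

    Assumes the UTR ends directly before the CDS start codon, so an
    upstream ATG is in-frame when its distance to the CDS is a multiple of 3.
    """
    utr = utr.upper().replace("U", "T")
    hits = []
    for i in range(len(utr) - 2):
        if utr[i:i + 3] == "ATG":
            distance_to_cds = len(utr) - i
            hits.append((i, distance_to_cds % 3 == 0))
    return hits


# Toy data: three fractions carrying 0, 1 and 2 ribosomes.
example_mrl = mean_ribosome_load([10, 30, 60], [0, 1, 2])
example_hits = upstream_atgs("CCATGCCC")
```

In this sketch a UTR whose reads sit mostly in heavier fractions receives a higher MRL, and the out-of-frame hits are the ones panel (c) describes as reducing ribosome loading.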
Figure 2. Modeling 5′ UTR sequences and ribosome loading.
(a) Optimus 5-Prime structure: a one-hot encoded 5′ UTR sequence is fed into a CNN composed of three convolution layers and a fully connected layer to produce a linear output predicting MRL. (b) Optimus 5-Prime trained on 260,000 UTRs and tested on 20,000 held-out sequences could explain 93% of the variability in observed MRLs. Blue dots represent sequences with a uAUG, while red dots represent sequences without a uAUG (n = 20,000). (c) A similar model was trained to predict the polysome profile distribution of an individual 5′ UTR. The observed (blue) and predicted (red) polysome distributions of five randomly picked example UTRs from the 20,000 in the test set, spanning MRLs from 4 to 8 (top to bottom), are shown. (d) The performance of the polysome profile model per fraction ranged from an r² of 0.621 to 0.915, with an average of 0.834 across all fractions (n = 20,000). (e) eGFP expression for ten UTRs selected from the library was evaluated via eGFP fluorescence using IncuCyte live cell imaging (n = 3, centers are the means, error bars are s.e.m.). Predicted MRL and fluorescence are highly correlated (r²: 0.87, n = 10). For details, see Supplementary Table 2. (f) Visualization of four out of 120 filters from the first convolution layer (left) and four out of 120 filters from the second convolution layer (right). Boxes below show the correlation (Pearson r) between filter activation and MRL at each UTR position. Filters learned important regulatory motifs such as start and stop codons, uORFs, and GC-rich regions likely involved in secondary structure formation. (g) IVT mRNA from the eGFP library was generated with pseudouridine (Ψ) or 1-methylpseudouridine (m1Ψ) in place of uridine (U) and evaluated by polysome profiling and modeling. (h) Performance (r²) of models trained and tested on different data sets. The unmodified RNA (U) models perform best with U data, while the Ψ and m1Ψ models perform equally well with Ψ and m1Ψ test data (n = 20,000). (i) Ribosome loading as a function of MFE. U is less dependent on secondary structure than Ψ and m1Ψ (Pearson r: 0.43, 0.56, and 0.58, respectively; n = 19,976).
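To make panels (a) and (f) concrete, here is a small Python sketch of one-hot encoding and of a single convolution filter scanning the encoded sequence. It is an illustration only: the column order A, C, G, T, the filter weights, the bias, the ReLU activation and the unpadded scan are all assumptions chosen for this example, not the trained Optimus 5-Prime parameters or architecture details.

```python
BASES = "ACGT"  # assumed column order for the one-hot encoding


def one_hot(seq):
    """Encode a sequence as one row of four 0/1 values per nucleotide."""
    seq = seq.upper().replace("U", "T")
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def conv1d_relu(encoded, kernel, bias=0.0):
    """Slide one filter along the sequence (no padding) and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for k in range(width):
            total += sum(x * w for x, w in zip(encoded[start + k], kernel[k]))
        activations.append(max(0.0, total))
    return activations


# Hypothetical hand-written filter that responds only to the motif ATG:
# a full match scores 3, and the bias of -2 suppresses partial matches.
ATG_FILTER = one_hot("ATG")
activations = conv1d_relu(one_hot("CCATGGC"), ATG_FILTER, bias=-2.0)
```

A filter of this kind fires only where its motif occurs, which is the sense in which learned filters can be read as detectors for start codons or other motifs at each UTR position.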
Figure 3. Design of new 5′ UTRs.
(a) Diagram of a genetic algorithm that was used in conjunction with Optimus 5-Prime to evolve sequences to target specific levels of ribosome loading. (b) Comparison between the predicted MRLs and observed MRLs for evolved 5′ UTRs for targeted ribosome loading. All 16 box plots are defined in terms of the sample size, minima, median, maxima and percentiles (Supplementary Table 3). (c) Step-wise evolution analysis. Randomly initialized UTRs were first evolved for low ribosome loading and then for high ribosome loading (selection change at dashed line). Four out of 80 examples are shown (Supplementary Fig. 11a–d). Examples on the left were permitted to have uAUGs while those on the right were not. Each unique sequence that was generated during the evolution process was synthesized and tested by polysome profiling. The original Optimus 5-Prime prediction (green) and the observed MRL eventually diverge, but the predictions from the retrained Optimus 5-Prime (red) more accurately reflect the data. (d) The original Optimus 5-Prime is retrained using sequences from the designed library with high poly-U, C, A, and G stretches, which occur rarely in the random library. (e) The accuracy of the retrained Optimus 5-Prime increased when predicting the high poly-U sequences (red) generated by the genetic algorithm (r²: 0.386 to 0.772, n = 2,146).
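The design loop in panel (a) can be sketched in Python as a simple mutate-and-select procedure. This is a hedged illustration, not the authors' algorithm: the single-point mutation scheme, the accept-only-if-closer rule, and the scoring function `toy_predict_mrl` are assumptions, and the scorer is a hypothetical stand-in for the trained model.

```python
import random

BASES = "ACGT"


def toy_predict_mrl(seq):
    """Hypothetical stand-in for a trained model; NOT Optimus 5-Prime."""
    return 7.0 - 1.5 * seq.count("ATG") + 0.02 * seq.count("A")


def evolve(seq, target, predict, steps, rng, allow_uaug=True):
    """Mutate one position at a time, keeping changes that move the
    predicted value closer to the target."""
    best = seq
    best_err = abs(predict(best) - target)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        new_base = rng.choice([b for b in BASES if b != best[pos]])
        candidate = best[:pos] + new_base + best[pos + 1:]
        # Optionally reject any candidate that contains an upstream ATG.
        if not allow_uaug and "ATG" in candidate:
            continue
        err = abs(predict(candidate) - target)
        if err < best_err:
            best, best_err = candidate, err
    return best


rng = random.Random(0)  # fixed seed so the run is reproducible
start = "".join(rng.choice(BASES) for _ in range(50))
evolved = evolve(start, target=4.0, predict=toy_predict_mrl,
                 steps=2000, rng=rng)
```

Because every accepted sequence is chosen by the predictor itself, a loop like this can drift into regions the training data covers poorly, which is consistent with the divergence between prediction and observation described in panel (c).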
Figure 4. Model performance with human 5′ UTRs and generalization to 5′ UTRs of varying length.
(a) The first 50 nucleotides preceding the CDS of 35,212 human transcripts and an additional 3,577 UTRs with SNVs (ClinVar) were evaluated using our polysome profiling method with eGFP used as the CDS. The retrained Optimus 5-Prime could explain 81.1% of the observed variation in MRL (n = 25,000). (b) The log2 change in MRL between an SNV and its common sequence was compared to the predicted change between the two (r²: 0.555, n = 1,597). SNV classification labels are from the ClinVar database. (c) In silico saturation mutagenesis and model prediction of MRL change for all 5′ UTR variants of CPOX, TMEM127 and RPL5. The three annotated ClinVar variants, rs867711777 (CPOX, G > A), rs121908813 (TMEM127, C > U), and rs376208311 (RPL5, C > A), are predicted to have the most dramatic effect on ribosome loading. (d) A library of 76,319 random 5′ UTRs with lengths varying from 25 to 100 nucleotides was used to train the generalized Optimus 5-Prime. Sequences are one-hot encoded and, if shorter than 100 nucleotides, zero padded to a length of 100. (e) 7,600 random (blue dots) and 7,600 human (red dots) sequences were tested using the generalized Optimus 5-Prime. 100 sequences of each length (25–100) are represented. Model accuracy (r²: 0.754 to 0.838) is shown for predicting MRLs across different ranges of 5′ UTR length (from left to right: n = 4,000; n = 4,000; n = 4,000; n = 3,200).
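Panels (b) to (d) describe three operations that a short Python sketch can illustrate: zero padding a one-hot encoded UTR to 100 nucleotides, enumerating every single-nucleotide variant, and scoring each variant by its log2 change in predicted MRL. Several points are assumptions: the caption does not say which end is padded (this sketch pads the 5′ side so the CDS-proximal end stays aligned), the column order is taken to be A, C, G, T, and `toy_predict_mrl` is a hypothetical scorer standing in for the trained model.

```python
import math

BASES = "ACGT"  # assumed column order


def one_hot_padded(seq, max_len=100):
    """One-hot encode and zero pad to max_len rows (5' side: an assumption)."""
    seq = seq.upper().replace("U", "T")
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    rows = [[1.0 if base == b else 0.0 for b in BASES] for base in seq]
    padding = [[0.0] * len(BASES) for _ in range(max_len - len(seq))]
    return padding + rows


def saturation_variants(seq):
    """Yield (position, ref, alt, mutant) for every single-base substitution."""
    seq = seq.upper().replace("U", "T")
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


def toy_predict_mrl(seq):
    """Hypothetical stand-in for a trained model; NOT Optimus 5-Prime."""
    return 7.0 - 1.5 * seq.count("ATG")


def log2_mrl_changes(seq, predict):
    """log2(predicted variant MRL / predicted reference MRL) per variant."""
    ref_mrl = predict(seq.upper().replace("U", "T"))
    return [(pos, ref, alt, math.log2(predict(mutant) / ref_mrl))
            for pos, ref, alt, mutant in saturation_variants(seq)]


utr = "GCCACCGTGGCC"  # toy sequence, not a real human UTR
changes = log2_mrl_changes(utr, toy_predict_mrl)
most_disruptive = min(changes, key=lambda c: c[3])
```

With this toy scorer the most disruptive substitution is the one that creates an upstream ATG; a trained model would rank variants by whatever sequence features it has learned.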
