Nat Biotechnol. 2019 Jul;37(7):803-809.

doi: 10.1038/s41587-019-0164-5. Epub 2019 Jul 1.

Human 5' UTR design and variant effect prediction from a massively parallel translation assay

Paul J Sample^#¹, Ban Wang^#¹, David W Reid², Vlad Presnyak², Iain J McFadyen², David R Morris³, Georg Seelig⁴,⁵

Affiliations

¹ Department of Electrical Engineering, University of Washington, Seattle, WA, USA.
² Moderna, Cambridge, MA, USA.
³ Department of Biochemistry, University of Washington, Seattle, WA, USA.
⁴ Department of Electrical Engineering, University of Washington, Seattle, WA, USA. gseelig@uw.edu.
⁵ Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. gseelig@uw.edu.

^# Contributed equally.

PMID: 31267113
PMCID: PMC7100133
DOI: 10.1038/s41587-019-0164-5

Paul J Sample et al. Nat Biotechnol. 2019 Jul.

Abstract

The ability to predict the impact of cis-regulatory sequences on gene expression would facilitate discovery in fundamental and applied biology. Here we combine polysome profiling of a library of 280,000 randomized 5' untranslated regions (UTRs) with deep learning to build a predictive model that relates human 5' UTR sequence to translation. Together with a genetic algorithm, we use the model to engineer new 5' UTRs that accurately direct specified levels of ribosome loading, providing the ability to tune sequences for optimal protein expression. We show that the same approach can be extended to chemically modified RNA, an important feature for applications in mRNA therapeutics and synthetic biology. We test 35,212 truncated human 5' UTRs and 3,577 naturally occurring variants and show that the model predicts ribosome loading of these sequences. Finally, we provide evidence of 45 single-nucleotide variants (SNVs) associated with human diseases that substantially change ribosome loading and thus may represent a molecular basis for disease.

Conflict of interest statement

Competing interests

PJS, BW, GS, and DRM declare no competing interests. DR, VP, and IM are employees and shareholders of Moderna Therapeutics.

Figures

**Figure 1. A library of 280,000 random 50-mers as 5′ UTRs for eGFP.**
**(a)** A 5′ UTR model capable of predicting translation from sequence is used to evaluate the effect of 5′ UTR SNVs and to engineer new sequences for optimal protein expression. **(b)** A library of 280,000 members was built by inserting a T7 promoter followed by 25 nt of defined 5′ UTR sequence, a random 50-mer, and the eGFP coding sequence into a plasmid backbone. Library IVT mRNA was produced by *in vitro* transcription from a linearized DNA template obtained through PCR from the plasmid library. Cells transfected with library IVT mRNA were grown for 12 hours before polysome profiling. Read counts per fraction were used to calculate the mean ribosome load (MRL) for each UTR, and the resulting data were used to train a convolutional neural network (CNN). **(c)** Out-of-frame upstream AUGs (uAUGs) reduce ribosome loading (vertical lines indicate positions that are in-frame with the eGFP CDS). A similar but much weaker periodicity is observed for CUG and GUG. **(d)** The repressive strength of all out-of-frame variations of NNNATGNN. **(e)** Nucleotide frequencies were calculated for the 20 most repressive (‘strong’) and 20 least repressive (‘weak’) translation initiation site (TIS) sequences.
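
The two quantities used in panels (b) and (c) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline. It assumes that MRL is the average number of ribosomes per fraction, weighted by each fraction's share of the reads, and that the random 50-mer ends immediately before the eGFP start codon. The function names, the fraction-to-ribosome mapping, and the example values are hypothetical.

```python
import numpy as np


def mean_ribosome_load(read_counts, ribosomes_per_fraction):
    """Average ribosomes per transcript, weighted by read share per fraction.

    Assumption: each polysome fraction is assigned a ribosome number and
    MRL is the read-weighted mean of those numbers.
    """
    counts = np.asarray(read_counts, dtype=float)
    ribosomes = np.asarray(ribosomes_per_fraction, dtype=float)
    if counts.shape != ribosomes.shape:
        raise ValueError("one ribosome number is needed per fraction")
    total = counts.sum()
    if total == 0:
        raise ValueError("UTR has no reads in any fraction")
    return float((counts / total) @ ribosomes)


def upstream_aug_frames(utr):
    """Return (position, in_frame) for every ATG found in the 5' UTR.

    Assumption: the UTR ends directly before the CDS start codon, so an
    upstream ATG is in-frame when its distance to the CDS is a multiple of 3.
    """
    seq = utr.upper().replace("U", "T")
    hits = []
    start = seq.find("ATG")
    while start != -1:
        in_frame = (len(seq) - start) % 3 == 0
        hits.append((start, in_frame))
        start = seq.find("ATG", start + 1)
    return hits
```

For example, `mean_ribosome_load([10, 30, 60], [0, 1, 2])` weights the ribosome numbers 0, 1 and 2 by read shares of 0.1, 0.3 and 0.6, and `upstream_aug_frames("CCAUGG")` reports one out-of-frame uAUG at position 2.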

**Figure 2. Modeling 5′ UTR sequences and ribosome loading.**
**(a)** Optimus 5-Prime structure: A one-hot encoded 5′ UTR sequence is fed into a CNN composed of three convolution layers and a fully connected layer to produce a linear output predicting MRL. **(b)** Optimus 5-Prime trained on 260,000 UTRs and tested on 20,000 held-out sequences could explain 93% of the variability in observed MRLs. Blue dots represent sequences with a uAUG while red dots represent sequences without a uAUG (n = 20,000). **(c)** A similar model was trained to predict the polysome profile distribution of an individual 5′ UTR. The observed (blue) and predicted (red) polysome distributions of five randomly picked example UTRs out of the 20,000 in the test set, spanning MRLs from 4 to 8 (top to bottom), are shown. **(d)** The performance of the polysome profile model per fraction ranged from an r² of 0.621 to 0.915, with an average of 0.834 across all fractions (n = 20,000). **(e)** Expression of ten UTRs selected from the library was evaluated via eGFP fluorescence using IncuCyte live cell imaging (n = 3, centers are the means, error bars are s.e.m.). Predicted MRL and fluorescence are highly correlated (r²: 0.87, n = 10). For details, see Supplementary Table 2. **(f)** Visualization of four out of 120 filters from the first convolution layer (left) and four out of 120 filters from the second convolution layer. Boxes below show correlation (Pearson r) between filter activation and MRL at each UTR position. Filters learned important regulatory motifs such as start and stop codons, uORFs, and GC-rich regions likely involved in secondary structure formation. **(g)** IVT mRNA from the eGFP library was generated with pseudouridine (Ψ) or 1-methylpseudouridine (m¹Ψ) in place of uridine (U) and evaluated by polysome profiling and modeling. **(h)** Performance (r²) of models trained and tested on different data sets. The unmodified RNA (U) models perform best with U data, while the Ψ and m¹Ψ models perform equally well with Ψ and m¹Ψ test data (n = 20,000). **(i)** Ribosome loading as a function of minimum free energy (MFE). U is less dependent on secondary structure than Ψ and m¹Ψ (Pearson r: 0.43, 0.56, and 0.58, respectively; n = 19,976).
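
The one-hot input described in panel (a) can be illustrated as follows. This is a sketch under stated assumptions: the channel order A, C, G, T, the mapping of U to T, and the all-zero row for any other character are choices made here for illustration and are not taken from the paper.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # assumed channel order, not specified in the caption


def one_hot_encode(utr):
    """Encode a 5' UTR as a (length, 4) matrix with one 1.0 per known base."""
    seq = utr.upper().replace("U", "T")  # treat RNA and DNA spelling alike
    encoded = np.zeros((len(seq), len(NUCLEOTIDES)), dtype=np.float32)
    for position, base in enumerate(seq):
        channel = NUCLEOTIDES.find(base)
        if channel >= 0:  # unknown symbols such as N stay all-zero
            encoded[position, channel] = 1.0
    return encoded
```

A 50-nt UTR becomes a 50 × 4 matrix, which is the kind of input a one-dimensional convolution layer can scan for motifs such as uAUGs.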

**Figure 3. Design of new 5′ UTRs.**
**(a)** Diagram of a genetic algorithm that was used in conjunction with Optimus 5-Prime to evolve sequences to target specific levels of ribosome loading. **(b)** Comparison between the predicted MRLs and observed MRLs for 5′ UTRs evolved for targeted ribosome loading. All 16 box plots are defined in terms of the sample size, minima, median, maxima and percentiles (Supplementary Table 3). **(c)** Step-wise evolution analysis. Randomly initialized UTRs were first evolved for low ribosome loading and then for high ribosome loading (selection change at dashed line). Four out of 80 examples are shown (Supplementary Fig. 11a–d). Examples on the left were permitted to have uAUGs while those on the right were not. Each unique sequence that was generated during the evolution process was synthesized and tested by polysome profiling. The original Optimus 5-Prime prediction (green) and the observed MRL eventually diverge, but the predictions from the retrained Optimus 5-Prime (red) more accurately reflect the data. **(d)** The original Optimus 5-Prime is retrained using sequences from the designed library with high poly-U, C, A, and G stretches, which occur rarely in the random library. **(e)** The accuracy of the retrained Optimus 5-Prime increased when predicting the high poly-U sequences (red) generated by the genetic algorithm (r²: 0.386 to 0.772, n = 2,146).
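
The design loop in panel (a) can be sketched as a small genetic algorithm that evolves sequences toward a target predicted MRL. This is an illustrative sketch only: `toy_predicted_mrl` is a hypothetical, arbitrary stand-in for a trained model and is not Optimus 5-Prime, and the population size, mutation rate, single-point crossover, and elitist selection are assumptions rather than the settings used in the paper.

```python
import random

BASES = "ACGT"


def toy_predicted_mrl(utr):
    """Hypothetical scoring function (NOT the real model): rises with A content."""
    return 3.0 + 5.0 * utr.count("A") / len(utr)


def mutate(utr, rng, rate=0.05):
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in utr)


def crossover(a, b, rng):
    """Single-point crossover between two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(target_mrl, predict, length=50, pop_size=40, generations=60, seed=0):
    """Evolve a UTR whose predicted MRL is as close as possible to the target."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(BASES) for _ in range(length))
                  for _ in range(pop_size)]

    def fitness(utr):
        return -abs(predict(utr) - target_mrl)

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 4]  # elitism: best quarter survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)
```

Calling `evolve(6.0, toy_predicted_mrl)` returns a 50-nt sequence whose toy score is at least as close to 6.0 as the best random starting sequence, because the top quarter of each generation is carried over unchanged. Swapping in a real predictor for `predict` is the step that would tie this loop to measured ribosome loading.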

**Figure 4. Model performance with human 5′ UTRs and generalization to 5′ UTRs of varying length.**
**(a)** The first 50 nucleotides preceding the CDS of 35,212 human transcripts and an additional 3,577 UTRs with SNVs (ClinVar) were evaluated using our polysome profiling method with eGFP used as the CDS. The retrained Optimus 5-Prime could explain 81.1% of the observed variation in MRL (n = 25,000). **(b)** The log₂ change in MRL between an SNV and its common sequence was compared to the predicted change between the two (r²: 0.555, n = 1,597). SNV classification labels are from the ClinVar database. **(c)** *In silico* saturation mutagenesis and model prediction of MRL change for all 5′ UTR variants of CPOX, TMEM127 and RPL5. The three annotated ClinVar variants, rs867711777 (CPOX, G > A), rs121908813 (TMEM127, C > U), and rs376208311 (RPL5, C > A), are predicted to have the most dramatic effect on ribosome loading. **(d)** A library of 76,319 random 5′ UTRs with lengths varying from 25 to 100 nucleotides was used to train the generalized Optimus 5-Prime. Sequences are one-hot encoded and, if shorter than 100 nucleotides, zero-padded to a length of 100. **(e)** 7,600 random (blue dots) and 7,600 human (red dots) sequences were tested using the generalized Optimus 5-Prime. 100 sequences of each length (25–100) are represented. Model accuracy (r²: 0.754 to 0.838) is shown for predicting MRLs across different 5′ UTR length ranges (from left to right: n = 4,000; n = 4,000; n = 4,000; n = 3,200).
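
Three operations named in panels (b) to (d) are simple enough to sketch directly: zero-padding a one-hot encoded UTR to 100 nucleotides, enumerating every single-nucleotide variant for *in silico* saturation mutagenesis, and computing the log₂ change in MRL. This is a minimal sketch. The caption does not say on which side the padding goes, so left padding (keeping the CDS-proximal end aligned) is an assumption here, as are the channel order and all function names.

```python
import math

import numpy as np

NUCLEOTIDES = "ACGT"  # assumed channel order
MAX_LEN = 100


def one_hot_padded(utr, max_len=MAX_LEN, pad_left=True):
    """One-hot encode a UTR into a fixed (max_len, 4) matrix padded with zeros.

    Assumption: padding is added on the 5' side unless pad_left is False.
    """
    seq = utr.upper().replace("U", "T")
    if len(seq) > max_len:
        raise ValueError("sequence is longer than the padded length")
    encoded = np.zeros((max_len, len(NUCLEOTIDES)), dtype=np.float32)
    offset = max_len - len(seq) if pad_left else 0
    for position, base in enumerate(seq):
        channel = NUCLEOTIDES.find(base)
        if channel >= 0:
            encoded[offset + position, channel] = 1.0
    return encoded


def single_nucleotide_variants(utr):
    """Yield (position, ref, alt, variant_sequence) for every possible SNV."""
    seq = utr.upper().replace("U", "T")
    for position, ref in enumerate(seq):
        for alt in NUCLEOTIDES:
            if alt != ref:
                yield position, ref, alt, seq[:position] + alt + seq[position + 1:]


def log2_mrl_change(variant_mrl, reference_mrl):
    """log2 ratio of variant MRL to reference MRL; negative means reduced loading."""
    return math.log2(variant_mrl / reference_mrl)
```

A 50-nt UTR yields 150 single-nucleotide variants. Scoring each one with a trained model and passing the result to `log2_mrl_change` would give a per-position effect map of the kind shown in panel (c).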
