[Preprint]. 2021 Oct 14:arXiv:2110.07531v2.

Deep learning models for predicting RNA degradation via dual crowdsourcing

Hannah K Wayment-Steele et al. ArXiv. 2021.

Update in

  • Deep learning models for predicting RNA degradation via dual crowdsourcing.
    Wayment-Steele HK, Kladwang W, Watkins AM, Kim DS, Tunguz B, Reade W, Demkin M, Romano J, Wellington-Oguri R, Nicol JJ, Gao J, Onodera K, Fujikawa K, Mao H, Vandewiele G, Tinti M, Steenwinckel B, Ito T, Noumi T, He S, Ishi K, Lee Y, Öztürk F, Chiu KY, Öztürk E, Amer K, Fares M; Eterna Participants; Das R. Nat Mach Intell. 2022;4(12):1174-1184. doi: 10.1038/s42256-022-00571-8. Epub 2022 Dec 14. PMID: 36567960.

Abstract

Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their poor thermostability, which is fundamentally constrained by the intrinsic susceptibility of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on 6,043 diverse RNA constructs of 102-130 nucleotides that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.
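The "within experimental error" figure quoted in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact criterion, so the rule used here (absolute difference no larger than the per-nucleotide error estimate) is an assumption, and the function name and the numbers are hypothetical.

```python
def fraction_within_error(predicted, measured, measurement_error):
    """Fraction of positions where |prediction - measurement| <= error.

    ASSUMPTION: "within experimental error" is taken to mean that the
    absolute difference does not exceed the per-nucleotide error estimate.
    The abstract does not spell out the criterion actually used.
    """
    if not (len(predicted) == len(measured) == len(measurement_error)):
        raise ValueError("all inputs must have the same length")
    if not predicted:
        raise ValueError("inputs must not be empty")
    hits = sum(
        1
        for p, m, e in zip(predicted, measured, measurement_error)
        if abs(p - m) <= e
    )
    return hits / len(predicted)


# Hypothetical values for four nucleotides (not data from the paper).
example_fraction = fraction_within_error(
    predicted=[0.5, 1.0, 0.2, 0.9],
    measured=[0.6, 0.4, 0.25, 1.5],
    measurement_error=[0.2, 0.1, 0.1, 0.3],
)
```

In this toy input, two of the four positions fall inside their error estimate, so the function returns 0.5.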

Conflict of interest statement

Conflict of Interest. DSK, WK, and RD hold equity in and are employees of a new venture seeking to stabilize mRNA molecules. WR and MD are employees of Kaggle.

Figures

Figure 1.
Dual-crowdsourcing setup for creating predictive models of RNA degradation. A. mRNA molecules fold into secondary structures containing unpaired regions that are prone to hydrolysis and limit therapeutic stability. B. Screenshot of the OpenVaccine Kaggle competition public leaderboard. C. Screenshot of an example construct designed by an Eterna participant in the “Roll Your Own Structure” challenge (“rainbow tetraloops 7” by Omei). D. tSNE projection of training sequences of “Roll-Your-Own-Structure” Round I; marker styles and colors indicate 150 Eterna participants. Lines indicate example short 68 nt RNA fragments. E. Timelines of the dual crowdsourced challenges. Eterna participants designed datasets that were used as training and blind test data for the Kaggle machine learning competition to predict RNA chemical mapping signal and degradation. F. Kaggle participants were given RNA sequence and structure information and asked to predict RNA degradation profiles and SHAPE reactivity. Error bars represent standard deviation.
Figure 2.
Signal-to-noise filtering and hierarchical clustering were used to filter the constructs designed by Eterna participants to create a test set of constructs that were maximally distant from other test constructs. Heatmaps show the data type “deg_Mg_pH10”.
Figure 3.
Deep learning strategies used in the competition. (A) Public test vs. private test performance of all teams in the Kaggle challenge. Black star: experimental error. Red star: DegScore baseline model. Orange star: DegScore-XGB model using DegScore featurization with XGBoost. Purple star: baseline kernel used by many top-performing teams. (B) Distance embedding used to represent nucleotide proximity to other nucleotides in secondary structure. (C) Schematic of the single neural net (NN) architecture used by the first place solution. This solution combined two sets of features into a single NN architecture, which combined elements of classic RNNs and CNNs. (D) Schematic of the full solution pipeline for the second place solution. This solution combined single-model neural networks, similar to the ones used for the first place solution, with more complex 2nd- and 3rd-level stacking using XGBoost as the higher-level learner.
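One way to picture the distance embedding of panel (B) is as a matrix of shortest-path distances on a graph whose edges are backbone neighbors and base pairs. The sketch below is an illustration under that assumption, not the competitors' implementation; it takes a dot-bracket string without pseudoknots as input, and the function names are hypothetical.

```python
from collections import deque


def parse_pairs(dot_bracket):
    """Map each paired position to its partner in a dot-bracket string."""
    stack = []
    pairs = {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            pairs[i] = j
            pairs[j] = i
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def structure_distances(dot_bracket):
    """All-pairs shortest-path distances on the secondary-structure graph.

    ASSUMPTION: proximity is measured in graph steps, where a step is either
    a move to a backbone neighbor or a jump across a base pair.
    """
    n = len(dot_bracket)
    pairs = parse_pairs(dot_bracket)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        if i > 0:
            neighbors[i].append(i - 1)
        if i < n - 1:
            neighbors[i].append(i + 1)
        if i in pairs:
            neighbors[i].append(pairs[i])
    distances = []
    for source in range(n):
        # Breadth-first search from each nucleotide.
        row = [-1] * n
        row[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if row[v] == -1:
                    row[v] = row[u] + 1
                    queue.append(v)
        distances.append(row)
    return distances


hairpin_distances = structure_distances("((...))")
```

For the toy hairpin "((...))", positions 0 and 6 are six steps apart along the backbone but only one step apart once the base pair is counted, which is the kind of proximity a sequence-only encoding would miss.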
Figure 4.
Deep-learning models can represent RNA-structure-based observables. (A) Representative structures of the best-predicted constructs for SHAPE modification (top row) and for degradation at 10 mM Mg²⁺, pH 10, 1 day, 24 °C (Deg_Mg_pH10, bottom row). (B) Predictions from the “Nullrecurrent” model (Kaggle 1st place) and experimental signal, averaged over secondary structure motifs. (C) One failure mode for prediction came from constructs whose input secondary structure features were incorrectly predicted.
Figure 5.
Kaggle models demonstrate improved performance in an independent test of degradation of full-length mRNAs. (A) Overall mRNA degradation rate from PERSIST-seq is driven by mRNA length. Kaggle models were therefore tested on their ability to predict length-averaged mRNA degradation. (B) Representative structures of two mRNAs of the same length that both encode Nanoluciferase, one with high degradation (“Yellowstone”, left) and one with low degradation (“LinearDesign-1”, right). (C) Prediction vectors were summed over nucleotides corresponding to the CDS region to compare to PERSIST-seq degradation rates, which account for degradation between two RT-PCR primers designed to capture degradation in the CDS region. (D) Length-normalized predictions from the Kaggle 1st place “Nullrecurrent” model and Kaggle 2nd place “Kazuki2” model show improved prediction over unpaired probabilities from ViennaRNA RNAfold, the DegScore linear regression model, and a version of the DegScore featurization with XGBoost training. Error bars represent standard error estimated from the PERSIST-seq experiment.
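The aggregation described in panels (C) and (D) can be sketched as follows. This is a minimal illustration, assuming a half-open CDS interval and normalization by the number of summed nucleotides; the caption does not specify which length is used for normalization, and the function names and values are hypothetical.

```python
def cds_summed_prediction(per_nucleotide, cds_start, cds_end):
    """Sum per-nucleotide predictions over the CDS, given as [cds_start, cds_end)."""
    if not 0 <= cds_start < cds_end <= len(per_nucleotide):
        raise ValueError("CDS bounds must lie within the prediction vector")
    return sum(per_nucleotide[cds_start:cds_end])


def length_normalized_prediction(per_nucleotide, cds_start, cds_end):
    """Summed CDS prediction divided by the number of nucleotides summed.

    ASSUMPTION: the normalizing length is the CDS length; the caption only
    says that predictions are length-normalized.
    """
    total = cds_summed_prediction(per_nucleotide, cds_start, cds_end)
    return total / (cds_end - cds_start)


# Hypothetical per-nucleotide degradation predictions for a 5-nt toy RNA.
toy_predictions = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_sum = cds_summed_prediction(toy_predictions, 1, 4)
toy_normalized = length_normalized_prediction(toy_predictions, 1, 4)
```

Dividing by length matters here because, as panel (A) notes, the raw degradation rate is driven by mRNA length, so unnormalized sums would mostly rank constructs by size.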
