Deep learning models for predicting RNA degradation via dual crowdsourcing

This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2021 Oct 14:arXiv:2110.07531v2.

Hannah K Wayment-Steele^{1,2}, Wipapat Kladwang^{3,2}, Andrew M Watkins^{3,2}, Do Soon Kim^{3,2}, Bojan Tunguz^{3,4}, Walter Reade⁵, Maggie Demkin⁵, Jonathan Romano^{3,2,6}, Roger Wellington-Oguri², John J Nicol², Jiayang Gao⁷, Kazuki Onodera⁸, Kazuki Fujikawa⁹, Hanfei Mao¹⁰, Gilles Vandewiele¹¹, Michele Tinti¹², Bram Steenwinckel¹¹, Takuya Ito¹³, Taiga Noumi¹⁴, Shujun He¹⁵, Keiichiro Ishi¹⁶, Youhan Lee¹⁷, Fatih Öztürk¹⁸, Anthony Chiu¹⁹, Emin Öztürk²⁰, Karim Amer²¹, Mohamed Fares²², Eterna Participants², Rhiju Das^{3,2,23}

Affiliations

¹ Department of Chemistry, Stanford University, Stanford, California 94305, USA.
² Eterna Massive Open Laboratory.
³ Department of Biochemistry, Stanford University, California 94305, USA.
⁴ NVIDIA Corporation, Santa Clara, California 95051.
⁵ Kaggle, San Francisco, California 94107.
⁶ Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, 14260, USA.
⁷ High-flyer AI, Hangzhou, Zhejiang, China, 310000.
⁸ NVIDIA Corporation, Minato-ku, Tokyo 107-0052, Japan.
⁹ DeNA, Shibuya-ku, Tokyo 150-6140, Japan.
¹⁰ Yanfu Investments, Shanghai, China, 200000.
¹¹ IDLab, Ghent University, Technologiepark-Zwijnaarde, Gent, Belgium, B-9052.
¹² College of Life Sciences, University of Dundee, Dundee DD1 4HN, United Kingdom.
¹³ Universal Knowledge Inc., Tokyo 150-0013, Japan.
¹⁴ Keyence Corporation, 1-3-14, Higashi-Nakajima, Higashi-Yodogawa-ku, Osaka, 533-8555, Japan.
¹⁵ Department of Chemical Engineering, Texas A&M University, College Station, TX 77843.
¹⁶ Rist Inc, Meguro-ku, Tokyo 153-0063, Japan.
¹⁷ Kakao Brain, Seongnam, Gyeonggi-do, Republic of Korea.
¹⁸ H2O, Istanbul, 3400, Turkey.
¹⁹ Clover Health, Hong Kong, 999077, PRC.
²⁰ Afiniti, Istanbul, 3400, Turkey.
²¹ Center for Informatics Science, Nile University, Sheikh Zayed, Giza, Egypt, 12588.
²² National Research Centre, Dokki, Cairo, Egypt, 12622.
²³ Department of Physics, Stanford University, California 94305, USA.

PMID: 34671698
PMCID: PMC8528079

Hannah K Wayment-Steele et al. ArXiv. 2021.

Update in

Deep learning models for predicting RNA degradation via dual crowdsourcing.
Wayment-Steele HK, Kladwang W, Watkins AM, Kim DS, Tunguz B, Reade W, Demkin M, Romano J, Wellington-Oguri R, Nicol JJ, Gao J, Onodera K, Fujikawa K, Mao H, Vandewiele G, Tinti M, Steenwinckel B, Ito T, Noumi T, He S, Ishi K, Lee Y, Öztürk F, Chiu KY, Öztürk E, Amer K, Fares M; Eterna Participants; Das R. Nat Mach Intell. 2022;4(12):1174-1184. doi: 10.1038/s42256-022-00571-8. Epub 2022 Dec 14. PMID: 36567960.

Abstract

Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally constrained by the intrinsic instability of RNA molecules toward a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide-resolution measurements on 6,043 diverse RNA constructs of 102-130 nucleotides that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.
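To make the "within experimental error" figure concrete, the following is a minimal sketch of how such a fraction could be computed from per-nucleotide predictions, measurements, and error estimates. The function name `fraction_within_error`, the criterion |prediction - measurement| <= error, and the toy numbers are all assumptions for illustration; the abstract does not state the exact criterion the authors applied.

```python
def fraction_within_error(predicted, measured, errors):
    """Return the fraction of positions where |predicted - measured| <= error.

    Assumption: "within experimental error" is read here as the absolute
    difference not exceeding the per-nucleotide error estimate.
    """
    if not (len(predicted) == len(measured) == len(errors)):
        raise ValueError("all inputs must have the same length")
    if not predicted:
        raise ValueError("inputs must be non-empty")
    hits = sum(
        1 for p, m, e in zip(predicted, measured, errors) if abs(p - m) <= e
    )
    return hits / len(predicted)


# Toy values invented for illustration only (not data from the study).
predicted = [0.50, 0.20, 0.90, 0.10, 0.40]
measured = [0.55, 0.40, 0.85, 0.10, 0.80]
errors = [0.10, 0.10, 0.10, 0.05, 0.20]

result = fraction_within_error(predicted, measured, errors)
print(f"{result:.0%} of positions fall within the error estimate")
```

In this toy input, three of the five positions satisfy the criterion.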

Conflict of interest statement

DSK, WK, and RD hold equity in, and are employees of, a new venture seeking to stabilize mRNA molecules. WR and MD are employees of Kaggle.

Figures

**Figure 1.**
Dual-crowdsourcing setup for creating predictive models of RNA degradation. (A) mRNA molecules fold into secondary structures containing unpaired regions that are prone to hydrolysis and limit therapeutic stability. (B) Screenshot of the OpenVaccine Kaggle competition public leaderboard. (C) Screenshot of an example construct designed by an Eterna participant in the “Roll Your Own Structure” challenge (“rainbow tetraloops 7” by Omei). (D) tSNE projection of training sequences of “Roll Your Own Structure” Round I, with marker style and color indicating 150 Eterna participants. Lines indicate example short 68 nt RNA fragments. (E) Timelines of the dual crowdsourced challenges. Eterna participants designed datasets that were used as training and blind test data for the Kaggle machine learning competition to predict RNA chemical mapping signal and degradation. (F) Kaggle participants were given RNA sequence and structure information and asked to predict RNA degradation profiles and SHAPE reactivity. Error bars represent standard deviation.

**Figure 2.**
Signal-to-noise filtering and hierarchical clustering were used to filter the constructs designed by Eterna participants, creating a test set of constructs that were maximally distant from other test constructs. Heatmaps show the data type “deg_Mg_pH10”.

**Figure 3.**
Deep learning strategies used in the competition. (A) Public test vs. private test performance of all teams in the Kaggle challenge. Black star: experimental error. Red star: DegScore baseline model. Orange star: DegScore-XGB model using DegScore featurization with XGBoost. Purple star: baseline kernel used by many top-performing teams. (B) Distance embedding used to represent nucleotide proximity to other nucleotides in secondary structure. (C) Schematic of the single neural net (NN) architecture used by the first place solution. This solution combined two sets of features into a single NN architecture, which combined elements of classic RNNs and CNNs. (D) Schematic of the full solution pipeline for the second place solution. This solution combined single-model neural networks, similar to the ones used for the first place solution, with more complex second- and third-level stacking using XGBoost as the higher-level learner.
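The caption for panel (B) does not say how the distance embedding was computed. As one hedged illustration of what "proximity in secondary structure" can mean, the sketch below treats a dot-bracket structure as a graph whose edges are backbone neighbors and base pairs, and computes shortest-path distances from one nucleotide by breadth-first search. The function names `parse_pairs` and `structure_distances` are hypothetical, and this is not presented as the featurization the competing teams used.

```python
from collections import deque


def parse_pairs(dot_bracket):
    """Map each paired position to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return partner


def structure_distances(dot_bracket, source):
    """Shortest-path distance from `source` to every nucleotide.

    Assumption: edges connect backbone neighbors (i, i+1) and base pairs,
    each with unit length.
    """
    partner = parse_pairs(dot_bracket)
    n = len(dot_bracket)
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        i = queue.popleft()
        neighbors = [i - 1, i + 1]
        if i in partner:
            neighbors.append(partner[i])
        for j in neighbors:
            if 0 <= j < n and dist[j] is None:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist


# A small hairpin: positions 0-6 and 1-5 are paired, 2-4 form the loop.
distances = structure_distances("((...))", 0)
print(distances)  # [0, 1, 2, 3, 3, 2, 1]
```

In the hairpin example, the last nucleotide is six backbone steps from the first but only one step away through the base pair, which is the kind of proximity a purely sequential index cannot express.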

**Figure 4.**
Deep-learning models can represent RNA-structure-based observables. (A) Representative structures from the best-predicted constructs from SHAPE modification (top row) and degradation at 10 mM Mg²⁺, pH 10, 1 day, 24 °C (Deg_Mg_pH10, bottom row). (B) Nullrecurrent model predictions and experimental signal, averaged over secondary structure motifs. (C) One failure mode for prediction came from constructs whose input secondary structure features were incorrectly predicted.

**Figure 5.**
Kaggle models demonstrate improved performance in an independent test of degradation of full-length mRNAs. (A) Overall mRNA degradation rate from PERSIST-seq is driven by mRNA length. Kaggle models were therefore tested in their ability to predict length-averaged mRNA degradation. (B) Representative structures of two mRNAs of the same length that both encode Nanoluciferase, one with high degradation (“Yellowstone”, left) and one with low degradation (“LinearDesign-1”, right). (C) Prediction vectors were summed over nucleotides corresponding to the CDS region to compare to PERSIST-seq degradation rates, which account for degradation between two RT-PCR primers designed to capture degradation in the CDS region. (D) Length-normalized predictions from the Kaggle 1st place “Nullrecurrent” model and Kaggle 2nd place “Kazuki2” model show improved prediction over unpaired probabilities from ViennaRNA RNAfold, the DegScore linear regression model, and a version of the DegScore featurization with XGBoost training. Error bars represent standard error estimated from the PERSIST-seq experiment.
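Panels (C) and (D) describe summing per-nucleotide predictions over the CDS region and then comparing length-normalized values. A minimal sketch of that arithmetic follows. The function names, the half-open index convention, and the choice to normalize by the number of CDS nucleotides summed are assumptions for illustration; the caption does not specify which length was used for normalization, and the numbers are toy values.

```python
def cds_degradation_score(per_nt_predictions, cds_start, cds_end):
    """Sum per-nucleotide predictions over the half-open range [cds_start, cds_end)."""
    if not 0 <= cds_start < cds_end <= len(per_nt_predictions):
        raise ValueError("CDS bounds must lie within the sequence")
    return sum(per_nt_predictions[cds_start:cds_end])


def length_normalized_score(per_nt_predictions, cds_start, cds_end):
    """Divide the summed CDS score by the number of nucleotides summed.

    Assumption: normalization uses the CDS length rather than the
    full transcript length.
    """
    total = cds_degradation_score(per_nt_predictions, cds_start, cds_end)
    return total / (cds_end - cds_start)


# Toy per-nucleotide predictions invented for illustration only.
predictions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
summed = cds_degradation_score(predictions, 1, 5)
normalized = length_normalized_score(predictions, 1, 5)
print(round(summed, 6), round(normalized, 6))
```

Normalizing by length lets constructs of different sizes be compared on the same scale, which matters here because panel (A) reports that the raw degradation rate is driven by mRNA length.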

Grants and funding

R35 GM122579/GM/NIGMS NIH HHS/United States
