Nat Commun. 2025 May 4;16(1):4155.
doi: 10.1038/s41467-025-59389-8.

Generative and predictive neural networks for the design of functional RNA molecules

Aidan T Riley et al. Nat Commun. 2025.

Abstract

RNA is a remarkably versatile molecule that has been engineered for applications in therapeutics, diagnostics, and in vivo information-processing systems. However, the complex relationship between the sequence, structure, and function of RNA often necessitates extensive experimental screening of candidate sequences. Here we present a generalized, efficient neural network architecture that utilizes the sequence and structure of RNA molecules (SANDSTORM) to inform functional predictions across a diverse range of settings. We pair these predictive models with generative adversarial RNA design networks (GARDN), allowing the generative modelling of a diverse range of functional RNA molecules with targeted experimental attributes. This approach enables the design of novel sequence candidates that outperform those encountered during training or returned by classical thermodynamic algorithms, and can be deployed using as few as 384 example sequences. SANDSTORM and GARDN thus represent powerful new predictive and generative tools for the development of RNA molecules with improved function.

Conflict of interest statement

Competing interests: A.A.G. is a co-founder of En Carta Diagnostics, Inc. The remaining authors declare no competing interests.

Figures

Fig. 1. Overview of the SANDSTORM and GARDN models.
a SANDSTORM expands upon previous sequence-to-function neural networks by incorporating both sequence and secondary structure array input channels. These paired inputs are passed through parallel convolutional stacks that form an ensemble prediction of input RNA function (see Supplementary Fig. 1a for a detailed depiction of SANDSTORM). b GARDN is a generative adversarial network architecture which accepts a random variable input (Z) and is tasked with designing realistic examples of functional RNAs (see Supplementary Fig. 1b for a detailed depiction of the GARDN generator). c A trained GARDN generator can be paired with a SANDSTORM predictive model to return realistic sequences with targeted experimental values.
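The paired inputs described in panel a can be illustrated with a minimal sketch. It assumes the sequence channel is a one-hot array and the structure channel is a pairwise matrix marking which positions could form Watson-Crick or G-U wobble pairs. The caption does not specify the encoding, so the function names (`one_hot`, `pairing_matrix`) and the pairing rule are illustrative assumptions, not the authors' implementation.

```python
# Illustrative only: these encodings are assumptions, not SANDSTORM's actual code.
ALPHABET = "ACGU"
# Canonical and wobble pairs (assumed pairing rule for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def one_hot(seq):
    """Return a len(seq) x 4 one-hot array (list of lists) for an RNA sequence."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def pairing_matrix(seq):
    """Return an L x L array with 1 where positions i and j could base-pair."""
    return [[1 if (a, b) in PAIRS else 0 for b in seq] for a in seq]


seq = "GGAUCC"
sequence_channel = one_hot(seq)          # shape: 6 x 4
structure_channel = pairing_matrix(seq)  # shape: 6 x 6
```

Two sequences with the same motifs but different pairing potential give similar local patterns in the first array and different second arrays, which is the kind of signal a structure channel could expose to a convolutional stack.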
Fig. 2. Extracting secondary structure information using a simulated dataset.
a A dataset of toehold switches and several types of decoy sequences was utilized to determine whether a CNN could differentiate sequences on a structural basis. Sequences consisted of canonical toehold switches, decoys containing only an RBS motif (RBS decoys), decoys containing only a start codon motif (AUG decoys), decoys containing both an RBS and start codon motif (RBS + AUG decoys), and decoys that adopted the canonical secondary structure but did not contain the necessary sequence motifs (binding decoys). b A CNN trained only on one-hot-encoded sequences was not able to perfectly classify the canonical toehold switches from the RBS + AUG decoys, which are only differentiable at a structural level. c The SANDSTORM CNN accepting paired sequence and structure arrays was able to identify both the sequence and structural features required to classify the simulated dataset. Source data are provided as a Source Data file.
Fig. 3. SANDSTORM models can predict the function of several classes of RNA molecules.
a SANDSTORM prediction metrics for toehold switch ON and OFF values compared to NuSpeak/STORM. b SANDSTORM predictions for 5′ UTR (untranslated region) mean ribosome loading compared to Optimus 5-prime. c SANDSTORM model predicting RBS (ribosome binding site) translation efficiency compared to SAPIENS. d SANDSTORM predictions of Cas13a collateral cleavage efficiency using guide RNA-target pairs compared to ADAPT. Bars represent means across three independent training-testing splits ± s.d. Links to comparator datasets and code repositories are available in Supplementary Data 1. Source data are provided as a Source Data file.
Fig. 4. GARDN design of ribosome binding sites.
a The nucleotide distributions of RBS (ribosome binding site) sequences returned by the GARDN model (top), the GARDN model optimized by a trained SANDSTORM-RBS predictor (middle) and the three highest-performing sequences from the experimental dataset used for training (bottom) (see Supplementary Data 2 for sequences). b GFP expression measurements in BL21 Star (DE3) E. coli as determined by flow cytometry for RBS sequences designed using the GARDN-SANDSTORM approach and the three highest-performing sequences reported in the original dataset used for training. See Methods (Flow Cytometry Analysis) and Supplemental Information for representative gating strategy. Bars represent the fluorescence measurement mean ± s.d. Post-optimized bars denoted with * indicate a statistically significant (p < 0.05) increase in fluorescence compared to all three high-throughput examples as measured by a one-sided t test. Source data are provided as a Source Data file.
Fig. 5. GARDN models can design toehold switches with realistic secondary structures.
a–f Ensemble overlays of the secondary structures of toehold switch sequences designed using a variety of algorithms and those included in the experimental training data: (a) NUPACK design algorithm applied without target RNA sequence constraints, (b) the high-throughput experimental dataset used for model training, (c) single sequence activation maximization, (d) single sequence activation maximization with starting seeds containing the constant motifs, (e) GARDN, and (f) GARDN-SANDSTORM with toehold switches optimized to have high ON values (n = 300 for a–f). Predicted structures were calculated using standard structural prediction software, with the bolded structure representing the most likely structural character at each position (unpaired for heterogeneous sequence groups where the most likely structure is not valid). g Structural agreement of GARDN-generated toehold switch sequences increases over adversarial training iterations, until converging to the experimental average. h Final structural agreement between toehold switches designed using different algorithms and the target canonical toehold switch secondary structure (quantification of a–f; see Methods). i ON-value optimization results for GARDN-designed toehold switch sequences against the SANDSTORM-toehold predictive model. 300 calls to the predictive model resulted in a 37% increase in average ON score. Source data are provided as a Source Data file.
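One simple way to picture the per-position consensus and the structural agreement score mentioned in this caption is sketched below. It assumes structures are equal-length dot-bracket strings and that agreement is the fraction of matching positions. The paper's exact quantification is described in its Methods, so both function names and this definition are assumptions made for illustration.

```python
from collections import Counter


def consensus_structure(structures):
    """Most common dot-bracket character at each position across a group.

    Note: this sketch does not check that the consensus is a valid structure.
    """
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*structures))


def structural_agreement(structure, target):
    """Fraction of positions whose dot-bracket character matches the target."""
    if len(structure) != len(target):
        raise ValueError("structures must have equal length")
    matches = sum(a == b for a, b in zip(structure, target))
    return matches / len(target)


# Made-up example structures, not data from the paper.
group = ["((..))", "((..))", "(....)"]
target = "((..))"
consensus = consensus_structure(group)
mean_agreement = sum(structural_agreement(s, target) for s in group) / len(group)
```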
Fig. 6. GARDN-designed toehold switches show improved experimental performance in E. coli.
a Individual toehold switch sequences designed by GARDN to have a high ON-state expression (see Supplementary Data 3 for sequences) show a geometric mean of 3.8-fold increase in GFP expression in the presence of their cognate trigger in E. coli after optimization by a SANDSTORM predictive model (p = 0.016, one-sided Wilcoxon ranked-sum test). b GARDN and ON-optimized GARDN-SANDSTORM toehold switches demonstrate a 1.9-fold and 4.8-fold increase in ON state fluorescence (p = 0.008, p = 0.002, one-sided Mann–Whitney U test). c Toehold switches designed to exhibit high ON/OFF ratios by the GARDN-SANDSTORM routine demonstrate a geometric mean increase in ON/OFF ratio of 3.7-fold using 300 calls to a predictive SANDSTORM model (p = 0.016, Wilcoxon ranked-sum test). d Both non-optimized GARDN sequences and GARDN-SANDSTORM optimized groups show a significant increase in performance of 3.8-fold (p = 0.002, one-sided Mann–Whitney U test) and 11.9-fold (p = 0.002, one-sided Mann–Whitney U test), respectively, compared to those designed using classic inverse design software. See Methods (Flow Cytometry Analysis) and Supplemental Information for representative flow cytometry gating strategy. Bars represent the mean ± s.d. Source data are provided as a Source Data file.
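The geometric mean fold changes reported in panels a and c summarize paired measurements taken before and after optimization. A minimal sketch of that calculation follows; the function name and the example values are made up for illustration and are not data from the paper.

```python
from statistics import geometric_mean


def geometric_mean_fold_change(before, after):
    """Geometric mean of per-sequence fold changes (after / before)."""
    if len(before) != len(after):
        raise ValueError("paired measurements must have equal length")
    return geometric_mean([a / b for a, b in zip(after, before)])


# Hypothetical fluorescence values for three paired sequences.
pre_optimization = [100.0, 200.0, 50.0]
post_optimization = [200.0, 800.0, 400.0]
fold = geometric_mean_fold_change(pre_optimization, post_optimization)
```

The geometric mean is the natural average for ratios: fold changes of 2, 4, and 8 average to 4 rather than to the arithmetic mean of about 4.67, so a single large improvement does not dominate the summary.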
Fig. 7. Screening a limited dataset of aptaswitches.
a Aptaswitches couple a toehold switch-like RNA hairpin to the conditional formation of a reporter aptamer. Fluorescence is generated in the presence of both the complementary target DNA molecule and the aptamer ligand. b A library of 384 aptaswitches was screened experimentally to evaluate the SANDSTORM model’s capacity on limited datasets. c The aptaswitch library was constructed by converting toehold switch sequences from Angenent-Mari et al. into aptaswitches and manually characterized for their ON/OFF ratios (see Supplementary Data 4 for sequences). d The performance of converted aptaswitch elements varied greatly from their original context, demonstrating the need for a domain-specific predictor to be trained on the novel dataset. Toehold switch ON/OFF ratios are reported in the normalized value range of −1 to 1 from the original high throughput dataset. e The corresponding aptaswitches for all possible tiles of the covid N gene (n = 1230) were ranked in silico using SANDSTORM predicted ON/OFF ratios. The same process was repeated, instead using the NUPACK ensemble defect of the switch complex or switch-trigger complex to identify optimal regions from the same pool of candidate tiles. f The ON/OFF ratios of the top 6 aptaswitches identified using each approach are reported in addition to the 6 sequences with the lowest predicted values identified using SANDSTORM. Bars represent the population mean ± s.d. Source data are provided as a Source Data file. Created in BioRender. Riley, A. (2025) https://BioRender.com/j80a514.
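The in silico ranking in panel e amounts to enumerating every contiguous window of a target gene and sorting the windows by a score. The sketch below shows that pattern with a toy GC-fraction scorer standing in for a trained predictor or an ensemble defect calculation. The window length, the scorer, and all function names are assumptions for illustration only.

```python
def tile_sequence(sequence, window):
    """Return every contiguous window-length tile with its start index."""
    return [(start, sequence[start:start + window])
            for start in range(len(sequence) - window + 1)]


def rank_tiles(sequence, window, score):
    """Sort tiles from highest to lowest score under a caller-supplied scorer."""
    return sorted(tile_sequence(sequence, window),
                  key=lambda tile: score(tile[1]), reverse=True)


def gc_fraction(tile):
    """Toy stand-in scorer; a trained predictor would be used in practice."""
    return sum(base in "GC" for base in tile) / len(tile)


# Made-up target sequence, far shorter than a real gene.
ranked = rank_tiles("AUGGCC", window=4, score=gc_fraction)
top_start, top_tile = ranked[0]
```

A sequence of length L yields L - window + 1 tiles, so the candidate pool grows linearly with gene length and can be scored exhaustively.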
Fig. 8. GARDN design of aptaswitches using limited data.
a The pretrained GARDN-toehold model was optimized by a SANDSTORM predictor trained on aptaswitch performance. The GARDN model generates 60-nt toehold switch hairpins from a random variable (Z), which were manually concatenated to the aptamer core as well as the necessary repeat of the (b) domain for aptamer formation during runtime. b The ON/OFF ratios of sequences designed using NUPACK, GARDN-SANDSTORM, and GARDN-SANDSTORM under an extended optimization routine (see Supplementary Data 5 for sequences). Data points are the normalized ON/OFF ratios for individual aptaswitch sequences. Bars are the population mean ± s.d. c The ON/OFF ratios of the paired pre- and post-optimization samples designed by SANDSTORM model A for 300 optimization steps. 7/12 samples show an improved post-optimization ON/OFF value. d The ON/OFF ratios for paired samples resulting from 600 optimization steps under the guidance of model A. 10/12 samples show an improved post-optimization ON/OFF value. Bars indicate the normalized ON/OFF value extrapolated from the ON and OFF state measurements demonstrated at three different aptaswitch concentrations. Source data are provided as a Source Data file.
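The generator-plus-predictor optimization loop described here can be pictured with a toy example. The sketch below swaps in a simple random-perturbation hill climb over the latent vector Z, which is not necessarily the optimization routine the authors used. `toy_generator` and `toy_predictor` are hypothetical stand-ins for trained GARDN and SANDSTORM models, and the GC-fraction objective is arbitrary.

```python
import random

ALPHABET = "ACGU"


def toy_generator(z):
    """Stand-in for a trained generator: map each latent value in [0, 1) to a base."""
    return "".join(ALPHABET[min(int(v * 4), 3)] for v in z)


def toy_predictor(seq):
    """Stand-in for a trained predictor: here, simply the GC fraction."""
    return sum(base in "GC" for base in seq) / len(seq)


def optimize_latent(z, steps, rng, step_size=0.1):
    """Hill-climb Z, keeping a perturbation only if the predicted score does not drop."""
    best_z, best_score = list(z), toy_predictor(toy_generator(z))
    for _ in range(steps):
        candidate = [min(max(v + rng.uniform(-step_size, step_size), 0.0), 0.999)
                     for v in best_z]
        score = toy_predictor(toy_generator(candidate))  # one predictor call per step
        if score >= best_score:
            best_z, best_score = candidate, score
    return best_z, best_score


rng = random.Random(0)                      # fixed seed for reproducibility
z0 = [rng.random() for _ in range(60)]      # 60 latent values -> 60-nt toy sequence
start_score = toy_predictor(toy_generator(z0))
z_opt, end_score = optimize_latent(z0, steps=300, rng=rng)
```

Because a candidate is accepted only when its predicted score is at least as high as the current best, the final score can never fall below the starting score. Whether the real sequence improves depends on how well the predictor generalizes.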
