Bioinformatics. 2024 Jun 28;40(Suppl 1):i347-i356.
doi: 10.1093/bioinformatics/btae259.

RiboDiffusion: tertiary structure-based RNA inverse folding with generative diffusion models

Han Huang et al. Bioinformatics. 2024.

Abstract

Motivation: RNA design shows growing applications in synthetic biology and therapeutics, driven by the crucial role of RNA in various biological processes. A fundamental challenge is to find functional RNA sequences that satisfy given structural constraints, known as the inverse folding problem. Computational approaches have emerged to address this problem based on secondary structures. However, designing RNA sequences directly from 3D structures is still challenging, due to the scarcity of data, the nonunique structure-sequence mapping, and the flexibility of RNA conformation.

Results: In this study, we propose RiboDiffusion, a generative diffusion model for RNA inverse folding that can learn the conditional distribution of RNA sequences given 3D backbone structures. Our model consists of a graph neural network-based structure module and a Transformer-based sequence module, which iteratively transforms random sequences into desired sequences. By tuning the sampling weight, our model allows for a trade-off between sequence recovery and diversity to explore more candidates. We split test sets based on RNA clustering with different cut-offs for sequence or structure similarity. Our model outperforms baselines in sequence recovery, with an average relative improvement of 11% for sequence similarity splits and 16% for structure similarity splits. Moreover, RiboDiffusion performs consistently well across various RNA length categories and RNA types. We also apply in silico folding to validate whether the generated sequences can fold into the given 3D RNA backbones. Our method could be a powerful tool for RNA design that explores the vast sequence space and finds novel solutions to 3D structural constraints.
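The two quantities traded off above can be illustrated with a minimal sketch. It assumes that sequence recovery is the fraction of positions where a generated sequence matches the native one, and it uses mean pairwise normalized Hamming distance as a stand-in for diversity; the abstract does not give the paper's exact diversity definition. The function names and example sequences are hypothetical.

```python
from itertools import combinations


def recovery_rate(native: str, generated: str) -> float:
    """Fraction of positions where the generated sequence matches the native one."""
    if len(native) != len(generated):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, generated))
    return matches / len(native)


def mean_pairwise_diversity(samples: list[str]) -> float:
    """Mean normalized Hamming distance over all sample pairs.

    Assumption: this is one common diversity measure, not necessarily
    the one used in the paper.
    """
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - recovery_rate(a, b) for a, b in pairs) / len(pairs)


# Hypothetical native sequence and three sampled designs of the same length.
native = "GGGAAACUCC"
samples = ["GGGAAACUCC", "GGGAUACUCC", "GCGAAACUGC"]

recoveries = [recovery_rate(native, s) for s in samples]
diversity = mean_pairwise_diversity(samples)
```

Under these assumptions, a sampler that always returns the native sequence would have recovery 1.0 and diversity 0.0, which is why the two metrics are usually reported together.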

Availability and implementation: The source code is available at https://github.com/ml4bio/RiboDiffusion.

Conflict of interest statement

None declared.

Figures

Figure 1.
Overview of RiboDiffusion for tertiary structure-based RNA inverse folding. We construct a dataset with experimentally determined RNA structures from PDB, supplemented with additional structures predicted by an RNA structure prediction model. We cluster RNAs with different cut-offs for sequence or structure similarity and build cross-splits to evaluate models. RiboDiffusion trains a neural network with a structure module and a sequence module to recover the original sequence from a noisy sequence and a coarse-grained RNA backbone extracted from the tertiary structure. RiboDiffusion then uses the trained network to iteratively refine random initial sequences until they match the target structure. We present a comprehensive evaluation and analysis of the proposed method.
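The cluster-based evaluation described in this caption can be sketched as follows. This is a simplified single hold-out split, assuming cluster labels have already been computed by some sequence or structure similarity tool; it is not the paper's exact cross-split procedure. The function name, RNA identifiers, and cluster labels are hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(
    cluster_of: dict[str, int], test_fraction: float = 0.2, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Assign whole clusters to train or test so that no cluster spans both."""
    members = defaultdict(list)
    for rna_id, cluster_id in cluster_of.items():
        members[cluster_id].append(rna_id)

    # Shuffle cluster ids (not individual RNAs) with a fixed seed.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    # Hold out at least one cluster for testing.
    n_test = max(1, round(test_fraction * len(cluster_ids)))
    test_clusters = set(cluster_ids[:n_test])

    train = sorted(r for c in cluster_ids if c not in test_clusters for r in members[c])
    test = sorted(r for c in test_clusters for r in members[c])
    return train, test


# Hypothetical cluster assignment: RNA id -> cluster id.
cluster_of = {"r1": 0, "r2": 0, "r3": 1, "r4": 2, "r5": 2, "r6": 3, "r7": 4}
train_ids, test_ids = split_by_cluster(cluster_of, test_fraction=0.4, seed=1)
```

Splitting at the cluster level rather than per RNA keeps near-duplicate sequences or structures from appearing on both sides of the split, which would otherwise inflate recovery on the test set.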
Figure 2.
Violin plots of the recovery rate distribution of each method for different types of RNA, including tRNA, rRNA, sRNA, ribozyme, snRNA, SRP RNA, hammerhead ribozyme, and pre-miRNA.
Figure 3.
Performance of RiboDiffusion on different RNA families under the cross-family setting. The average length and number of tertiary structures for each family are marked above violin plots.
Figure 4.
Analysis of RiboDiffusion. (a, b) In silico folding validation results that show the TM-score between structures predicted by RhoFold or DRFold and the given fixed RNA backbones (on Seq. 0.4 split). Native represents structures predicted from original sequences of given backbones as references, while Generated represents structures predicted from generated sequences. (c, d) Trade-offs between the diversity of generated sequences and recovery rate, as well as refolding F1-score (including models with and without augmented data). (e) Visualization of input RNA structures (pink) and predicted structures (green) of generated sequences. The generated sequences and the corresponding native sequences are shown below the structure visualization, where different nucleotide types are marked in red.
