Nucleic Acids Res. 2003 Dec 15;31(24):7280-301.
doi: 10.1093/nar/gkg938.

A statistical sampling algorithm for RNA secondary structure prediction

Ye Ding et al. Nucleic Acids Res. 2003.

Abstract

An RNA molecule, particularly a long-chain mRNA, may exist as a population of structures. Furthermore, multiple structures have been demonstrated to play important functional roles. Thus, a representation of the ensemble of probable structures is of interest. We present a statistical algorithm to sample rigorously and exactly from the Boltzmann ensemble of secondary structures. The forward step of the algorithm computes the equilibrium partition functions of RNA secondary structures with recent thermodynamic parameters. Using conditional probabilities computed with the partition functions in a recursive sampling process, the backward step of the algorithm quickly generates a statistically representative sample of structures. With cubic run time for the forward step, quadratic run time in the worst case for the sampling step, and quadratic storage, the algorithm is efficient enough for broad application. We demonstrate that, by classifying sampled structures, the algorithm enables a statistical delineation and representation of the Boltzmann ensemble. Applications of the algorithm show that alternative biological structures are revealed through sampling. Statistical sampling provides a means to estimate the probability of any structural motif, with or without constraints. For example, the algorithm enables probability profiling of single-stranded regions in RNA secondary structure. Probability profiling for specific loop types is also illustrated. By overlaying probability profiles, a mutual accessibility plot can be displayed for predicting RNA:RNA interactions. Boltzmann probability-weighted density of states and free energy distributions of sampled structures can be readily computed. We show that a sample of moderate size from the ensemble of an enormous number of possible structures is sufficient to guarantee statistical reproducibility in the estimates of typical sampling statistics.
Our applications suggest that the sampling algorithm may be well suited to prediction of mRNA structure and target accessibility. The algorithm is applicable to the rational design of small interfering RNAs (siRNAs), antisense oligonucleotides, and trans-cleaving ribozymes in gene knock-down studies.
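
The forward and backward steps described in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes a toy energy model in which every allowed base pair contributes the same hypothetical energy (`pair_energy`) and loops cost nothing, whereas the paper uses full thermodynamic parameters. The names `partition_table`, `sample_structure`, `pair_energy` and `min_loop` are illustrative only. With this toy decomposition the forward step is cubic, as in the abstract, but the inner scan makes each draw cost more than the paper's sampling step would.

```python
import math
import random

RT = 0.0019872 * 310.15  # gas constant (kcal/mol/K) times 37 C in kelvin
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_table(seq, pair_energy=-1.0, min_loop=3):
    """Forward step: Q[i][j] is the partition function of seq[i:j] (half-open).

    Toy model (an assumption): each allowed pair adds `pair_energy` kcal/mol.
    """
    n = len(seq)
    w = math.exp(-pair_energy / RT)  # Boltzmann weight of one base pair
    Q = [[1.0] * (n + 1) for _ in range(n + 1)]  # empty fragments weigh 1
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            total = Q[i][j - 1]  # case 1: last base j-1 is unpaired
            # case 2: base j-1 pairs with k, enclosing at least min_loop bases
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in PAIRS:
                    total += Q[i][k] * w * Q[k + 1][j - 1]
            Q[i][j] = total
    return Q, w


def sample_structure(seq, Q, w, rng, min_loop=3):
    """Backward step: draw one dot-bracket structure with Boltzmann probability."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq))]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop + 1:
            continue  # fragment too short to contain a pair
        r = rng.random() * Q[i][j]
        r -= Q[i][j - 1]
        if r < 0:  # chosen case: base j-1 unpaired
            stack.append((i, j - 1))
            continue
        for k in range(i, j - 1 - min_loop):
            if (seq[k], seq[j - 1]) in PAIRS:
                r -= Q[i][k] * w * Q[k + 1][j - 1]
                if r < 0:  # chosen case: k pairs with j-1
                    structure[k], structure[j - 1] = "(", ")"
                    stack.append((i, k))
                    stack.append((k + 1, j - 1))
                    break
        else:
            # Only reachable through floating-point round-off; leave j-1 unpaired.
            stack.append((i, j - 1))
    return "".join(structure)


if __name__ == "__main__":
    seq = "GGGAAAUCCCGCAAGC"  # arbitrary example sequence
    Q, w = partition_table(seq)
    rng = random.Random(0)
    for _ in range(5):
        print(sample_structure(seq, Q, w, rng))
```

Each draw conditions on the partition functions of the sub-fragments, so structures are generated independently and in proportion to their Boltzmann weights under the assumed model, without enumerating the ensemble.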

Figures

Figure 1
Flowchart for recursive sampling of an RNA secondary structure according to the Boltzmann equilibrium distribution. For the fragment from the ith base to the jth base, I = 1 for (i, j, I) if it is known that the ends form a pair, or I = 0 if this pair is unknown.
Figure 2
Two-dimensional histograms (2Dhist) for classes 1A, 1B and 1C for L.collosoma SL RNA. The 2Dhist shows the frequencies of base pairs, with nucleotide position on both axes. Within each histogram, the sizes of the solid squares are proportional to the frequencies of the base pairs. (A) Class 1A is represented by structure form 1 (20). (B) Class 1B is represented by the optimal folding from version 3.1 of mfold. (C) For structures in class 1C, the hairpin and the two helices on the top of form 1 are conserved.
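
The base pair frequencies that a 2Dhist displays can be estimated from a sample along the following lines. This is a minimal sketch that assumes sampled structures are available in dot-bracket notation; the function names `base_pairs` and `pair_frequencies` are hypothetical and not part of the authors' software.

```python
from collections import Counter


def base_pairs(dot_bracket):
    """Return the (i, j) pairs of a dot-bracket string, 1-based with i < j."""
    stack, pairs = [], []
    for pos, ch in enumerate(dot_bracket, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def pair_frequencies(sample):
    """Map each base pair to the fraction of sampled structures containing it."""
    counts = Counter()
    for structure in sample:
        counts.update(base_pairs(structure))
    return {pair: count / len(sample) for pair, count in counts.items()}


if __name__ == "__main__":
    sample = ["((...))", "(.....)", "......."]  # toy sample for illustration
    for pair, freq in sorted(pair_frequencies(sample).items()):
        print(pair, round(freq, 3))
```

Plotting each pair (i, j) as a square whose size scales with its frequency gives a histogram of the kind described in the caption; restricting `sample` to the members of one class gives the per-class version.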
Figure 3
2Dhist for classes 2A and 2B for L.collosoma SL RNA. (A) Class 2A is represented by structure form 2 (20). (B) Class 2B is for structures with an additional stem on the 5′ end of form 2.
Figure 4
The representative structures for classes 1A, 1B and 1C for L.collosoma SL RNA. (A) Structure form 1 (20) for class 1A. (B) The optimal folding by mfold for class 1B. (C) The representative for class 1C.
Figure 5
The representative structures for classes 2A and 2B for L.collosoma SL RNA. (A) Structure form 2 (20) for class 2A. (B) The representative for class 2B.
Figure 6
Bar plot comparing the probability (estimated by the frequency in a sample) of a class (open bar) with the Boltzmann probability (filled bar) for the representative structure of a class. Classes are from the structure classification for L.collosoma SL RNA.
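
For reference, the two quantities compared in this bar plot can be written in standard Boltzmann form. The symbols below ($S$ for a structure, $\Delta G(S)$ for its free energy, $Z$ for the partition function, $\mathcal{C}$ for a class and $N$ for the sample size) are notation assumed here and may differ from the paper's own.

```latex
P(S) = \frac{\exp\left(-\Delta G(S)/RT\right)}{Z},
\qquad
Z = \sum_{S'} \exp\left(-\Delta G(S')/RT\right),
\qquad
P(\mathcal{C}) = \sum_{S \in \mathcal{C}} P(S)
\approx \frac{\#\{\text{sampled structures in } \mathcal{C}\}}{N}
```

Because a class sums over all of its member structures, its probability can be much larger than the Boltzmann probability of its single representative.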
Figure 7
Alternative structures for the mRNA of the cIII gene of bacteriophage λ. The initiation codon and the Shine–Dalgarno sequence are A(0)UG(2) and U(–13)AAGGAG(–7), with nucleotide positions given in parentheses. The substructure from the 5′ end to nucleotide A(–9) is the same for structure A and structure B. (A) Structure A proposed by Altuvia et al. (8). (B) Structure B proposed by Altuvia et al. (8). (C) Structure C represents a modification of B by an additional short helix involving a part of the Shine–Dalgarno sequence.
Figure 8
Comparison of predictions by sampling and by free energy minimization. For each nucleotide position i, the probability that nucleotides i, i + 1, i + 2 and i + 3 (i.e. fragment width W = 4) are all single stranded is plotted against i. This probability is computed from a sample of 1000 structures (probability profile), from the MFE structure and from the ss-count of mfold, for the nucleotide 1–60 (A) and 1262–1322 (B) regions of the mRNA for H.sapiens γ-glutamyl hydrolase (GenBank accession No. U55206, with 66 additional nucleotides at the 5′ end).
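
The probability profile described in this caption can be estimated from a sample as sketched below. The sketch assumes the sampled structures are dot-bracket strings of equal length; `probability_profile` and `width` are illustrative names, and positions in the returned list are 0-based rather than 1-based as in the figure.

```python
def probability_profile(sample, width=4):
    """Estimate, for each start position i, the probability that the `width`
    consecutive bases starting at i are all unpaired in the ensemble."""
    if not sample:
        raise ValueError("sample must contain at least one structure")
    n = len(sample[0])
    if any(len(s) != n for s in sample):
        raise ValueError("all structures must have the same length")
    window = "." * width
    counts = [0] * max(n - width + 1, 0)
    for structure in sample:
        for i in range(len(counts)):
            if structure[i:i + width] == window:
                counts[i] += 1
    return [c / len(sample) for c in counts]


if __name__ == "__main__":
    sample = ["....((...))", "((....))..."]  # toy sample for illustration
    print(probability_profile(sample, width=4))
```

Passing a single structure, such as an MFE prediction, yields a profile made only of 0s and 1s, which is why a profile averaged over a sample can distinguish degrees of accessibility that one structure cannot.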
Figure 9
Loop profiles for E.coli tRNA(Ala). (A) Hplot displays the probability that a base lies in a hairpin loop; (B) Bplot displays the probability that a base is in a bulge loop; (C) Iplot displays the probability that a base is in an interior (internal) loop; (D) Mplot displays the probability that a base is in a multibranched loop; and (E) Extplot displays the probability that a base is in the exterior loop.
Figure 10
(A) Mutual accessibility plot obtained by overlaying probability profiles (fragment width W = 4) at the target site for a 60 nt antisense RNA (embedded in a long RNA through an expression vector) and the targeted mRNA of H.sapiens γ-glutamyl hydrolase. Both the RNA containing the 60 nt antisense insert and the entire target mRNA were folded. Fairly good mutual accessibility is predicted by the overlapping high probability region between nucleotides 730 and 750. (B) For the mRNA of H.sapiens breast cancer resistance protein (BCRP; GenBank accession No. AF098951) and a hammerhead ribozyme designed for a GUC cleavage sequence on the target, fairly good mutual accessibility for the nucleation step of antisense binding is predicted for both the target and the two binding arms of the ribozyme (W = 1 for the probability profiles).
Figure 11
For L.collosoma SL RNA, (A) Boltzmann probability-weighted density of states (BPWDOS); (B) probability of a structure having a free energy within P% of the minimum free energy; (C) probability of a structure having a free energy within a specified P-interval.
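
The quantity in panel (B) can be estimated from the free energies of sampled structures as sketched below. The sketch assumes the common convention that "within P% of the minimum free energy" means a free energy no higher than the MFE plus P% of its absolute value; the caption does not spell out the convention, and `fraction_within_percent` is a hypothetical name. The energies in the usage example are invented.

```python
def fraction_within_percent(energies, mfe, percent):
    """Fraction of sampled structures whose free energy lies within
    `percent` % of the minimum free energy `mfe` (kcal/mol)."""
    if not energies:
        raise ValueError("energies must not be empty")
    cutoff = mfe + abs(mfe) * percent / 100.0  # assumed convention
    return sum(1 for e in energies if e <= cutoff) / len(energies)


if __name__ == "__main__":
    energies = [-10.0, -9.6, -9.0, -8.0]  # invented values for illustration
    for p in (5, 10, 20):
        print(p, fraction_within_percent(energies, mfe=-10.0, percent=p))
```

The MFE is passed in separately because the minimum free energy structure need not appear in a finite sample.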
Figure 12
Statistical reproducibility is illustrated by 2Dhist for two independent samples of 1000 structures each for the mRNA of H.sapiens N-acetylglucosamine kinase. The histograms (A and B) display nearly identical patterns of base pair probabilities estimated by sampling.
Figure 13
Statistical reproducibility is illustrated by nearly complete overlapping of probability profiles for two independent samples of 1000 structures each for the mRNA of H.sapiens N-acetylglucosamine kinase. (A) Complete profiles for the entire mRNA. (B) An enlargement of the profile for the nucleotide 400–600 region.
Figure a1
Elements of RNA secondary structure: helix, hairpin loop, bulge loop, interior (internal) loop and multibranched loop.
Figure a2
In the derivation of recursions for u(i, j), mutually exclusive and exhaustive cases are enumerated by considering fragment R_ij being single stranded or the base pair r_h·r_l closest to the 5′ end of the fragment [i.e. the first (h – i) bases are single stranded]: (a) R_ij is single stranded; (b) h = i, l = j; (c) i < h < l = j; (d) h = i < l < j; and (e) i < h < l < j.

References

    1. Moore,P.B. and Steitz,T.A. (2003) The structural basis of large ribosomal subunit function. Annu. Rev. Biochem., 72, 813–850.
    2. Sprinzl,M., Horn,C., Brown,M., Ioudovitch,A. and Steinberg,S. (1998) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res., 26, 148–153.
    3. McCaskill,J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119.
    4. Bonhoeffer,S., McCaskill,J.S., Stadler,P.F. and Schuster,P. (1993) RNA multi-structure landscapes. A study based on temperature dependent partition functions. Eur. Biophys. J., 22, 13–24.
    5. Christoffersen,R.E., McSwiggen,J.A. and Konings,D. (1994) Application of computational technologies to ribozyme biotechnology products. J. Mol. Struct. (Theochem.), 311, 273–284.
