Nucleic Acids Res. 2006;34(17):e117.
doi: 10.1093/nar/gkl544. Epub 2006 Sep 20.

Using RNA secondary structures to guide sequence motif finding towards single-stranded regions

Michael Hiller et al. Nucleic Acids Res. 2006.

Abstract

RNA-binding proteins recognize RNA targets in a sequence-specific manner. Apart from the sequence, the secondary structure context of the binding site also affects the binding affinity. Binding sites are often located in single-stranded RNA regions, and it has been shown that sequestering a binding motif in a double-stranded region abolishes protein binding. Thus, it is desirable to include knowledge about RNA secondary structures when searching for the binding motif of a protein. We present MEMERIS, an approach that searches for sequence motifs in a set of RNA sequences while simultaneously integrating information about secondary structures. To abstract from specific structural elements, we precompute position-specific values measuring the single-strandedness of all substrings of an RNA sequence. These values are used as prior knowledge about the motif starts to guide the motif search. Extensive tests with artificial and biological data demonstrate that MEMERIS is able to identify motifs in single-stranded regions even if a stronger motif located in double-stranded regions exists. The discovered motif occurrences in biological datasets mostly coincide with known protein-binding sites. This algorithm can be used for finding the binding motif of single-stranded RNA-binding proteins in SELEX or other biological sequence data.

Figures

Figure 1
Effect of the pseudocount on the prior probability distribution. The figure shows a randomly chosen sequence and its optimal secondary structure (A), the EF and PU values for a motif length of 6 nt (B), and the prior probability distribution for an OOPS/ZOOPS model using EF (C) and PU values (D) with different pseudocounts. Each data point represents the value for the motif starting at the respective position. The uniform prior refers to a prior probability distribution p = 1/31 (the sequence length is 36 nt, so a 6 nt motif has 36 - 6 + 1 = 31 possible start positions).
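The role of the pseudocount can be illustrated with a minimal sketch. It assumes the prior is obtained by adding the pseudocount to each per-start single-strandedness value and normalizing over all start positions; the caption does not state the formula, so this normalization, the function name `motif_start_prior` and the example values are illustrative assumptions, not the MEMERIS implementation.

```python
def motif_start_prior(single_strandedness, pseudocount):
    """Turn per-start single-strandedness values into a prior over motif starts.

    Assumed form (not taken from the source): add the pseudocount to every
    value, then normalize so the priors sum to 1.
    """
    shifted = [value + pseudocount for value in single_strandedness]
    total = sum(shifted)
    if total <= 0:
        raise ValueError("values plus pseudocount must have a positive sum")
    return [value / total for value in shifted]


# Numbers from the caption: a 36 nt sequence and a 6 nt motif.
seq_len, motif_len = 36, 6
n_starts = seq_len - motif_len + 1          # 31 possible motif starts
uniform_prior = [1.0 / n_starts] * n_starts

# Hypothetical single-strandedness values: two accessible start positions.
values = [0.0] * n_starts
values[10] = 0.9
values[11] = 0.6

prior_small = motif_start_prior(values, pseudocount=0.01)
prior_large = motif_start_prior(values, pseudocount=10.0)
```

Under this assumed form, a small pseudocount concentrates the prior on the accessible start positions, while a large pseudocount pushes it towards the uniform prior of 1/31.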
Figure 2
Overview of the artificial test sets. (A) The figure shows an artificial sequence with a single-stranded motif (ssMotif, highlighted yellow) and a double-stranded motif (dsMotif, highlighted blue) together with its optimal secondary structure. The general scheme for constructing sequences is (i) to randomly sample an upstream and a downstream flank with a total length of 20 nt, (ii) to generate a stem of 12 bp that contains the dsMotif and (iii) to insert the ssMotif as the hairpin loop. The dsMotif can occur on either side of the stem. (B) The sequences in test sets 1–4 contain a ssMotif as well as a dsMotif. For test sets 1 and 3, we used a fixed string as the ssMotif (ACCGTA in this example, highlighted yellow) and a permutation of it as the dsMotif (TGACAC, blue). For test sets 2 and 4, these motifs are sampled from two PSPMs. In test set 3, a single mutation is introduced in 25% of the ssMotifs. (C) Test set 5 contains only one motif in double-stranded conformation (sampled from a PSPM). (D) Sequences in test set 6 contain a 12 nt motif as a fixed string where only the 6 nt in the middle of the motif (yellow) are single-stranded. (E) Sequences in test set 7 contain either a ssMotif, a dsMotif or no motif (sampled from a PSPM). (F) Test set 8 contains sequences with a ssMotif and a dsMotif, with two ssMotifs, with one ssMotif, with one dsMotif, and without a motif (sampled from a PSPM). The percentages indicate the fraction of sequences in the dataset that have the respective features.
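The construction scheme in (A) can be sketched as follows. This is only an illustration under stated assumptions: the caption does not say how the 20 nt of flank are split between the two sides or where the dsMotif sits within the 12 bp stem, so both are drawn at random here, and the helper names are hypothetical. The sketch also does not check that the resulting sequence actually folds into the intended hairpin.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement, so two stem strands can base-pair."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def random_seq(length, rng):
    """Sample a random nucleotide string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def make_test_sequence(ss_motif, ds_motif, rng, stem_len=12, flank_total=20):
    """Build flank + stem strand + hairpin loop (ssMotif) + stem strand + flank."""
    if len(ds_motif) > stem_len:
        raise ValueError("dsMotif must fit inside the stem")
    # (ii) One stem strand carries the dsMotif at a random offset (assumption).
    offset = rng.randint(0, stem_len - len(ds_motif))
    strand = random_seq(stem_len, rng)
    strand = strand[:offset] + ds_motif + strand[offset + len(ds_motif):]
    partner = reverse_complement(strand)
    # The dsMotif can occur on either side of the stem.
    if rng.random() < 0.5:
        left, right = strand, partner
    else:
        left, right = partner, strand
    # (i) Flanks with a total length of flank_total; the split is an assumption.
    upstream_len = rng.randint(0, flank_total)
    upstream = random_seq(upstream_len, rng)
    downstream = random_seq(flank_total - upstream_len, rng)
    # (iii) The ssMotif forms the hairpin loop between the two stem strands.
    return upstream + left + ss_motif + right + downstream


rng = random.Random(0)
example = make_test_sequence("ACCGTA", "TGACAC", rng)
```

With the motifs from the caption, each generated sequence is 20 + 12 + 6 + 12 = 50 nt long and contains both ACCGTA and TGACAC.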
Figure 3
Effect of varying the pseudocount. The figure shows the information content of the motif matrix found by MEMERIS in bits (black curve) and its average single-strandedness (average PU values of all motif occurrences, blue curve) for pseudocounts from 0 to 0.5 in steps of 0.01. Test set 5, which contains sequences with only one dsMotif (10.6 bits, average single-strandedness 0.003), was used. This motif is found by MEMERIS for a pseudocount greater than 0.22. In general, the lower the pseudocount, the higher the average single-strandedness.
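For readers unfamiliar with the bit values quoted in these captions, the following sketch computes the information content of a position-specific probability matrix as the relative entropy against a background distribution. The uniform background is an assumption made here for illustration; the captions do not state which background MEME or MEMERIS used, and the function name is hypothetical.

```python
import math


def information_content(pspm, background=None):
    """Sum over motif positions of sum_b p_b * log2(p_b / background_b), in bits.

    `pspm` is a list of dicts mapping each base to its probability at that
    position. The uniform default background is an assumption.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    bits = 0.0
    for column in pspm:
        for base, prob in column.items():
            if prob > 0:  # terms with zero probability contribute nothing
                bits += prob * math.log2(prob / background[base])
    return bits


def fixed_string_pspm(motif):
    """Matrix for a motif that is always the same string (no variation)."""
    return [{base: 1.0 if base == letter else 0.0 for base in "ACGT"}
            for letter in motif]


conserved_bits = information_content(fixed_string_pspm("ACCGTA"))
uninformative_bits = information_content([{b: 0.25 for b in "ACGT"}] * 6)
```

Under a uniform background, each fully conserved position contributes 2 bits, so a fixed 6 nt string reaches the maximum of 12 bits, while a matrix that matches the background contributes 0 bits. Motifs sampled from a PSPM with some variation fall in between.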
Figure 4
Comparison of MEME and MEMERIS for test set 8 (testing the TCM model). The figure shows 20 sequences that contain ssMotifs (highlighted yellow) and/or dsMotifs (highlighted light blue). The optimal structure is shown below each sequence. Red and blue bars indicate the position of the motif occurrences found by MEMERIS and MEME, respectively. MEMERIS detects all ssMotifs and no dsMotif, which gives a motif matrix with an information content of 10.4 bits, whereas MEME identifies a stronger motif (11.1 bits) but detects eight dsMotif occurrences. MEMERIS results are shown for PU values and a pseudocount of 0.01. The number of motif hits was set to 21 for MEME and MEMERIS.
Figure 5
Comparison of MEME and MEMERIS for the SELEX sequences of the Nova-1 protein. The figure shows the sequences and labels of the individual clones described in (6). The random oligonucleotides are in blue letters. The optimal secondary structure is shown below each sequence. The primer binding sites (black letters) were included in the RNA secondary structure prediction but not in the motif search. Yellow bars represent the TCAT and ACAT motifs identified in (6). Blue and green bars indicate the position of the motif hits found by MEME and MEMERIS, respectively. The motif matrix found by MEME has an information content of 7.6 bits, the MEMERIS motif matrix has 7.4 bits. MEME and MEMERIS were run with the TCM model and the number of motif hits was set to 33. MEMERIS results are shown for PU values and a pseudocount of 0.01.
Figure 6
Results of MEME and MEMERIS for the PIE Rfam (RF00460) dataset. The figure shows the consensus sequence and structure of the PIE RNA. The U1A protein binds the single-stranded sequences in the two asymmetrical internal loops in a cooperative manner (A). Using the OOPS model, MEME finds two motifs (14 and 13.3 bits, respectively) that do not overlap the real binding site (B), while MEMERIS finds the real upstream binding site exactly (11.8 bits) and the downstream site (10.5 bits) with a shift of one position (C). Since both individual binding sites are very similar, we used the TCM model to search for a motif with two occurrences in each sequence. Again, MEME finds a different motif (11.6 bits) (D), while MEMERIS detects the correct protein-binding sites (10.7 bits) (E). The known binding sites and the predicted motifs are highlighted in blue. The motif length was set to 7 nt. For MEMERIS, the PU values were used with a pseudocount of 0.01.
Figure 7
Results of MEME and MEMERIS for the TAR Rfam (RF00250) dataset. The figure shows the consensus sequence and structure of the TAR element. The hairpin loop is bound by the Tat protein (A). We searched for one binding site in each sequence (OOPS model) with MEME (B) and MEMERIS using PU values (C). MEME detects a motif (12 bits) that does not overlap the known binding site, while MEMERIS identifies the binding site, although the respective motif is noticeably weaker (10 bits). The known binding sites and the predicted motifs are highlighted in blue. The motif length was set to 6 nt. For MEMERIS, the PU values were used with a pseudocount of 0.01.
Figure 8
Results of MEME and MEMERIS for the SLDE Rfam (RF00183) dataset. The figure shows the consensus sequence and structure of the SLDE element. The hairpin loop of the essential third stem is bound by an unknown protein factor (A). MEME detects a CAG motif which does not overlap the binding site (B). In contrast, MEMERIS identifies the TAT sequence of the hairpin loop as the motif (C). Both motif matrices have an information content of 6 bits. The known binding sites and the predicted motifs are highlighted in blue. The motif length was set to 3 nt. For MEMERIS, the PU values were used with a pseudocount of 0.01.

References

    1. Mignone F., Gissi C., Liuni S., Pesole G. Untranslated regions of mRNAs. Genome Biol. 2002;3
    2. Messias A.C., Sattler M. Structural basis of single-stranded RNA recognition. Acc. Chem. Res. 2004;37:279–287.
    3. Hall K.B. RNA-protein interactions. Curr. Opin. Struct. Biol. 2002;12:283–288.
    4. Hori T., Taguchi Y., Uesugi S., Kurihara Y. The RNA ligands for mouse proline-rich RNA-binding protein (mouse Prrp) contain two consensus sequences in separate loop structure. Nucleic Acids Res. 2005;33:190–200.
    5. Thisted T., Lyakhov D.L., Liebhaber S.A. Optimized RNA targets of two closely related triple KH domain proteins, heterogeneous nuclear ribonucleoprotein K and alphaCP-2KL, suggest distinct modes of RNA recognition. J. Biol. Chem. 2001;276:17484–17496.
