Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2014 Dec 31;15(1):404.
doi: 10.1186/s12859-014-0404-0.

ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs

Affiliations

ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs

Christina Otto et al. BMC Bioinformatics. .

Abstract

Background: Identifying sequence-structure motifs common to two RNAs can speed up the comparison of structural RNAs substantially. The core algorithm of the existent approach ExpaRNA solves this problem for a priori known input structures. However, such structures are rarely known; moreover, predicting them computationally is no rescue, since single sequence structure prediction is highly unreliable.

Results: The novel algorithm ExpaRNA-P computes exactly matching sequence-structure motifs in entire Boltzmann-distributed structure ensembles of two RNAs; thereby we match and fold RNAs simultaneously, analogous to the well-known "simultaneous alignment and folding" of RNAs. While this implies much higher flexibility compared to ExpaRNA, ExpaRNA-P has the same very low complexity (quadratic in time and space), which is enabled by its novel structure ensemble-based sparsification. Furthermore, we devise a generalized chaining algorithm to compute compatible subsets of ExpaRNA-P's sequence-structure motifs. Resulting in the very fast RNA alignment approach ExpLoc-P, we utilize the best chain as anchor constraints for the sequence-structure alignment tool LocARNA. ExpLoc-P is benchmarked in several variants and versus state-of-the-art approaches. In particular, we formally introduce and evaluate strict and relaxed variants of the problem; the latter makes the approach sensitive to compensatory mutations. Across a benchmark set of typical non-coding RNAs, ExpLoc-P has similar accuracy to LocARNA but is four times faster (in both variants), while it achieves a speed-up over 30-fold for the longest benchmark sequences (≈400nt). Finally, different ExpLoc-P variants enable tailoring of the method to specific application scenarios. ExpaRNA-P and ExpLoc-P are distributed as part of the LocARNA package. The source code is freely available at http://www.bioinf.uni-freiburg.de/Software/ExpaRNA-P .

Conclusions: ExpaRNA-P's novel ensemble-based sparsification reduces its complexity to quadratic time and space. Thereby, ExpaRNA-P significantly speeds up sequence-structure alignment while maintaining the alignment quality. Different ExpaRNA-P variants support a wide range of applications.

PubMed Disclaimer

Figures

Figure 1
Figure 1
The ExpLoc-P pipeline. Using EPMs as anchor constraints to speed up RNA structure alignments.
Figure 2
Figure 2
Visualization of the pattern matching definition. Two different illustrations of the notion pattern matching are shown in (A) and (B). For the light gray pattern matching, we have ={22,33,913,1014,1115} and S={(2,112,15),(3,103,14)}. Note that the two separated regions in both sequences are connected through base pairs. Furthermore, the set of structure matches is |S={22,33,1014,1115} and the set of sequence matches is |S={913}. The pattern matching can be extended by the base pair match shown in dark gray, i.e., ={11,1216} and S=S{(1,121,16)}. (,S), the EPM shown in light gray, is a strict EPM, whereas (,S), the EPM extended by the dark gray part, is a relaxed EPM as mismatches in structure matches occur. (C) shows an example of an invalid matching. Separately, both the small and the big matched gray parts are valid EPMs, but together they do not form a valid EPM as the two individual parts are not connected.
Figure 3
Figure 3
Visualization of maximal EPMs. Matches of green bases refer to exact matches and red ones to inexact (structure) matches. (A-C) EPM A is not maximal since there exists a larger (strict) EPM (B or C). EPMs B and C can be maximal simultaneously since in each case some base matches have different parents. (A and D-F) EPM D is generated from A by appending an inexact structure match and has a lower score than A. Further extending the EPM leads to higher scores again (E and F). D is not maximal since A has the same parents and a higher score. A is not maximal because there exist (relaxed) EPMs E and F with the same parents and higher scores. Among A, D, E, and F, only F is maximal.
Figure 4
Figure 4
Recursion equations. Recursions for computing the significant strict EPMs and relaxed EPMs, respectively. These equations are visualized in Figure 5.
Figure 5
Figure 5
Recursion visualization. Visualization of the recursions to compute the matrix entries Lijkl(j,l),GAijkl(j,l),GABijkl(j,l),LRijkl(j,l),D(ij,kl),F(j,l) and the auxiliary matrix H ijkl(j ,l ).
Figure 6
Figure 6
Comparison of ExpLoc-P variants. (A) Alignment quality (SPS) vs. sequence identity. (B) Coverage vs. sequence identity. Dependencies are visualized as LOWESS curves.
Figure 7
Figure 7
Comparison with sequence-structure alignment methods. (A) Comparison of alignment quality vs. sequence identity. (B) Comparison of speedup over LocARNA vs. length (LOWESS).

Similar articles

Cited by

References

    1. The FANTOM Consortium The transcriptional landscape of the mammalian genome. Science. 2005;309(5740):1559–63. doi: 10.1126/science.1112014. - DOI - PubMed
    1. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, Gingeras TR. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005;308:1149–1154. doi: 10.1126/science.1108625. - DOI - PubMed
    1. Bertone P, Stoc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004;306:2242–2246. doi: 10.1126/science.1103388. - DOI - PubMed
    1. The ENCODE Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247. - DOI - PMC - PubMed
    1. Mattick JS, Taft RJ, Faulkner GJ. A global view of genomic information - moving beyond the gene and the master regulator. Trends Genet. 2010;26(1):21–8. doi: 10.1016/j.tig.2009.11.002. - DOI - PubMed

Publication types