Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2007 Mar 30;3(3):e56.
doi: 10.1371/journal.pcbi.0030056. Epub 2007 Feb 7.

Query-dependent banding (QDB) for faster RNA similarity searches

Affiliations

Query-dependent banding (QDB) for faster RNA similarity searches

Eric P Nawrocki et al. PLoS Comput Biol. .

Abstract

When searching sequence databases for RNAs, it is desirable to score both primary sequence and RNA secondary structure similarity. Covariance models (CMs) are probabilistic models well-suited for RNA similarity search applications. However, the computational complexity of CM dynamic programming alignment algorithms has limited their practical application. Here we describe an acceleration method called query-dependent banding (QDB), which uses the probabilistic query CM to precalculate regions of the dynamic programming lattice that have negligible probability, independently of the target database. We have implemented QDB in the freely available Infernal software package. QDB reduces the average case time complexity of CM alignment from LN(2.4) to LN(1.3) for a query RNA of N residues and a target database of L residues, resulting in a 4-fold speedup for typical RNA queries. Combined with other improvements to Infernal, including informative mixture Dirichlet priors on model parameters, benchmarks also show increased sensitivity and specificity resulting from improved parameterization.

PubMed Disclaimer

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. An Example RNA Family and Corresponding CM
(A) A toy multiple alignment of three RNA sequences, with 28 total columns, 24 of which will be modeled as consensus positions. The [structure] line annotates the consensus secondary structure: angle brackets mark base pairs, colons mark consensus single-stranded positions, and periods mark “insert” columns that will not be considered part of the consensus model because more than half the sequences in these columns contain gaps. (B) The structure of one sequence from (A), the same structure with positions numbered according to alignment columns, and the guide tree of nodes corresponding to that structure, with alignment column indices assigned to nodes (for example, node 5, a MATP match-pair node, will model the consensus base pair between columns 4 and 14). (C) The state topology of three selected nodes of the CM, for two MATP nodes and one consensus “leftwise” single residue bulge node (MATL, “match-left”). The consensus pair and singlet states (two MPs and one ML) are white, and the insertion/deletion states are gray. State transitions are indicated by arrows.
Figure 2
Figure 2. Effect of Transition Priors on Band Calculation
Predicted and actual target lengths are shown for three CMs built from alignments of five transfer RNA, 5S rRNA, and RNaseP sequences, which are about 75, 120, and 380 residues long, respectively. Solid vertical lines are histogram bars of the actual lengths of the query sequences in each alignment, corresponding with the right vertical axis labels. Dashed and dotted curves show QDB calculations for γ0(d) for the root state of each model, for uninformative versus informative Dirichlet priors, respectively. Dashed and dotted vertical lines show the band bounds [dmin(0) (left) and dmax(0) (right)] derived from the γ0(d) distributions using β = 10−7. The uninformative plus-1 prior results in consistent underprediction of target sequence lengths, with a broad distribution. The new informative priors produce tighter distributions that are centered on the actual subsequence lengths. We observe the same result for all other states (unpublished data).
Figure 3
Figure 3. Effect of Varying the β Parameter on Sensitivity, Specificity, and Speedup
Figure 4
Figure 4. CPU Time Required by CM Searches with and without QDB
The time required for searching the 1-Mb target pseudo-genome with each of the 51 benchmark models is shown as a point, plotted on a log–log graph as a function of the average length of the RNA sequences in the query alignment; open circles are without QDB, and filled circles are with QDB (with the default β = 10−7). Lines represent fits to a power law (aNb), showing that for a fixed L = 1-Mb target database size, the standard CYK algorithm empirically scales as N 2.36, and the QDB algorithm scales as N 1.32. The apparent intersection of the linear fitted lines is deceptive. At small query lengths, run time is dominated by factors other than the CM alignment computation, such as i/o. QDB searches are always faster than nonbanded searches even for synthetic tiny queries of fewer than ten nucleotides (unpublished data).
Figure 5
Figure 5. ROC Curves for the Benchmark
Plots are shown for the new Infernal 0.72 with and without QDB, for the old Infernal 0.55, and for family-pairwise searches (FPS) with BlastN.

References

    1. Lowe TM, Eddy SR. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964. - PMC - PubMed
    1. Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32:11–16. - PMC - PubMed
    1. Lowe TM, Eddy SR. A computational screen for methylation guide snoRNAs in yeast. Science. 1999;283:1168–1171. - PubMed
    1. Schattner P, Barberan-Soler S, Lowe TM. A computational screen for mammalian pseudouridylation guide H/ACA RNAs. RNA. 2006;12:15–25. - PMC - PubMed
    1. Lai EC, Tomancak P, Williams RW, Rubin GM. Computational identification of Drosophila microRNA genes. Genome Biol. 2003;4:R42. - PMC - PubMed

Publication types