Query-dependent banding (QDB) for faster RNA similarity searches

Eric P Nawrocki¹, Sean R Eddy

PMID: 17397253
PMCID: PMC1847999
DOI: 10.1371/journal.pcbi.0030056

Eric P Nawrocki et al. PLoS Comput Biol. 2007 Mar 30;3(3):e56. doi: 10.1371/journal.pcbi.0030056. Epub 2007 Feb 7.

Affiliation

¹ Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America.

Abstract

When searching sequence databases for RNAs, it is desirable to score both primary sequence and RNA secondary structure similarity. Covariance models (CMs) are probabilistic models well-suited for RNA similarity search applications. However, the computational complexity of CM dynamic programming alignment algorithms has limited their practical application. Here we describe an acceleration method called query-dependent banding (QDB), which uses the probabilistic query CM to precalculate regions of the dynamic programming lattice that have negligible probability, independently of the target database. We have implemented QDB in the freely available Infernal software package. QDB reduces the average case time complexity of CM alignment from LN^2.4 to LN^1.3 for a query RNA of N residues and a target database of L residues, resulting in a 4-fold speedup for typical RNA queries. Combined with other improvements to Infernal, including informative mixture Dirichlet priors on model parameters, benchmarks also show increased sensitivity and specificity resulting from improved parameterization.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. An Example RNA Family and Corresponding CM**
(A) A toy multiple alignment of three RNA sequences, with 28 total columns, 24 of which will be modeled as consensus positions. The [structure] line annotates the consensus secondary structure: angle brackets mark base pairs, colons mark consensus single-stranded positions, and periods mark “insert” columns that will not be considered part of the consensus model because more than half the sequences in these columns contain gaps. (B) The structure of one sequence from (A), the same structure with positions numbered according to alignment columns, and the guide tree of nodes corresponding to that structure, with alignment column indices assigned to nodes (for example, node 5, a MATP match-pair node, will model the consensus base pair between columns 4 and 14). (C) The state topology of three selected nodes of the CM, for two MATP nodes and one consensus “leftwise” single residue bulge node (MATL, “match-left”). The consensus pair and singlet states (two MPs and one ML) are white, and the insertion/deletion states are gray. State transitions are indicated by arrows.
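
The column rule described in (A) can be sketched in a few lines of Python. This is only an illustration of the stated "more than half the sequences contain gaps" rule, not Infernal's code; the function name `consensus_columns`, the gap characters, and the three-sequence toy alignment are all assumptions made for the example.

```python
def consensus_columns(alignment, gap_chars="-."):
    """Return one boolean per alignment column: True for a consensus
    column, False for an "insert" column.

    Hypothetical helper: a column is treated as an insert column when
    strictly more than half of the sequences have a gap there.
    """
    if not alignment:
        return []
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    nseq = len(alignment)
    flags = []
    for col in range(ncols):
        gaps = sum(1 for row in alignment if row[col] in gap_chars)
        # Consensus unless gaps outnumber half of the sequences.
        flags.append(2 * gaps <= nseq)
    return flags


# Made-up toy alignment (not the one drawn in the figure).
toy = [
    "GCA-UGC",
    "GC--UGC",
    "GCAAUGC",
]
flags = consensus_columns(toy)
```

Here only the fourth column has gaps in two of the three sequences, so it is the only one flagged as an insert column.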

**Figure 2. Effect of Transition Priors on Band Calculation**
Predicted and actual target lengths are shown for three CMs built from alignments of five transfer RNA, 5S rRNA, and RNaseP sequences, which are about 75, 120, and 380 residues long, respectively. Solid vertical lines are histogram bars of the actual lengths of the query sequences in each alignment, corresponding with the right vertical axis labels. Dashed and dotted curves show QDB calculations for γ₀(d) for the root state of each model, for uninformative versus informative Dirichlet priors, respectively. Dashed and dotted vertical lines show the band bounds [dmin(0) (left) and dmax(0) (right)] derived from the γ₀(d) distributions using β = 10⁻⁷. The uninformative plus-1 prior results in consistent underprediction of target sequence lengths, with a broad distribution. The new informative priors produce tighter distributions that are centered on the actual subsequence lengths. We observe the same result for all other states (unpublished data).
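
As a rough sketch of how band bounds might be read off a length distribution such as γ₀(d): the code below trims each tail until the discarded probability mass would reach β/2. Splitting β evenly between the two tails is an assumption of this sketch, as are the function name `band_bounds` and the toy distribution; the caption only states that the bounds are derived from γ₀(d) using β.

```python
def band_bounds(gamma, beta=1e-7):
    """Return (dmin, dmax) for a length distribution gamma, where
    gamma[d] is the probability of a subsequence of length d.

    Assumption of this sketch: at most beta/2 of the probability mass
    is allowed to fall outside the band on each side.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be in (0, 1)")
    total = sum(gamma)
    if total <= 0.0:
        raise ValueError("gamma must have positive total mass")
    tail = beta / 2.0

    # Walk in from the left, skipping lengths while the skipped mass
    # stays below the per-tail allowance.
    dmin, cum = 0, 0.0
    for d, p in enumerate(gamma):
        if cum + p / total >= tail:
            dmin = d
            break
        cum += p / total

    # Same walk from the right.
    dmax, cum = len(gamma) - 1, 0.0
    for d in range(len(gamma) - 1, -1, -1):
        p = gamma[d]
        if cum + p / total >= tail:
            dmax = d
            break
        cum += p / total

    return dmin, dmax


# Toy distribution over lengths 0..6 (illustrative values only).
toy_gamma = [0.0, 0.001, 0.099, 0.8, 0.099, 0.001, 0.0]
loose = band_bounds(toy_gamma, beta=0.01)
tight = band_bounds(toy_gamma)  # default beta = 1e-7
```

A larger β gives a narrower band (here lengths 2 to 4), while the much smaller default keeps every length with nonzero probability (1 to 5).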

**Figure 3. Effect of Varying the β Parameter on Sensitivity, Specificity, and Speedup**

**Figure 4. CPU Time Required by CM Searches with and without QDB**
The time required for searching the 1-Mb target pseudo-genome with each of the 51 benchmark models is shown as a point, plotted on a log–log graph as a function of the average length of the RNA sequences in the query alignment; open circles are without QDB, and filled circles are with QDB (with the default β = 10⁻⁷). Lines represent fits to a power law (*aN^b*), showing that for a fixed L = 1-Mb target database size, the standard CYK algorithm empirically scales as N^2.36, and the QDB algorithm scales as N^1.32. The apparent intersection of the linear fitted lines is deceptive. At small query lengths, run time is dominated by factors other than the CM alignment computation, such as I/O. QDB searches are always faster than nonbanded searches even for synthetic tiny queries of fewer than ten nucleotides (unpublished data).
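
A power-law fit of the form *aN^b* is a straight line on a log–log plot, so the exponent can be estimated by ordinary least squares on the logarithms. The sketch below shows that generic technique; it is not the authors' fitting procedure, and the function name `fit_power_law` and the synthetic timings are assumptions for illustration, not benchmark data.

```python
import math


def fit_power_law(ns, times):
    """Least-squares fit of times ~ a * n**b in log-log space.

    Returns (a, b). Hypothetical helper; inputs must be positive.
    """
    if len(ns) != len(times) or len(ns) < 2:
        raise ValueError("need at least two (n, time) pairs")
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("query lengths must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)   # intercept mapped back from logs
    return a, b


# Synthetic timings generated from an exact power law (not real data).
lengths = [50, 100, 200, 400]
timings = [3.0 * n ** 1.32 for n in lengths]
a, b = fit_power_law(lengths, timings)
```

On noise-free synthetic input the fit recovers the generating exponent; with real timings, small-N points dominated by overhead such as I/O would pull the fitted line away from the asymptotic slope, which is the caveat the caption raises.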

**Figure 5. ROC Curves for the Benchmark**
Plots are shown for the new Infernal 0.72 with and without QDB, for the old Infernal 0.55, and for family-pairwise searches (FPS) with BlastN.

