An enhanced RNA alignment benchmark for sequence alignment programs

Andreas Wilm¹, Indra Mainz, Gerhard Steger

Affiliation

¹ Institut für Physikalische Biologie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany. wilm@biophys.uni-duesseldorf.de

PMID: 17062125
PMCID: PMC1635699
DOI: 10.1186/1748-7188-1-19

Algorithms Mol Biol. 2006 Oct 24;1:19.

Abstract

Background: The performance of alignment programs is traditionally tested on sets of protein sequences for which a reference alignment is known. Conclusions drawn from such protein benchmarks do not necessarily hold for the RNA alignment problem, as the first published RNA alignment benchmark demonstrated. For example, the twilight zone, the similarity range where alignment quality drops drastically, starts at 60 % for RNAs compared with 20 % for proteins. In this study we enhance that previous benchmark.

Results: The RNA sequence sets in the benchmark database are taken from an increased number of RNA families to avoid the bias that results from using only a few families. Set size varies from 2 to 15 sequences to assess the influence of the number of sequences on program performance. Alignment quality is scored by two measures: one takes into account only nucleotide matches, the other measures structural conservation. The performance order of parameters (such as nucleotide substitution matrices and gap costs) and of programs is rated by rank tests.
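The abstract does not say which rank tests were used, so the following is only a minimal sketch of the ranking step that such tests typically build on: each program is ranked within every alignment (rank 1 = best score, ties share the average rank) and the ranks are averaged per program. The function name `average_ranks` and the program labels and scores in the usage example are hypothetical.

```python
def average_ranks(scores):
    """Return the mean rank of each program over all alignments.

    scores: dict mapping a program name to a list of quality scores,
            one per alignment, where a higher score is better.
    Rank 1 is the best program on an alignment; tied programs share
    the average of the ranks they occupy.
    """
    programs = list(scores)
    n = len(scores[programs[0]])
    if n == 0 or any(len(scores[p]) != n for p in programs):
        raise ValueError("every program needs one score per alignment")

    totals = {p: 0.0 for p in programs}
    for i in range(n):
        # Order programs from best to worst on alignment i.
        ordered = sorted(programs, key=lambda p: scores[p][i], reverse=True)
        pos = 0
        while pos < len(ordered):
            # Extend 'end' over the run of programs tied with ordered[pos].
            end = pos
            while (end + 1 < len(ordered)
                   and scores[ordered[end + 1]][i] == scores[ordered[pos]][i]):
                end += 1
            shared_rank = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
            for p in ordered[pos:end + 1]:
                totals[p] += shared_rank
            pos = end + 1
    return {p: totals[p] / n for p in programs}


# Hypothetical scores for three programs on three alignments.
example = {
    "prog_a": [0.9, 0.8, 0.7],
    "prog_b": [0.5, 0.8, 0.6],
    "prog_c": [0.1, 0.2, 0.9],
}
mean_ranks = average_ranks(example)
```

A lower mean rank indicates a program that tends to score better across the sets; a significance test on these ranks would be a separate step.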

Conclusion: Most sequence alignment programs perform equally well on RNA sequence sets with high sequence identity, that is, with an average pairwise sequence identity (APSI) above 75 %. Parameters for gap-open and gap-extension have a large influence on alignment quality at APSI ≤ 75 %; optimal parameter combinations are shown for several programs. The use of different 4 × 4 substitution matrices improved program performance only in some cases. The performance of iterative programs increases drastically with increasing sequence numbers and/or decreasing sequence identity, which makes them clearly superior to programs using a purely non-iterative, progressive approach. The best sequence alignment programs produce alignments of high quality for APSI > 55 %; at lower APSI the use of sequence+structure alignment programs is recommended.
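As an illustration of the APSI measure used above, here is a minimal sketch that averages the percent identity over all sequence pairs of an alignment. The abstract does not define how pairwise identity is normalized; this sketch assumes identical columns divided by the columns in which at least one of the two sequences has a residue, which is only one of several conventions in use. The function names are hypothetical.

```python
from itertools import combinations

GAP_CHARS = "-."  # characters treated as gaps (an assumption)


def pairwise_identity(a, b):
    """Fraction of identical columns between two aligned sequences.

    Columns where both sequences have a gap are ignored; a residue
    opposite a gap counts as a mismatch (assumed normalization).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS and y in GAP_CHARS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def apsi(alignment):
    """Average pairwise sequence identity of an alignment, in percent."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return 100.0 * sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment of three sequences, four columns each.
toy_apsi = apsi(["ACGU", "ACGA", "ACCA"])
```

With this toy input the three pairwise identities are 3/4, 2/4 and 3/4, so the APSI is about 66.7 %, which would fall below the 75 % threshold discussed above.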

Figures

**Figure 1**
**MAFFT (FFT-NS-2) and ClustalW performance with optimized and old parameters**. PROALIGN (earlier identified to be a good aligner [22]) is included as a reference. Performance is measured as BRALISCORE vs. reference APSI and exemplified for k = 5 sequences. MAFFT version 5.667 was used with optimized parameters, which are default in version 5.667, and with (old) parameters of version 4, respectively; CLUSTALW was used either with default parameters or with optimized parameters (see Table 2 and text).

**Figure 2**
**Performance of Prrn compared to ClustalW as a function of the number of sequences per alignment**. The plot shows the difference between the scores of PRRN, as a representative of an iterative alignment approach, and CLUSTALW (standard options), as a representative of a progressive approach.

**Figure 3**
**Lowess smoothing**. The plot shows the scattered data points, each corresponding to one alignment, exemplified by the performance of PROALIGN with k = 7 sequences per alignment. The curve is the result of a lowess smoothing with a smoothing factor of 0.3.
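The smoothing described in this caption can be sketched as follows. This is a simplified, dependency-free version of lowess: a local linear fit with tricube weights, without the robustness iterations of the full algorithm. It assumes that the "smoothing factor of 0.3" is the fraction of data points used for each local fit; the function name and the sample data are hypothetical.

```python
import math


def lowess(xs, ys, frac=0.3):
    """Smooth ys over xs with locally weighted linear regression.

    frac is the fraction of points used in each local fit (assumed to
    correspond to the smoothing factor). Returns one fitted value per x.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more (x, y) pairs of equal length")
    k = max(2, min(n, math.ceil(frac * n)))  # neighbours per local fit

    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th nearest point
        if h > 0:
            # Tricube weights; points at or beyond the bandwidth get weight 0.
            ws = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        else:
            ws = [1.0 if d == 0 else 0.0 for d in dists]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        # Fall back to the weighted mean if the local x values do not vary.
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical scatter: score (y) against reference APSI in percent (x).
apsi_values = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
score_values = [0.31, 0.28, 0.42, 0.55, 0.61, 0.70, 0.78, 0.85, 0.88, 0.93, 0.95, 0.97]
smoothed = lowess(apsi_values, score_values, frac=0.3)
```

A larger `frac` produces a smoother curve at the cost of local detail; library implementations of lowess add robustness steps that down-weight outliers.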

References

1. Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J Appl Math. 1985;45:810–825. doi: 10.1137/0145048.
2. Mathews DH. Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics. 2005;21:2246–2253. doi: 10.1093/bioinformatics/bti349.
3. Havgaard JH, Lyngso RB, Stormo GD, Gorodkin J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40 %. Bioinformatics. 2005;21:1815–1824. doi: 10.1093/bioinformatics/bti279.
4. Hofacker IL, Bernhart SHF, Stadler PF. Alignment of RNA base pairing probability matrices. Bioinformatics. 2004;20:2222–2227. doi: 10.1093/bioinformatics/bth229.
5. Holmes I. Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics. 2005;6:73. doi: 10.1186/1471-2105-6-73.
