. 2017 Jul 15;33(14):2089-2096.

doi: 10.1093/bioinformatics/btx114.

RNAscClust: clustering RNA sequences using structure conservation and graph based motifs

Milad Miladi¹, Alexander Junge^{2

3}, Fabrizio Costa¹, Stefan E Seemann^{2

3}, Jakob Hull Havgaard^{2

3}, Jan Gorodkin^{2

3}, Rolf Backofen^{1

2

4}

Affiliations

¹ Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany.
² Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.
³ Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg, Denmark.
⁴ Center for Biological Signalling Studies (BIOSS), Cluster of Excellence, University of Freiburg, Freiburg im Breisgau, Germany.

PMID: 28334186
PMCID: PMC5870858
DOI: 10.1093/bioinformatics/btx114

RNAscClust: clustering RNA sequences using structure conservation and graph based motifs

Milad Miladi et al. Bioinformatics. 2017.

. 2017 Jul 15;33(14):2089-2096.

doi: 10.1093/bioinformatics/btx114.

Authors

Milad Miladi¹, Alexander Junge^{2

3}, Fabrizio Costa¹, Stefan E Seemann^{2

3}, Jakob Hull Havgaard^{2

3}, Jan Gorodkin^{2

3}, Rolf Backofen^{1

2

4}

Affiliations

¹ Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany.
² Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.
³ Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg, Denmark.
⁴ Center for Biological Signalling Studies (BIOSS), Cluster of Excellence, University of Freiburg, Freiburg im Breisgau, Germany.

PMID: 28334186
PMCID: PMC5870858
DOI: 10.1093/bioinformatics/btx114

Abstract

Motivation: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs, do not take the compensatory base pair changes obtained from structure conservation in orthologous sequences into account.

Results: Here, we present RNAscClust , the implementation of a new algorithm to cluster a set of structured RNAs taking their respective structural conservation into account. For a set of multiple structural alignments of RNA sequences, each containing a paralog sequence included in a structural alignment of its orthologs, RNAscClust computes minimum free-energy structures for each sequence using conserved base pairs as prior information for the folding. The paralogs are then clustered using a graph kernel-based strategy, which identifies common structural features. We show that the clustering accuracy clearly benefits from an increasing degree of compensatory base pair changes in the alignments.

Availability and implementation: RNAscClust is available at http://www.bioinf.uni-freiburg.de/Software/RNAscClust .

Contact: gorodkin@rth.dk or backofen@informatik.uni-freiburg.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

**Fig. 1.**
Hypothetical example to illustrate the difference between single sequence clustering and clustering using conserved structure. Assume the indicated G-U base pair between the first and last nucleotide in the left-most blue human sequence is part of its correct secondary structure. While single sequence structure prediction (a) fails to predict the G-U pair, information about covariation contained in the alignment (b) yields the correct secondary structure for the human sequence and allows to emphasize covarying base pairs. Taking covariation and conserved structures into account may thus yield an improved clustering

**Fig. 2.**
Representing the constrained folded secondary structure as a graph and feature extraction. (a) Base pairs with a reliability greater than t are set as structure constraints (blue boxes) derived from the alignment consensus structure. A constrained secondary structure prediction is performed for the human sequence, the organism of interest in this example. Plain and, if enough covarying base pairs are found, decorated secondary structure are represented as graphs. (b) Auxiliary vertices (gray) are added to the secondary structure graph to emphasize stacked base pairs. The secondary structure is decomposed into substructures using a graph kernel. Here, only neighborhood subgraphs for $N_{1}^{v}$ and $v = 1, \dots, 6$ are shown and d = 0 which results in the extraction of single root vertices instead of root vertex pairs. The hashing function H encodes each subgraph as an integer which in turn becomes the index of the subgraph in the sparse feature vector counting subgraph occurrences. Since , the feature is counted twice while the other neighborhood subgraphs are unique. The feature extraction for N-N decorated structures is implemented the same way (Color version of this figure is available at *Bioinformatics* online.)

formula image — **Fig. 2.**
Representing the constrained folded secondary structure as a graph and feature extraction. (a) Base pairs with a reliability greater than t are set as structure constraints (blue boxes) derived from the alignment consensus structure. A constrained secondary structure prediction is performed for the human sequence, the organism of interest in this example. Plain and, if enough covarying base pairs are found, decorated secondary structure are represented as graphs. (b) Auxiliary vertices (gray) are added to the secondary structure graph to emphasize stacked base pairs. The secondary structure is decomposed into substructures using a graph kernel. Here, only neighborhood subgraphs for $N_{1}^{v}$ and $v = 1, \dots, 6$ are shown and d = 0 which results in the extraction of single root vertices instead of root vertex pairs. The hashing function H encodes each subgraph as an integer which in turn becomes the index of the subgraph in the sparse feature vector counting subgraph occurrences. Since , the feature is counted twice while the other neighborhood subgraphs are unique. The feature extraction for N-N decorated structures is implemented the same way (Color version of this figure is available at *Bioinformatics* online.)

**Fig. 3.**
An overview of RNAscClust with steps executed in parallel shown as stacks. The sequence of interest in each alignment is constraint folded to generate a conservation aware secondary structure (steps 1 & 2). The resulting secondary structure graph is transformed into a feature vector in step 3. Using local sensitivity hashing (step 4), candidate clusters are extracted in the fifth step. Then a series of post-processing steps as implemented in GraphClust are invoked (steps 6–8): sequences of each cluster are aligned using LocARNA and only well aligning sequences are retained. A covariance model is generated with Infernal to extend clusters with sequences matching the model. Steps 6–8 are repeated until all candidate clusters are processed

**Fig. 4.**
RNAscClust and GraphClust clustering performances, measured by the Adjusted Rand Index, depending on the mean of the alignment-wise mean pairwise sequence identity (mean PSI) of the *Rfam-cliques Low*, *Medium* and *High* (A) as well as *Rfam-ome* (B) benchmark sets

See this image and copyright information in PMC

References

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. - PMC - PubMed
1. Backofen R., Hess W.R. (2010) Computational prediction of sRNAs and their targets in bacteria. RNA Biol., 7, 33–42. - PubMed
1. Broder A.Z. (1997). On the resemblance and containment of documents. In: Compression and Complexity of Sequences 1997 (Proceedings), pp. 21–29.
1. Costa F., De Grave K. (2010). Fast neighborhood subgraph pairwise distance kernel. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, pp. 255–262. Omnipress.
1. Fu Y. et al. (2014) Dynalign II: common secondary structure prediction for RNA homologs with domain insertions. Nucleic Acids Res., 42, 13939–13948. - PMC - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

RNAscClust: clustering RNA sequences using structure conservation and graph based motifs

Affiliations

RNAscClust: clustering RNA sequences using structure conservation and graph based motifs

Authors

Affiliations

Abstract

Figures

References

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Other Literature Sources