Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2017 Jul 15;33(14):2089-2096.
doi: 10.1093/bioinformatics/btx114.

RNAscClust: clustering RNA sequences using structure conservation and graph based motifs

Affiliations

RNAscClust: clustering RNA sequences using structure conservation and graph based motifs

Milad Miladi et al. Bioinformatics. .

Abstract

Motivation: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs, do not take the compensatory base pair changes obtained from structure conservation in orthologous sequences into account.

Results: Here, we present RNAscClust , the implementation of a new algorithm to cluster a set of structured RNAs taking their respective structural conservation into account. For a set of multiple structural alignments of RNA sequences, each containing a paralog sequence included in a structural alignment of its orthologs, RNAscClust computes minimum free-energy structures for each sequence using conserved base pairs as prior information for the folding. The paralogs are then clustered using a graph kernel-based strategy, which identifies common structural features. We show that the clustering accuracy clearly benefits from an increasing degree of compensatory base pair changes in the alignments.

Availability and implementation: RNAscClust is available at http://www.bioinf.uni-freiburg.de/Software/RNAscClust .

Contact: gorodkin@rth.dk or backofen@informatik.uni-freiburg.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Hypothetical example to illustrate the difference between single sequence clustering and clustering using conserved structure. Assume the indicated G-U base pair between the first and last nucleotide in the left-most blue human sequence is part of its correct secondary structure. While single sequence structure prediction (a) fails to predict the G-U pair, information about covariation contained in the alignment (b) yields the correct secondary structure for the human sequence and allows to emphasize covarying base pairs. Taking covariation and conserved structures into account may thus yield an improved clustering
Fig. 2.
Fig. 2.
Representing the constrained folded secondary structure as a graph and feature extraction. (a) Base pairs with a reliability greater than t are set as structure constraints (blue boxes) derived from the alignment consensus structure. A constrained secondary structure prediction is performed for the human sequence, the organism of interest in this example. Plain and, if enough covarying base pairs are found, decorated secondary structure are represented as graphs. (b) Auxiliary vertices (gray) are added to the secondary structure graph to emphasize stacked base pairs. The secondary structure is decomposed into substructures using a graph kernel. Here, only neighborhood subgraphs for N1v and v=1,,6 are shown and d = 0 which results in the extraction of single root vertices instead of root vertex pairs. The hashing function H encodes each subgraph as an integer which in turn becomes the index of the subgraph in the sparse feature vector counting subgraph occurrences. Since formula image, the feature is counted twice while the other neighborhood subgraphs are unique. The feature extraction for N-N decorated structures is implemented the same way (Color version of this figure is available at Bioinformatics online.)
Fig. 3.
Fig. 3.
An overview of RNAscClust with steps executed in parallel shown as stacks. The sequence of interest in each alignment is constraint folded to generate a conservation aware secondary structure (steps 1 & 2). The resulting secondary structure graph is transformed into a feature vector in step 3. Using local sensitivity hashing (step 4), candidate clusters are extracted in the fifth step. Then a series of post-processing steps as implemented in GraphClust are invoked (steps 6–8): sequences of each cluster are aligned using LocARNA and only well aligning sequences are retained. A covariance model is generated with Infernal to extend clusters with sequences matching the model. Steps 6–8 are repeated until all candidate clusters are processed
Fig. 4.
Fig. 4.
RNAscClust and GraphClust clustering performances, measured by the Adjusted Rand Index, depending on the mean of the alignment-wise mean pairwise sequence identity (mean PSI) of the Rfam-cliques Low, Medium and High (A) as well as Rfam-ome (B) benchmark sets

Similar articles

Cited by

References

    1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. - PMC - PubMed
    1. Backofen R., Hess W.R. (2010) Computational prediction of sRNAs and their targets in bacteria. RNA Biol., 7, 33–42. - PubMed
    1. Broder A.Z. (1997). On the resemblance and containment of documents. In: Compression and Complexity of Sequences 1997 (Proceedings), pp. 21–29.
    1. Costa F., De Grave K. (2010). Fast neighborhood subgraph pairwise distance kernel. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, pp. 255–262. Omnipress.
    1. Fu Y. et al. (2014) Dynalign II: common secondary structure prediction for RNA homologs with domain insertions. Nucleic Acids Res., 42, 13939–13948. - PMC - PubMed