BMC Bioinformatics. 2011 Apr 22;12:116.
doi: 10.1186/1471-2105-12-116.

Ultra-fast sequence clustering from similarity networks with SiLiX

Vincent Miele et al. BMC Bioinformatics. 2011.

Abstract

Background: The number of gene sequences available for comparative genomics is increasing extremely quickly. A current challenge is to handle this huge number of sequences so that families of homologous sequences can be built in a reasonable time.

Results: We present the software package SiLiX, which implements a novel method that reconsiders single linkage clustering with a graph-theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the software's capabilities, we clustered more than 3 million sequences from about 2 billion BLAST hits in 7 minutes, with high clustering quality in terms of both sensitivity and specificity.

Conclusions: Compared with state-of-the-art software, SiLiX offers the best current capabilities for clustering large collections of sequences. SiLiX is freely available at http://lbbe.univ-lyon1.fr/SiLiX.

Figures

Figure 1
Single linkage clustering with alignment coverage constraints. The four proteins (A, B, C, D) contain some homologous domains (represented by colored boxes). To avoid placing proteins that share no homology (e.g. A and D) in the same family, a pairwise sequence alignment is used for clustering only if it covers at least a minimum fraction of the length of each of the two proteins. This threshold has to be high enough to exclude cases like the alignment (B, C), which would otherwise lead to the clustering of A and D.
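The coverage constraint described in this caption can be sketched as a simple filter on pairwise hits. This is a minimal illustration, not SiLiX's actual code: the `Hit` record, its field names, the example lengths, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment (hypothetical record layout)."""
    query: str
    subject: str
    query_len: int        # full length of the query protein
    subject_len: int      # full length of the subject protein
    aligned_query: int    # residues of the query covered by the alignment
    aligned_subject: int  # residues of the subject covered by the alignment


def passes_coverage(hit: Hit, min_coverage: float = 0.8) -> bool:
    """Keep a hit only if it covers enough of BOTH sequences.

    min_coverage is an illustrative value, not a documented default.
    """
    cov_query = hit.aligned_query / hit.query_len
    cov_subject = hit.aligned_subject / hit.subject_len
    return cov_query >= min_coverage and cov_subject >= min_coverage


# Made-up numbers in the spirit of Figure 1: A-B and C-D align over
# most of their lengths, while B-C share only one short domain.
hits = [
    Hit("A", "B", 300, 320, 280, 290),
    Hit("B", "C", 320, 400, 90, 90),
    Hit("C", "D", 400, 380, 350, 340),
]
kept = [(h.query, h.subject) for h in hits if passes_coverage(h)]
```

With these numbers the (B, C) hit is discarded, so A and D are never linked through it.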
Figure 2
An example of the steps involved in the Union-Find algorithm with union by rank and path compression [19,20]. Edges (first column, in red) are examined online, one at a time. The disjoint-sets data structure, represented by trees (third column) and implemented using the parent array (second column), is updated accordingly. The two vertices of the current edge of interest are colored in red.
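The technique named in this caption is a standard one, and a minimal sketch may help in reading the parent array shown in the figure. The class and function names below are illustrative and are not taken from the SiLiX source; vertices are assumed to be numbered from 0.

```python
class DisjointSets:
    """Union-Find with union by rank and path compression."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))  # each vertex starts as its own root
        self.rank = [0] * n           # upper bound on the height of each tree

    def find(self, x: int) -> int:
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited vertex at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return  # already in the same set
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1


def single_linkage(n: int, edges) -> list:
    """Process edges one at a time and return the resulting clusters."""
    sets = DisjointSets(n)
    for a, b in edges:
        sets.union(a, b)
    clusters = {}
    for v in range(n):
        clusters.setdefault(sets.find(v), []).append(v)
    return sorted(clusters.values())


families = single_linkage(7, [(0, 1), (2, 3), (1, 2), (4, 5)])
```

Because each edge is handled as it arrives and then forgotten, only the parent and rank arrays need to stay in memory, which is what makes this structure attractive when the edge list is very large.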
Figure 3
CPU time of the parallelized version of SiLiX (solid line) as a function of the number of processors, on the dataset of similarity pairs extracted from the HOGENOM database [9], compared with theoretical values (dashed line). Run on a cluster of 2 octo-bicore Opteron 2.8 GHz and 2 octo-quadcore Opteron 2.3 GHz.
Figure 4
Clustering performance evaluation based on InterPro classification. a) specificity, b) sensitivity and c) Jaccard coefficient of SiLiX, used on similarity pairs extracted from the HOGENOM database, with different values of threshold on the percentage of sequence identity and alignment coverage.
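The three measures in this caption can be illustrated with a pair-counting formulation, in which every pair of sequences is classified by whether the clustering and the reference classification agree on grouping it. This is a sketch under that assumption; the abstract does not state the exact definitions used in the paper, and the function name and toy labels below are hypothetical.

```python
from itertools import combinations


def pairwise_scores(predicted: dict, reference: dict) -> dict:
    """Pair-counting sensitivity, specificity and Jaccard coefficient.

    Both arguments map a sequence id to a group label and are assumed
    to contain exactly the same ids.
    """
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            tp += 1   # grouped together in both
        elif same_pred:
            fp += 1   # grouped by the clustering only
        elif same_ref:
            fn += 1   # grouped by the reference only
        else:
            tn += 1   # separated in both

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "jaccard": ratio(tp, tp + fp + fn),
    }


# Toy example: four sequences, two predicted families, two reference groups.
predicted = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
reference = {"s1": "X", "s2": "X", "s3": "X", "s4": "Y"}
scores = pairwise_scores(predicted, reference)
```

The loop visits every pair, so its cost grows quadratically with the number of sequences; it is meant to show the definitions, not to scale to millions of sequences.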

References

    1. Petryszak R, Kretschmann E, Wieser D, Apweiler R. The predictive power of the CluSTr database. Bioinformatics. 2005;21:3604–3609. doi: 10.1093/bioinformatics/bti542.
    2. Meinel T, Krause A, Luz H, Vingron M, Staub E. The SYSTERS Protein Family Database in 2005. Nucleic Acids Res. 2005;33:226–229. doi: 10.1093/nar/gki471.
    3. Dehal PS, Boore JL. A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database. BMC Bioinformatics. 2006;7
    4. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L, Wong GK, Zheng W, Dehal P, Wang J, Durbin R. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:572–580. doi: 10.1093/nar/gkj118.
    5. Hartmann S, Lu D, Phillips J, Vision TJ. Phytome: a platform for plant comparative genomics. Nucleic Acids Res. 2006;34:724–730. doi: 10.1093/nar/gkj045.
