SynerClust: a highly scalable, synteny-aware orthologue clustering tool

Christophe H Georgescu¹, Abigail L Manson¹, Alexander D Griggs¹, Christopher A Desjardins¹, Alejandro Pironti¹, Ilan Wapinski², Thomas Abeel¹,³, Brian J Haas¹, Ashlee M Earl¹

Affiliations

¹ Broad Institute, Cambridge, MA, USA.
² enEvolv, Boston, MA, USA.
³ Delft University of Technology, Delft, The Netherlands.

PMID: 30418868
PMCID: PMC6321874
DOI: 10.1099/mgen.0.000231

Microb Genom. 2018 Nov;4(11):e000231. doi: 10.1099/mgen.0.000231. Epub 2018 Nov 12.

Abstract

Accurate orthologue identification is a vital component of bacterial comparative genomic studies, but many popular sequence-similarity-based approaches do not scale well to the large numbers of genomes that are now generated routinely. Furthermore, most approaches do not take gene synteny into account, which is useful information for disentangling paralogues. Here, we present SynerClust, a user-friendly synteny-aware tool based on SYNERGY that can process thousands of genomes. SynerClust was designed to analyse genomes with high levels of local synteny, particularly prokaryotes, which have operon structure. SynerClust's run-time is optimized by selecting cluster representatives at each node in the phylogeny, thus avoiding the need for exhaustive pairwise similarity searches. In benchmarking against Roary, Hieranoid2, PanX and Reciprocal Best Hit, SynerClust identified sets of core genes more completely for datasets that included diverse strains, while using substantially less memory and scaling comparably to the fastest tools. Due to its scalability, ease of installation and use, and suitability for a variety of computing environments, orthogroup clustering using SynerClust will enable many large-scale prokaryotic comparative genomics efforts.

Keywords: comparative genomics; orthogroup clustering; orthologues; synteny.

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

**Fig. 1.**
Overview of the SynerClust algorithm. (a) Input phylogeny: example of a phylogenetic guide tree. SynerClust traverses the input phylogeny from the leaves to the root, iteratively computing sequence similarity and synteny, combining information from the children of each internal node. First, leaves B and C (children) are processed at internal node G (parent). Second, node G and leaf A are processed at internal node J. This second step is used as an example in the algorithm explanation below. (b) Initial clustering for node J: initial clusters of orthogroups are constructed from BLAST+ results between representative sequences of child orthogroups. A lenient cut-off (E value 1×10⁻⁵) is used, and hits with at least 80 % identity to the best hit are kept. After filtering, only reciprocal hits are used to build a graph from which each set of connected orthogroups becomes a cluster (orange groups). (c) Calculation of syntenic fraction: a syntenic fraction for a specific orthogroup (orthogroup coloured in black) is calculated by dividing the number of shared neighbours within a 6 kb distance window (coloured in purple or red) by the total number of neighbours between two genomes (shared or unshared). For each cluster, a syntenic similarity matrix is built using the mean of all pairwise syntenic fractions. (d) Final clustering: final orthogroups for the current parent node are defined from the initial clusters by first looking for highly syntenic pairs, then for remaining pairs of reciprocal best hits. Child orthogroups that remain unmerged are marked as paralogues (potential inparalogues) of their best hit. At the next node, if they are still not part of an orthogroup, the mark is kept; otherwise it is removed. (e) Representative selection: for each parent orthogroup, representative sequences from child orthogroups are aligned (using MUSCLE [35]) and used to build a tree (using FastTree2 [21]). Groups of highly similar sequences are defined by applying a sequence similarity threshold (red boxes). The longest sequence is then selected as a representative for all other sequences within a set mutational distance. This is repeated by selecting additional representatives until all sequences are represented.
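Panels (b) and (c) can be illustrated with a minimal Python sketch. It is not SynerClust's code: the function names, the data layout (a dict mapping an orthogroup label to a single midpoint coordinate) and the reading of the syntenic fraction as shared neighbours over the union of neighbours are all assumptions made for illustration. The sketch also assumes that the hits have already passed the E-value and identity filters, and that neighbouring genes in both genomes carry labels that can be compared directly.

```python
from collections import defaultdict

WINDOW_BP = 6000  # the 6 kb window mentioned in the caption


def reciprocal_clusters(hits):
    """Group orthogroups linked by reciprocal hits (panel b).

    `hits` holds (query, subject) pairs that already passed filtering.
    Only pairs seen in both directions become graph edges; each
    connected component of the graph is one initial cluster.
    """
    hit_set = set(hits)
    graph = defaultdict(set)
    for query, subject in hit_set:
        if query != subject and (subject, query) in hit_set:
            graph[query].add(subject)
            graph[subject].add(query)

    clusters, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def neighbours(genome, target, window=WINDOW_BP):
    """Labels located within `window` bp of `target` (hypothetical layout)."""
    centre = genome[target]
    return {label for label, pos in genome.items()
            if label != target and abs(pos - centre) <= window}


def syntenic_fraction(genome_a, target_a, genome_b, target_b, window=WINDOW_BP):
    """Shared neighbours divided by all neighbours, shared or unshared (panel c)."""
    near_a = neighbours(genome_a, target_a, window)
    near_b = neighbours(genome_b, target_b, window)
    union = near_a | near_b
    return len(near_a & near_b) / len(union) if union else 0.0


# Toy data: a2 -> g1 and a3 -> g3 are one-directional and are ignored.
hits = [("a1", "g1"), ("g1", "a1"), ("a2", "g2"), ("g2", "a2"),
        ("a2", "g1"), ("a3", "g3")]
genome_a = {"x": 10000, "n1": 8000, "n2": 13000, "n3": 30000}
genome_b = {"x": 5000, "n1": 3000, "n4": 9000}

print(reciprocal_clusters(hits))
print(syntenic_fraction(genome_a, "x", genome_b, "x"))
```

In the toy data, gene `x` has one shared neighbour (`n1`) and two unshared ones (`n2`, `n4`) inside the window, so the fraction is one third; `n3` lies outside the window and does not count.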

**Fig. 2.**
SynerClust runs fast and uses less memory than other tools. (a) Run-times indicate estimated CPU time (for details see Table S2). (b) Memory usage value indicated is the peak value.

**Fig. 3.**
Consistency of function within SynerClust orthogroups is similar to that of other methods. Scoring metrics for different tools on the *E. coli* dataset: mean Schlicker EC score, mean Schlicker GO score, KEGG orthology Jaccard similarity, KEGG pathway Jaccard similarity and Pfam Jaccard similarity. ‘Pairs’ indicates that a mean is taken over all pairwise combinations, whereas ‘clusters’ indicates a mean over the clusters. Error bars represent the SD. Similar results are seen for the *Enterobacteriaceae* dataset (Fig. S3).

**Fig. 4.**
SynerClust consistently identifies a large single-copy core (SCC) across all datasets. The numbers of SCC and multi-copy core orthogroups identified by each method are shown. We did not run Hieranoid2, PanX or RBH on the *M. tuberculosis* dataset because these methods do not scale well enough to run on datasets of this size.

References

1. Salichos L, Rokas A. Evaluating ortholog prediction algorithms in a yeast model clade. PLoS One. 2011;6:e18755. doi: 10.1371/journal.pone.0018755.
2. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503.
3. Fouts DE, Brinkac L, Beck E, Inman J, Sutton G. PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Res. 2012;40:e172. doi: 10.1093/nar/gks757.
4. Zhao Y, Wu J, Yang J, Sun S, Xiao J, et al. PGAP: pan-genomes analysis pipeline. Bioinformatics. 2012;28:416–418. doi: 10.1093/bioinformatics/btr655.
5. Sonnhammer EL, Gabaldón T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, et al. Big data and other challenges in the quest for orthologs. Bioinformatics. 2014;30:2993–2998. doi: 10.1093/bioinformatics/btu492.

Grants and funding

U19 AI110818/AI/NIAID NIH HHS/United States
