BMC Evol Biol. 2016 Jun 4;16(1):120.

doi: 10.1186/s12862-016-0684-2.

GenFamClust: an accurate, synteny-aware and reliable homology inference algorithm

Raja H Ali¹, Sayyed A Muhammad¹, Lars Arvestad²,³,⁴

Affiliations

¹ KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, SE-171 77, Sweden.
² Department of Numerical Analysis and Computer Science, Stockholm University, Stockholm, SE-100 44, Sweden. arve@nada.su.se.
³ Swedish e-Science Research Centre, Stockholm, Sweden. arve@nada.su.se.
⁴ Science for Life Laboratory, Box 1031, Solna, SE-171 77, Sweden. arve@nada.su.se.

PMID: 27260514
PMCID: PMC4893229
DOI: 10.1186/s12862-016-0684-2

Raja H Ali et al. BMC Evol Biol. 2016.

Abstract

Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity.

Results: In this study, we validate GenFamClust by comparing it to well known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms on homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs.
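The clustering step mentioned above can be illustrated with single linkage, one of the algorithms named in this abstract. The following is a minimal sketch, not GenFamClust's own code: it assumes that homology calls arrive as scored gene pairs and that a pair counts as homologous when its score exceeds a threshold. The function name, the gene identifiers and the scores are all hypothetical. Under these assumptions, single-linkage families are the connected components of the homology graph, which a union-find structure recovers directly.

```python
from collections import defaultdict


def single_linkage_families(pairs, threshold=0.0):
    """Group genes into families as connected components of the graph
    whose edges are the gene pairs scoring above `threshold`.

    `pairs` is an iterable of (gene_a, gene_b, score) tuples. The input
    format and the default threshold are assumptions for illustration.
    """
    parent = {}

    def find(x):
        # Register unseen genes as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # registers both genes
        if score > threshold and root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = defaultdict(set)
    for gene in parent:
        families[find(gene)].add(gene)
    return sorted(sorted(members) for members in families.values())


# Hypothetical scored pairs; names and scores are made up.
example_pairs = [
    ("g1", "g2", 0.8),
    ("g2", "g3", 0.3),
    ("g3", "g4", -0.2),  # below threshold, so g3 and g4 stay apart
    ("g4", "g5", 0.6),
]
families = single_linkage_families(example_pairs)
print(families)
```

Single linkage merges two families as soon as any one pair links them, which is why it tends to produce larger clusters than average or complete linkage.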

Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.

Keywords: Clustering; Gene family; Gene order conservation; Gene similarity; Gene synteny; Homology inference.

Figures

**Fig. 1**
An overview of GenFamClust showing its modules and the experimental settings of each. The figure depicts the different modules, their functions, expected input, expected output and the author-recommended software settings for each module

**Fig. 2**
Species-wise distribution of genes in YGOB pillars. For each species in the YGOB v.7 pillars, the number of genes that have an ohnolog or ortholog is compared with the number of singleton genes (genes that do not have an ohnolog or ortholog assigned). The distribution shows that *L. waltii* has the most singleton genes

**Fig. 3**
Cluster quality scores on simulated datasets for homology inference algorithms. The cluster quality scores of (a) single, average and complete linkage clustering when applied to homologs inferred by GenFamClust and by Neighborhood Correlation and (b) hcluster_sg, MCL, SiLiX and HiFiX clustering on BLAST scores and GenFamClust with single linkage clustering for each simulated dataset. Datasets are arranged in ascending order of similarity and then in ascending order of synteny. (a) Gene families inferred from GFC-based clustering methods (*solid lines*) are more accurate than those inferred from NC-based clustering methods (*dotted lines*) for all clustering algorithms. (b) Gene families inferred from GFC-Single (*blue line*) are more accurate than gene families inferred from similarity-only clustering algorithms. The results are displayed in two panels for better legibility

**Fig. 4**
Precision-recall plot for various gene family inference methods on the metazoan dataset. The top right corner shows the maximum cluster quality for the compared methods. GFC-Single and GFC-Complete have the best cluster quality, followed by NC-Hierarchical and NC-Average. MCL, GFC-Complete and NC-Complete do not have data to test for recall beyond 0.54, 0.56 and 0.6, respectively. Other linkage algorithms on NC (NC-Single, NC-Average and NC-Hierarchical) have a maximum recall of 0.8, while GFC-Single and GFC-Average have a maximum recall of 0.75. Single linkage clustering on BLAST scores and HiFiX have a maximum recall of 0.85. The inset zooms in on recall between 0.45 and 0.6 for better legibility
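This page does not say how precision and recall were computed for the plot in Fig. 4. One common convention for comparing a predicted clustering with a reference is to count co-clustered gene pairs. The sketch below assumes that pairwise convention, which may differ from the authors' definition. The function names and the toy clusters are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered gene pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def pairwise_precision_recall(predicted, reference):
    """Pairwise precision and recall of `predicted` against `reference`.

    Assumed convention: a true positive is a gene pair placed together
    in both clusterings.
    """
    pred = co_clustered_pairs(predicted)
    ref = co_clustered_pairs(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(ref) if ref else 1.0
    return precision, recall


# Toy data; gene names are made up.
reference = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c", "d", "e"}]
precision, recall = pairwise_precision_recall(predicted, reference)
print(precision, recall)
```

Here two of the four predicted pairs (a-b and d-e) also appear among the four reference pairs, so both values are 0.5.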

**Fig. 5**
Cluster quality of various gene family inference methods on individual families and on the complete test data consisting of twenty families. The figure displays the cluster quality of selected methods with default settings for individual protein families as well as for all twenty gene families in the test data. Panels (a–d) display results for the FOX, TNFR, Kinase and USP protein families, respectively. FOX and TNFR are single-domain families, while Kinase and USP are multidomain families with large sequence divergence and diverse domain architectures. Panel (e) displays results for nineteen families (all except Kinases), because Kinases constitute more than half of the proteins and could bias the overall results. Panel (f) displays the results for all twenty families. In all panels, GFC-based clustering methods have cluster quality that is significantly higher than or equal to that of other methods

**Fig. 6**
Case study: Local synteny adding useful information for homology inference for proteins with domain insertion. The genes CD27 in *Mus musculus* and TNFRSF1A in *Homo sapiens* are homologous and belong to the TNFR superfamily. (a) The domain architecture of CD27 and TNFRSF1A. CD27 has an additional Death_TNFR domain which is absent in TNFRSF1A. The NC-score (relative to the metazoan reference data) for CD27 and TNFRSF1A is 0.351, which is under the 0.5 threshold recommended for calling the genes homologous [9]. (b, c) Gene order conservation between *Homo sapiens* chr. 12, *Pongo abelii* chr. 12 and *Mus musculus* chr. 6, with CD27 and TNFRSF1A at the center and five genes upstream and downstream in all three chromosomes. Only hits with NC-scores greater than 0.9 are displayed in these panels. Common NC-hits between the CD27 gene in *Mus musculus* and the TNFRSF1A gene in *Homo sapiens* are marked and used for calculating the synteny correlation score between the two genes. This is illustrated in (c), where CD27 in *Pongo abelii* is a common NC-hit of both genes. The direct synteny score between the two genes (shown in b), computed with the SyS score of GenFamClust, is 0.991, and the synteny correlation score, computed with the SyC score, is 0.983. The GFC score obtained for this pair of genes is 0.115 which, being greater than 0, indicates that CD27 and TNFRSF1A are homologs

**Fig. 7**
Case study: Local synteny adding useful information for homology inference for highly divergent proteins. The genes Usp26 in *Mus musculus* and USP46 in *Homo sapiens* are homologous and belong to the USP superfamily. Both proteins are single-domain proteins belonging to the Peptidase_C19 superfamily. The NC-score (relative to the metazoan reference data) for USP46 and Usp26 is 0.411, which is under the 0.5 threshold recommended for calling the genes homologous [9]. (a, b) Gene order conservation between *Homo sapiens* chr. 4, *Homo sapiens* chr. X and *Mus musculus* chr. X, with Usp26 and USP46 at the center and five genes upstream and downstream in all three chromosomes. Only hits with NC-scores greater than 0.7 are displayed in these panels. Common NC-hits between the Usp26 gene in *Mus musculus* and the USP46 gene in *Homo sapiens* are marked and used for calculating the synteny correlation score between the two genes. This is illustrated in (b), where USP26 in *Homo sapiens* is a common NC-hit of both genes. The direct synteny score between the two genes (shown in a), computed with the SyS score of GenFamClust, is 0.741, and the synteny correlation score, computed with the SyC score, is 0.672. The positive GFC score obtained for this pair of genes, 0.045, indicates that Usp26 and USP46 are homologs despite high sequence divergence and little sequence similarity
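The two case studies in Figs. 6 and 7 rest on the same decision rule: an NC-score under 0.5 misses the pair, while a GFC score above 0 calls it homologous. The sketch below only applies those two thresholds to the scores quoted in the captions. It does not compute NC, SyS, SyC or GFC scores, whose formulas are not given on this page. Treating an NC-score of exactly 0.5 as a positive call is an assumption, and all names in the code are hypothetical.

```python
NC_THRESHOLD = 0.5   # recommended NC cutoff quoted in the captions [9]
GFC_THRESHOLD = 0.0  # captions read a GFC score above 0 as homology

# Scores copied from the captions of Figs. 6 and 7.
cases = {
    ("CD27 (M. musculus)", "TNFRSF1A (H. sapiens)"): {"nc": 0.351, "gfc": 0.115},
    ("Usp26 (M. musculus)", "USP46 (H. sapiens)"): {"nc": 0.411, "gfc": 0.045},
}


def homology_calls(scores):
    """Apply both thresholds to one gene pair's scores."""
    return {
        # Assumption: a score equal to the cutoff counts as a positive call.
        "nc_homolog": scores["nc"] >= NC_THRESHOLD,
        "gfc_homolog": scores["gfc"] > GFC_THRESHOLD,
    }


results = {pair: homology_calls(scores) for pair, scores in cases.items()}
for pair, call in results.items():
    print(pair, call)
```

For both pairs the NC-only call is negative and the GFC call is positive, which is the point the captions make about local synteny recovering homologs that similarity alone misses.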

**Fig. 8**
Cluster coherence of GFC-Single with other gene family inference methods. The bar chart displays coherence at the cluster level between GFC-Single and other gene family inference methods. A cluster is termed common if it is found with exactly the same members in both methods. A cluster is termed a subset if two or more clusters from the first method are merged into a single cluster in the second method. Any cluster which is neither common nor a subset (or superset) is considered contradictory. As expected, NC-based methods, i.e., NC-Single, NC-Average, NC-Complete, GFC-Average and GFC-Complete, have the largest number of clusters in common with GFC-Single (shown by the blue parts of the bars), and there are no contradictory clusters between these methods and GFC-Single. However, the other methods (HiFiX, MCL, hcluster_sg and SiLiX) have relatively fewer clusters in common with GFC-Single, and a few contradictory clusters can also be observed for HiFiX and SiLiX
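The cluster categories defined in the Fig. 8 caption can be sketched in code. This is one possible reading of those definitions, not the authors' implementation. It assumes both clusterings are partitions of the same genes, labels each cluster of the first method relative to the second, and treats a superset as the mirror image of a subset. The function name and the toy clusters are hypothetical.

```python
from collections import Counter


def classify_clusters(first, second):
    """Label each cluster of `first` as common, subset, superset or
    contradictory with respect to the clustering `second`."""
    first = [frozenset(c) for c in first]
    second = [frozenset(c) for c in second]
    second_lookup = set(second)

    def is_union_of(cluster, parts):
        # True if `cluster` is exactly the union of two or more of `parts`.
        inside = [p for p in parts if p <= cluster]
        return len(inside) >= 2 and frozenset().union(*inside) == cluster

    labels = {}
    for c in first:
        if c in second_lookup:
            labels[c] = "common"
        elif any(c < d and is_union_of(d, first) for d in second):
            labels[c] = "subset"  # merged with other clusters in `second`
        elif is_union_of(c, second):
            labels[c] = "superset"  # split into several clusters in `second`
        else:
            labels[c] = "contradictory"
    return labels


# Toy clusterings over the same twelve made-up genes.
first = [{"a", "b"}, {"c"}, {"d", "e"}, {"f", "g", "h"}, {"i", "j"}, {"k", "l"}]
second = [{"a", "b", "c"}, {"d", "e"}, {"f"}, {"g", "h"}, {"i", "k"}, {"j", "l"}]
counts = Counter(classify_clusters(first, second).values())
print(dict(counts))
```

Counting the proteins inside each labeled cluster instead of the clusters themselves gives the kind of per-protein view described for Fig. 9.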

**Fig. 9**
Protein distribution by GFC-Single in comparison with other gene family inference methods. The bar chart displays the distribution of proteins according to the class of the cluster each protein is found in, where clusters can be common, subset, superset or contradictory, as defined for Fig. 8. The protein distribution in each class shows the aggressive clustering behavior of GFC-based clustering methods relative to other software: compared with the corresponding bar in Fig. 8, the percentage of proteins contained in common clusters is significantly lower than the percentage of common clusters, while the percentage of proteins contained in subset clusters of GFC-Single is substantially higher than the percentage of subset clusters of GFC-Single, for all software

**Fig. 10**
GFC cluster agreement with YGOB pillars. The figure shows bar charts displaying the agreement and disagreement between YGOB pillars and clusters formed by single linkage and average linkage on homologs inferred by GFC. First, clusters determined by GFC are mapped onto YGOB pillars in (a) for GFC-Average and in (b) for GFC-Single, where each bar displays the percentage of YGOB pillars in that category. Second, YGOB pillars are mapped onto clusters inferred by GFC-Average in (c) and by GFC-Single in (d), where each bar represents the percentage of GFC clusters in that category. The track “additional pillars” displays the percentage of clusters that look like a pillar, with genes from different species, and that contain singleton genes in YGOB. The track “GFC clusters superset of YGOB pillars” represents the percentage of GFC clusters that completely contain one or more YGOB pillar(s). The track “contradictions” represents the percentage of YGOB pillars/GFC clusters where GFC places two genes from the same YGOB pillar in different clusters

**Fig. 11**
Case study: Difference between GFC cluster and YGOB pillars - Phylogenetic analysis. (a) displays a portion of 6 YGOB pillars, where we are interested in the two pillars shown as *red* and *blue* columns. YGOB classifies them as separate clusters, as shown in (a), but GFC-Single and GFC-Average classify Ecym_1340 as part of the *blue pillar* and AEL037C as a singleton gene. (a) also shows that both the cluster and the pillar have good syntenic support via neighboring pillars. (b) displays the D/L score and the most parsimonious reconciliation of the gene tree with the species tree, constructed from the GFC cluster with and without Ecym_1340 added (highlighted here in *blue*). The cluster given by GFC-based methods (Ecym_1340 added to the *blue cluster* in a) has the lowest ratio of D/L score to genes and is, therefore, phylogenetically most probable. Multiple genes in the *blue pillar* have a BLAST hit with Ecym_1340, whereas there is no BLAST hit at E-value 10 between Ecym_1340 and AEL037C, showing similarity support for the GFC cluster and a lack of similarity for the genes in the *red pillar* in (a)

References

1. Levitt M. The birth of computational structural biology. Nat Struct Biol. 2001;8:392–3. doi: 10.1038/87545.
2. Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970;19(2):99–113. doi: 10.2307/2412448.
3. Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005;39:309–38. doi: 10.1146/annurev.genet.39.073003.114725.
4. Parker J, Tsagkogeorga G, Cotton JA, Liu Y, Provero P, Stupka E, Rossiter SJ. Genome-wide signatures of convergent evolution in echolocating mammals. Nature. 2013;502(7470):228–31. doi: 10.1038/nature12511.
5. Basu MK, Carmel L, Rogozin IB, Koonin EV. Evolution of protein domain promiscuity in eukaryotes. Genome Res. 2008;18(3):449–61. doi: 10.1101/gr.6943508.

Associated data

figshare/10.6084/m9.figshare.1536467
