Efficient gene orthology inference via large-scale rearrangements

Diego P Rubert^{1

2}, Marília D V Braga³

Affiliations

¹ Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil.
² Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
³ Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany. mbraga@cebitec.uni-bielefeld.de.

PMID: 37770945
PMCID: PMC10540461
DOI: 10.1186/s13015-023-00238-y

Efficient gene orthology inference via large-scale rearrangements

Diego P Rubert et al. Algorithms Mol Biol. 2023.

. 2023 Sep 28;18(1):14.

doi: 10.1186/s13015-023-00238-y.

Authors

Diego P Rubert^{1

2}, Marília D V Braga³

Affiliations

¹ Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil.
² Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany.
³ Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany. mbraga@cebitec.uni-bielefeld.de.

PMID: 37770945
PMCID: PMC10540461
DOI: 10.1186/s13015-023-00238-y

Abstract

Background: Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. The mentioned ILP includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, producing an exponential increase of the search space.

Results: In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering (based on their gene content intersections) the linear segments into [Formula: see text] subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of resulting in sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, from both the speed performance and the quality of computed solutions. Our experiments on primate and fruit fly genomes show two positive results. First, for complete assemblies of five primates the version with heuristic capping reports orthologies that are very similar to the orthologies computed by the version of our tool with optimal capping. Second, we were able to efficiently analyze fruit fly genomes with incomplete assemblies distributed in hundreds or even thousands of contigs, obtaining gene families that are very similar to [Formula: see text] families. Indeed, our tool inferred a higher number of complete cliques, with a higher intersection with [Formula: see text], when compared to gene families computed by other inference tools. We added a post-processing for refining, with the aid of the [Formula: see text] algorithm, our ambiguous families (those with more than one gene per genome), improving even more the accuracy of our results. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities and the post-processing refinement of ambiguous families with [Formula: see text]. Both the original version with optimal capping and the new modified version with heuristic capping can be downloaded, together with their detailed documentations, at https://gitlab.ub.uni-bielefeld.de/gi/FFGC or as a Conda package at https://anaconda.org/bioconda/ffgc .

Keywords: Comparative genomics; Double-cut-and-join; Gene orthology; Indels.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
On the top part is displayed the gene similarity graph $S (A, B)$ of genomes $A = {[1 2 3 4] [5 \bar{6}]}$ and $B = {[7 \bar{8} \bar{9} \bar{10} 11 12 13]}$ and next to it a table with the ranking of four distinct ortholog-sets. On the middle the rearrangement scenarios induced by two of these ortholog-sets are shown. On the bottom part the family-free relational graph $F F R (A, B, S)$ is illustrated, highlighting the edges of the decomposition corresponding to the (black) ortholog-set $O = {{1, 7}$ , ${3, 10}, {4, 9}, {5, 13}}$ . (This decomposition has two $A B$ -paths, one $A A$ -path and one cycle.) All extremity and indel edges in $F F R (A, B, S)$ are respectively scored and weighted according to $S (A, B)$ but the scores and weights of edges not derived from $O$ or $\tilde{O}$ are omitted

**Fig. 2**
On the top part we show the capping of the decomposition corresponding to the (black) ortholog-set $O = {{1, 7}$ , ${3, 10}, {4, 9}, {5, 13}}$ from the gene similarity graph $S (A, B)$ of Fig. 1 (bottom). Each red vertex is a cap vertex. Each filled (red) vertex is connected to a telomere (chromosome/path ends). The unfilled vertices represent the extra (equalizing) vertices connected by a dummy adjacency. The capping is a perfect matching of the complete bipartite graph of the cap vertices. The optimal capping for this decomposition is highlighted. It closes each of its paths into a separate cycle. (In general, an optimal capping of a decomposition may link up to 4 paths into a single cycle [8]). On the bottom part is displayed the complete family-free graph $F F R (A, B, S)$ optimally capped. Cap edges are unweighted. Scores of extremity edges and weights of indel edges are omitted

**Fig. 3**
Examples of two arbitrary lighter valid cappings of the family-free relational diagram from Fig. 1 (bottom) and their effects on the ranking of ortholog-sets/sibling-sets represented in Fig. 1 (top). Both cappings affect the computed distances, but, while the capping shown in the left (cyan) preserves the optimal ranking, the one shown in the right (orange) does not

**Fig. 4**
a Example of a shared-content graph $C (A, B)$ with omitted edge scores. The genomes $A$ and $B$ have linear segments $A_{1 . . 5}$ and $B_{1 . . 5}$ , respectively. The capping of $F F R (A, B, S)$ induced by $C$ is invalid. b Transformation of $C$ into a perfect shared-content graph $\hat{C} (A, B)$ : vertex sets $S_{1}$ and $S_{2}$ represent Hall violators (among other possibilities) that demand the creation of dummy segments $φ_{B}^{1}$ and $φ_{A}^{1}$ , respectively. Dotted edges represent those that are non-matchable and must be removed from $\hat{C}$ after the completion is finished. Notice that the component with vertices $A_{1}, A_{2}, A_{3}, B_{1}, φ_{B}^{1} and B_{2}$ is not a complete bipartite subgraph. (In both (a, b), we give an abstract illustration of the capped $F F R$ where only cap vertices, cap edges and dummy adjacencies are represented explicitly, while vertices of gene extremities between cap vertices are represented by a line with small dots. In addition, colored solid edges represent a maximum cardinality matching between cap vertices, while the cap edges not in the matching are dashed grey)

**Fig. 5**
The pipeline of our approach is straightforward: our gene families are the connected components of the n-partite graph derived by the integration of the computed ortholog-sets. The resulting ambiguous families can be optionally refined with the help of $M C L$

**Fig. 6**
Types of families given by the integration of three ortholog-sets into a 3-partite graph

**Fig. 7**
The numbers of resolved incomplete and complete families in $F L Y B A S E$ , followed by the numbers of families inferred by $O M A HOG S$ , $O M A G R O U P S$ , $P R O T E I N O R T H O$ , $P O F F$ and $O R T H O FFGC \approx$ (without and with $M C L$ refinement). The lower part of each bar represents the intersection between the inferred sets and $F L Y B A S E$ . (For resolved complete families, the numbers of families in the intersections are shown on the top of bars)

**Fig. 8**
Precision $(\frac{TP}{TP + FP})$ , recall $(\frac{TP}{TP + FN})$ and their harmonic mean $F_{1}$ -score for $O M A HOG S$ , $O M A G R O U P S$ , $P R O T E I N O R T H O$ , $P O F F$ and $O R T H O FFGC \approx$ (without and with $M C L$ refinement), based on the dataset with eleven *Drosophilas*

See this image and copyright information in PMC

References

1. Bergeron A, Mixtacki J, Stoye J. A unifying view of genome rearrangements. In: Proc. of WABI. Lecture Notes in Bioinformatics, 2006;4175:163–173.
1. Hannenhalli S, Pevzner PA. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proc. of FOCS, 1995:581–592.
1. Braga MDV, Willing E, Stoye J. Double cut and join with insertions and deletions. J Comput Biol. 2011;18(9):1167–1184. doi: 10.1089/cmb.2011.0118. - DOI - PubMed
1. Sankoff D. Genome rearrangement with gene families. Bioinformatics. 1999;15(11):909–917. doi: 10.1093/bioinformatics/15.11.909. - DOI - PubMed
1. Bryant D. The complexity of calculating exemplar distances. In: Sankoff D, Nadeau JH, editors. Comparative Genomics. Computational Biology Series. London: Kluver Academic Publishers; 2000. pp. 207–211.

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Efficient gene orthology inference via large-scale rearrangements

Affiliations

Efficient gene orthology inference via large-scale rearrangements

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

LinkOut - more resources

Full Text Sources

Miscellaneous