Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2023 Sep 28;18(1):14.
doi: 10.1186/s13015-023-00238-y.

Efficient gene orthology inference via large-scale rearrangements

Affiliations

Efficient gene orthology inference via large-scale rearrangements

Diego P Rubert et al. Algorithms Mol Biol. .

Abstract

Background: Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. The mentioned ILP includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, producing an exponential increase of the search space.

Results: In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering (based on their gene content intersections) the linear segments into [Formula: see text] subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of resulting in sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, from both the speed performance and the quality of computed solutions. Our experiments on primate and fruit fly genomes show two positive results. First, for complete assemblies of five primates the version with heuristic capping reports orthologies that are very similar to the orthologies computed by the version of our tool with optimal capping. Second, we were able to efficiently analyze fruit fly genomes with incomplete assemblies distributed in hundreds or even thousands of contigs, obtaining gene families that are very similar to [Formula: see text] families. Indeed, our tool inferred a higher number of complete cliques, with a higher intersection with [Formula: see text], when compared to gene families computed by other inference tools. We added a post-processing for refining, with the aid of the [Formula: see text] algorithm, our ambiguous families (those with more than one gene per genome), improving even more the accuracy of our results. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities and the post-processing refinement of ambiguous families with [Formula: see text]. Both the original version with optimal capping and the new modified version with heuristic capping can be downloaded, together with their detailed documentations, at https://gitlab.ub.uni-bielefeld.de/gi/FFGC or as a Conda package at https://anaconda.org/bioconda/ffgc .

Keywords: Comparative genomics; Double-cut-and-join; Gene orthology; Indels.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
On the top part is displayed the gene similarity graph S(A,B) of genomes A={[1234][56¯]} and B={[78¯9¯10¯111213]} and next to it a table with the ranking of four distinct ortholog-sets. On the middle the rearrangement scenarios induced by two of these ortholog-sets are shown. On the bottom part the family-free relational graph FFR(A,B,S) is illustrated, highlighting the edges of the decomposition corresponding to the (black) ortholog-set O={{1,7}, {3,10},{4,9},{5,13}}. (This decomposition has two AB-paths, one AA-path and one cycle.) All extremity and indel edges in FFR(A,B,S) are respectively scored and weighted according to S(A,B) but the scores and weights of edges not derived from O or O~ are omitted
Fig. 2
Fig. 2
On the top part we show the capping of the decomposition corresponding to the (black) ortholog-set O={{1,7}, {3,10},{4,9},{5,13}} from the gene similarity graph S(A,B) of Fig. 1 (bottom). Each red vertex is a cap vertex. Each filled (red) vertex is connected to a telomere (chromosome/path ends). The unfilled vertices represent the extra (equalizing) vertices connected by a dummy adjacency. The capping is a perfect matching of the complete bipartite graph of the cap vertices. The optimal capping for this decomposition is highlighted. It closes each of its paths into a separate cycle. (In general, an optimal capping of a decomposition may link up to 4 paths into a single cycle [8]). On the bottom part is displayed the complete family-free graph FFR(A,B,S) optimally capped. Cap edges are unweighted. Scores of extremity edges and weights of indel edges are omitted
Fig. 3
Fig. 3
Examples of two arbitrary lighter valid cappings of the family-free relational diagram from Fig. 1 (bottom) and their effects on the ranking of ortholog-sets/sibling-sets represented in Fig. 1 (top). Both cappings affect the computed distances, but, while the capping shown in the left (cyan) preserves the optimal ranking, the one shown in the right (orange) does not
Fig. 4
Fig. 4
a Example of a shared-content graph C(A,B) with omitted edge scores. The genomes A and B have linear segments A1..5 and B1..5, respectively. The capping of FFR(A,B,S) induced by C is invalid. b Transformation of C into a perfect shared-content graph C^(A,B): vertex sets S1 and S2 represent Hall violators (among other possibilities) that demand the creation of dummy segments φB1 and φA1, respectively. Dotted edges represent those that are non-matchable and must be removed from C^ after the completion is finished. Notice that the component with vertices A1,A2,A3,B1,φB1andB2 is not a complete bipartite subgraph. (In both (a, b), we give an abstract illustration of the capped FFR where only cap vertices, cap edges and dummy adjacencies are represented explicitly, while vertices of gene extremities between cap vertices are represented by a line with small dots. In addition, colored solid edges represent a maximum cardinality matching between cap vertices, while the cap edges not in the matching are dashed grey)
Fig. 5
Fig. 5
The pipeline of our approach is straightforward: our gene families are the connected components of the n-partite graph derived by the integration of the computed ortholog-sets. The resulting ambiguous families can be optionally refined with the help of MCL
Fig. 6
Fig. 6
Types of families given by the integration of three ortholog-sets into a 3-partite graph
Fig. 7
Fig. 7
The numbers of resolved incomplete and complete families in FLYBASE, followed by the numbers of families inferred by OMAHOGS, OMAGROUPS, PROTEINORTHO, POFF and ORTHOFFGC (without and with MCL refinement). The lower part of each bar represents the intersection between the inferred sets and FLYBASE. (For resolved complete families, the numbers of families in the intersections are shown on the top of bars)
Fig. 8
Fig. 8
Precision TPTP+FP, recall TPTP+FN and their harmonic mean F1-score for OMAHOGS, OMAGROUPS, PROTEINORTHO, POFF and ORTHOFFGC (without and with MCL refinement), based on the dataset with eleven Drosophilas

References

    1. Bergeron A, Mixtacki J, Stoye J. A unifying view of genome rearrangements. In: Proc. of WABI. Lecture Notes in Bioinformatics, 2006;4175:163–173.
    1. Hannenhalli S, Pevzner PA. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proc. of FOCS, 1995:581–592.
    1. Braga MDV, Willing E, Stoye J. Double cut and join with insertions and deletions. J Comput Biol. 2011;18(9):1167–1184. doi: 10.1089/cmb.2011.0118. - DOI - PubMed
    1. Sankoff D. Genome rearrangement with gene families. Bioinformatics. 1999;15(11):909–917. doi: 10.1093/bioinformatics/15.11.909. - DOI - PubMed
    1. Bryant D. The complexity of calculating exemplar distances. In: Sankoff D, Nadeau JH, editors. Comparative Genomics. Computational Biology Series. London: Kluver Academic Publishers; 2000. pp. 207–211.

LinkOut - more resources