Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2023 Apr 18;120(16):e2218329120.
doi: 10.1073/pnas.2218329120. Epub 2023 Apr 12.

Identification of hidden associations among eukaryotic genes through statistical analysis of coevolutionary transitions

Affiliations

Identification of hidden associations among eukaryotic genes through statistical analysis of coevolutionary transitions

Elena Dembech et al. Proc Natl Acad Sci U S A. .

Abstract

Coevolution at the gene level, as reflected by correlated events of gene loss or gain, can be revealed by phylogenetic profile analysis. The optimal method and metric for comparing phylogenetic profiles, especially in eukaryotic genomes, are not yet established. Here, we describe a procedure suitable for large-scale analysis, which can reveal coevolution based on the assessment of the statistical significance of correlated presence/absence transitions between gene pairs. This metric can identify coevolution in profiles with low overall similarities and is not affected by similarities lacking coevolutionary information. We applied the procedure to a large collection of 60,912 orthologous gene groups (orthogroups) in 1,264 eukaryotic genomes extracted from OrthoDB. We found significant cotransition scores for 7,825 orthogroups associated in 2,401 coevolving modules linking known and unknown genes in protein complexes and biological pathways. To demonstrate the ability of the method to predict hidden gene associations, we validated through experiments the involvement of vertebrate malate synthase-like genes in the conversion of (S)-ureidoglycolate into glyoxylate and urea, the last step of purine catabolism. This identification explains the presence of glyoxylate cycle genes in metazoa and suggests an anaplerotic role of purine degradation in early eukaryotes.

Keywords: coevolution; gene association; glyoxylate cycle; purine catabolism; statistical significance.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Fig. 1.
Pipeline and metrics for coevolutionary analysis. (A) Scheme of the workflow used for coevolutionary analysis. (B) Scheme illustrating the metrics used for measuring score and significance of coevolutionary associations. Two evolutionary scenarios (Case 1 and Case 2) with 16 species (S1 to S16) related by a phylogenetic tree and two orthogroups (G1, G2) with the same correspondence (15 matches, 1 mismatch) of presence (black squares) and absence (white squares) states are compared. For the two cases, reported are the score and significance computed from the enumeration of state transitions (1→0, 0→1) along the phylogenetic profile vectors. The cotr_score is a function of the total state transitions (t1, t2) and the number of concordant (c) and discordant (d) transitions (k = c − d; see also SI Appendix, Fig. S1). Unshared transitions (gray) are counted in the total number of transitions (t1 or t2), but they are considered neither concordant or discordant. Significance (P-value) is calculated from the transition table using Fisher’s exact test (Methods). Standard measures of profile similarities (Jaccard, Pearson) and Pearson P-values are reported for comparison. Although the associations between G1 and G2 receive the same scores and significance by global measures of similarity, the cotr_score is more significant in Case 2, consistent with the higher occurrence of shared gene losses, and the higher confidence of functional association.
Fig. 2.
Fig. 2.
General results of the coevolutionary analysis. (A) Volcano plot showing the relation between score (cotr_score) and significance (adjusted P-values; P. adj) of 4,727,281 orthogroup pairs with unadjusted P-value <10−3; the numerosity of individual dots is indicated by the dot color as shown in the scale bar. The dashed gray line represents the p.adj cutoff (1e-3) used in subsequent analyses; transitions were calculated using an RL tree. (B) Size distribution of 2,401 coevolving modules obtained by applying the Markov cluster (MCL) algorithm to individual orthogroup pairs; 22,865 pairs with positive cotr_score and significant p.adj value in all fully resolved tree orientations were considered. (C) Organism distribution of coevolving modules. Colors indicate the fraction of orthogroups belonging to the same module present in individual genomes as shown in the scale bar. Vertical lines indicate the boundaries of taxonomic groups at the kingdom level. D=Discoba; groups with <20 members are unlabeled. Species are ordered according to an RL tree polarized with Viridiplantae as the starting node. Modules are ordered according to their Canberra distance in species distribution.
Fig. 3.
Fig. 3.
Functional relationships of coevolving orthogroups. (A) Similarity of gene ontology (GO) experimental annotation in orthogroup pairs binned by unadjusted P-values. The fraction of orthogroup pairs with semantic similarity scores above the mean is reported for the GO aspects cellular component (CC), biological process (BP), and molecular function (MF). (B) Enrichment bar-plot of GO BP using an enrichment P-value cutoff of 0.001; NES = normalized enrichment score. (C) Module experimental annotation (fraction of orthogroups) in GO, KEGG, and KEGG metabolism (KEGG_M) databases. (D) Consensus of database annotation in modules. The database annotation consensus (Db consensus) indicates the fraction of annotated orthogroups included in the same GO-Slim term or KEGG map. (E) Enrichment bar-plot of KEGG metabolism using an enrichment P-value cutoff of 0.01; NES = normalized enrichment score.
Fig. 4.
Fig. 4.
Identification of a missing gene in purine catabolism through cotr analysis. (A) Schematic network showing significant cotr associations of orthogroups representing urate oxidase (Uox) and allantoicase (Allc) with malate synthase (MS); other significant associations involve isocitrate lyase (ICL), HIU hydrolase (Urah), and OHCU decarboxylase (Urad). Modules defined by mcl clustering are encircled by ovals. (B) Scheme of the purine degradation and glyoxylate shunt reactions with the corresponding enzymes: True-positive genes retrieved by cotr analysis are shaded as in A; the false-negative allantoinase (Alln), member of the dihydropyrimidinase multigene family, is shaded gray; ureidoglycolate lyase (UGL) encoded by an unknown gene in Metazoa is shaded yellow. (C) Distribution map of Uox, Allc, MS, and ICL orthogroups across the ncbi phylogeny of selected eukaryotic species with emphasis on Metazoa. PhyloPic silhouettes (https://www.phylopic.org) are added to selected branches to aid species identification. A less detailed diagram of the gene distribution in 1,264 species is shown in SI Appendix, Fig. S13.
Fig. 5.
Fig. 5.
Danio rerio malate synthase-like (DrMSL) encodes ureidoglycolate lyase (UGL). (A) Cartoon representation of the DrMSL 3D homology model superimposed on the experimental structure of E.coli malate synthase A (PDB ID: 3CUZ) (41). Residues relevant for MS activity and the corresponding residues in DrMSL are drawn in sticks. The close-up panels show lack of conservation at the acetyl-CoA-binding site (red box) and conservation of residues involved in glyoxylate binding (blue box), except for two substitutions in DrMSL. (B) Sequence logo of MS and MS-like C-terminal sequences of selected eukaryotic species depicting the presence of a PTS1 motif. (C) Venn diagram of Homo sapiens and D. rerio proteomes with intersections defined by the expected features of the UGL enzyme (MW: 64 kDa ± 10%, PTS1 signal, present in D. rerio not in Homo sapiens). Accession numbers of the two proteins retrieved by the search (MSL and “protein brambleberry precursor”) are written in blue and in black. (D) Stacked plots of 1H NMR spectra of 52.5 mM ureidoglycolate in 95% D2O recorded at different time points after the addition of 2 µM DrMSL preincubated with 3 mM MgCl2. (E) Circular dichroism (CD) time-evolution spectra of 2.5 mM ureidoglycolate in the presence of 1 µM DrMSL, showing formation of the (R)-ureidoglycolate spectrum (42). (F) Fitting with the Michaelis–Menten equation of the initial velocity (V0) of 1 µM DrMSL with different substrate concentrations; the calculated kinetics constants are shown in the inset. Error bars represent SD of three independent experiments.
Fig. 6.
Fig. 6.
Evolutionary and functional divergence of malate synthase and ureidoglycolate lyase. (A) Unrooted maximum likelihood tree of MS and UGL sequences constructed using PhyML with the LG model (49). The scale bar corresponds to the number of calculated substitutions per site (0.5). Selected terminal nodes are labeled with the abbreviated species name; metazoan species are included in salmon triangles and other species are colored according to taxonomy. Proteins characterized in this work as malate synthase (MS) and ureidoglycolate lyase (UGL) are indicated by arrows. The branch of the inferred gene duplication separating MS and UGL is indicated by a red asterisk; the gray segment indicates uncertainty in the node position along the branch. (B) Superimposed spectra of acetyl-CoA (0.25 mM, solid line) and CoA (0.25 mM, dotted line). (C) Kinetics of the condensation of acetyl-CoA (0.25 mM) and glyoxylate (0.50 mM) catalyzed by NvMS, monitored at 232 nm. The assay was performed in the presence of 1 mM MgCl2 (dashed line), or 0.125 µM NvMS (light blue line), or both MgCl2 and NvMS (red line). (D) Malate synthase-specific activity of DrMSL and NvMS in the presence or in the absence of 1 mM MgCl2. (E) Ureidoglycolate lyase-specific activity of DrMSL and NvMS in the presence or in the absence of 1mM MgCl2.
Fig. 7.
Fig. 7.
Evolutionary connection of glyoxylate cycle and purine catabolism. The glyoxylate shunt (green arrows) bypasses two decarboxylation reactions of the Krebs cycle by converting isocitrate into succinate and glyoxylate through the enzyme isocitrate lyase (ICL). Glyoxylate is converted into (S)-malate, a Krebs cycle intermediate, by malate synthase (MS) through condensation with acetyl-CoA. Glyoxylate is also formed as the end product of purine degradation from ureidoglycolate in the reaction catalyzed by ureidoglycolate lyase (UGL). The evolutionary origin of MS and UGL by an ancient gene duplication in eukaryotes (red line) suggests a link between the two metabolic pathways and an anaplerotic role of purine catabolism in early eukaryotes.

Comment in

  • Phylogenetic profiling in eukaryotes comes of age.
    Moi D, Dessimoz C. Moi D, et al. Proc Natl Acad Sci U S A. 2023 May 9;120(19):e2305013120. doi: 10.1073/pnas.2305013120. Epub 2023 May 1. Proc Natl Acad Sci U S A. 2023. PMID: 37126713 Free PMC article. No abstract available.

Similar articles

Cited by

References

    1. de Juan D., Pazos F., Valencia A., Emerging methods in protein co-evolution. Nat. Rev. Genet. 14, 249–261 (2013). - PubMed
    1. Burger L., Van Nimwegen E., Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Comput. Biol. 6, e1000633 (2010). - PMC - PubMed
    1. Kamisetty H., Ovchinnikov S., Baker D., Assessing the utility of coevolution-based residue–residue contact predictions in a sequence-and structure-rich era. Proc. Natl. Acad. Sci. U.S.A. 110, 15674–15679 (2013). - PMC - PubMed
    1. Ebert D., Fields P. D., Host–parasite co-evolution and its genomic signature. Nat. Rev. Genet. 21, 754–768 (2020). - PubMed
    1. Tatusov R. L., Koonin E. V., Lipman D. J., A genomic perspective on protein families. Science 278, 631–637 (1997). - PubMed

Publication types

LinkOut - more resources