Genome Res. 2012 Apr;22(4):755-65.
doi: 10.1101/gr.123901.111. Epub 2012 Jan 23.

Unified modeling of gene duplication, loss, and coalescence using a locus tree

Matthew D Rasmussen et al. Genome Res. 2012 Apr.

Abstract

Gene phylogenies provide a rich source of information about the way evolution shapes genomes, populations, and phenotypes. In addition to substitutions, evolutionary events such as gene duplication and loss (as well as horizontal transfer) play a major role in gene evolution, and many phylogenetic models have been developed in order to reconstruct and study these events. However, these models typically make the simplifying assumption that population-related effects such as incomplete lineage sorting (ILS) are negligible. While this assumption may have been reasonable in some settings, it has become increasingly problematic as increased genome sequencing has led to denser phylogenies, where effects such as ILS are more prominent. To address this challenge, we present a new probabilistic model, DLCoal, that defines gene duplication and loss in a population setting, such that coalescence and ILS can be directly addressed. Interestingly, this model implies that in addition to the usual gene tree and species tree, there exists a third tree, the locus tree, which will likely have many applications. Using this model, we develop the first general reconciliation method that accurately infers gene duplications and losses in the presence of ILS, and we show its improved inference of orthologs, paralogs, duplications, and losses for a variety of clades, including flies, fungi, and primates. Also, our simulations show that gene duplications increase the frequency of ILS, further illustrating the importance of a joint model. Going forward, we believe that this unified model can offer insights to questions in both phylogenetics and population genetics.

Figures

Figure 1.
Different views of gene trees and species trees. (A) In the dup-loss model, a congruent gene tree and species tree indicates that all genes are orthologs. (B) Incongruence indicates the presence of gene duplications (stars) and gene losses (red “X”). (C) An example of the Wright-Fisher (WF) process and the coalescence of three lineages within the population. (D) A multispecies coalescent is a combination of WF processes for each branch of the species tree. In this model, no duplications or losses are allowed, but a gene tree can be incongruent due to a phenomenon known as incomplete lineage sorting (ILS). (E) In the dup-loss model, the same gene tree in panel D can be explained using one gene duplication and at least three gene losses. ILS cannot be modeled in the dup-loss model.
Figure 2.
Duplication and loss events within a multispecies coalescent. (A) A duplication occurs in one chromosome and creates a new locus, “locus 2,” in the genome. At locus 2, the Wright-Fisher model dictates how the frequency p of the daughter duplicate (black dots) competes with the null allele (white dots) until it eventually fixes (p = 1). A gene tree is therefore a “traceback” in this combined process. (B) A new duplicate can undergo hemiplasy, and fixes in some lineages and goes extinct in others. (C) Similar to duplication, a gene loss (deletion) starts in one chromosome and drifts until it fixes or goes extinct.
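The drift described in panel A can be illustrated with a small simulation. This is a minimal sketch rather than the authors' implementation: it assumes a selectively neutral duplicate, a constant diploid population of `n_diploid` individuals (2N chromosomes), and non-overlapping generations. The function names are hypothetical.

```python
import random


def drift_new_duplicate(n_diploid, rng):
    """Follow one new duplicate at "locus 2" until it fixes or is lost.

    Assumption: neutral Wright-Fisher drift among 2N chromosomes,
    starting from a single copy (frequency p = 1 / 2N).
    Returns (fixed, generations_elapsed).
    """
    copies = 2 * n_diploid
    count = 1  # the duplicate arises on one chromosome
    generations = 0
    while 0 < count < copies:
        p = count / copies
        # Each chromosome in the next generation picks a parent at random,
        # so it carries the duplicate with probability p.
        count = sum(1 for _ in range(copies) if rng.random() < p)
        generations += 1
    return count == copies, generations


def fixation_fraction(n_diploid, trials, seed=0):
    """Estimate the fraction of new duplicates that reach p = 1."""
    rng = random.Random(seed)
    fixed = sum(drift_new_duplicate(n_diploid, rng)[0] for _ in range(trials))
    return fixed / trials


if __name__ == "__main__":
    # Under neutrality the expected fixation fraction is 1 / (2N).
    print(fixation_fraction(n_diploid=10, trials=4000))
```

Most simulated duplicates are lost within a few generations. The rare ones that fix correspond to the p = 1 outcome in the caption.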
Figure 3.
Generative process for the DLCoal model. (A) Given a species tree S with known topology and divergence times, a top-down dup-loss process generates a locus tree TL, which contains duplication nodes (star), and each daughter duplicate is indicated by a daughter edge δL (dark red). From the locus tree, the bottom-up multilocus coalescent (MLC) process generates a gene tree TG. Mappings between the trees represented by RG and RL indicate how one tree “fits inside” the other. This diagram depicts the same gene family as Figure 2A. (B) The multispecies coalescent and dup-loss model are special cases of DLCoal. When there are no duplications or losses (i.e., locus tree and species tree congruence), the model simplifies to the multispecies coalescent. (C) When ILS is assumed not to occur (i.e., gene tree and locus tree congruence), the model simplifies to the birth–death model for duplication and loss.
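The top-down dup-loss step in panel A can be sketched for a single species-tree branch. The code below is an illustrative assumption, not the DLCoal software: it treats duplication and loss as a linear birth-death process and only counts how many loci survive to the end of the branch, ignoring tree topology and the coalescent step. The function names and the branch length used in the example are hypothetical.

```python
import random


def loci_after_branch(birth, death, duration, rng):
    """Number of loci at the end of one branch, starting from one locus.

    Assumption: each locus duplicates at rate `birth` and is lost at
    rate `death` (same time units as `duration`), independently.
    """
    if birth + death == 0:
        return 1  # no events can occur
    count, elapsed = 1, 0.0
    while count > 0:
        # Waiting time to the next event across all current loci.
        elapsed += rng.expovariate(count * (birth + death))
        if elapsed > duration:
            break
        if rng.random() < birth / (birth + death):
            count += 1  # duplication creates a new locus
        else:
            count -= 1  # loss removes a locus
    return count


def mean_loci(birth, death, duration, trials, seed=0):
    """Average locus count over many simulated branches."""
    rng = random.Random(seed)
    total = sum(loci_after_branch(birth, death, duration, rng)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # Equal rates, as in the Figure 7 simulations; 50 My is an arbitrary length.
    print(mean_loci(birth=0.0048, death=0.0048, duration=50.0, trials=20000))
```

When the duplication and loss rates are equal, the expected number of loci stays at one even though individual families grow or go extinct.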
Figure 4.
Species trees used in evaluation. (A,B) For our simulation evaluations, we used a data set of 15 primates (including two outgroup species) and 12 Drosophila species. (C) For our evaluation on real data, we used 16 species of fungi.
Figure 5.
Increased performance of DLCoalRecon in simulated fly and primate gene trees. DLCoalRecon (solid) and MPR (dashed) were used to reconcile 500 fly and 500 primate simulated gene trees. Duplications and losses were simulated at rates that were the same as (1×, red), twice (2×, green), and four times (4×, blue) the rate estimated in real data. Increased performance is seen both in the precision of inferring duplications and losses (A,B,D,E) as well as the accuracy of reconstructing the locus tree topology (C,F).
Figure 6.
Cumulative distribution of duplication consistency scores. Each gene tree reconstruction program was used genome-wide to infer the duplications present in 16 fungi species. For each duplication, we computed the consistency score. Among all of the programs, the combination of PhyML+DLCoalRecon infers the fewest duplications with a score of zero (1.6%) and the most duplications with a score of one (74.5%).
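The caption does not restate how the consistency score is computed. The sketch below assumes the commonly used definition: for a duplication node, the number of species found under both child subtrees divided by the number of species found under either. The function name and the species labels are hypothetical.

```python
def duplication_consistency(left_species, right_species):
    """Consistency score of one duplication node.

    Assumption: score = |L intersect R| / |L union R|, where L and R are the
    species represented below the two children of the duplication.
    """
    left, right = set(left_species), set(right_species)
    union = left | right
    if not union:
        raise ValueError("a duplication node needs at least one descendant species")
    return len(left & right) / len(union)


if __name__ == "__main__":
    # Both copies retained in every species: fully consistent duplication.
    print(duplication_consistency(["sp1", "sp2"], ["sp1", "sp2"]))
    # No species carries both copies: likely a spurious duplication.
    print(duplication_consistency(["sp1"], ["sp2"]))
```

Under this definition a score of zero means no species carries both copies, which is why such duplications are usually treated as suspect.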
Figure 7.
Duplications increase the rate of incomplete lineage sorting (ILS). Using the DLCoal model, we simulated 2000 gene trees for the 12-fly phylogeny, using an effective population size of N = 5 × 10^6, duplication-loss rates of λ = μ = 0.0048 events/gene per million years, and 10 generations/yr. (A) As more gene duplications occur in a gene tree, the probability of ILS increases. (B) Overall, larger gene families tend to have increased ILS frequency. Error bars indicate 95% confidence intervals.
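To give a feel for the simulation parameters, the sketch below converts a branch length in million years into coalescent units and applies the textbook three-lineage result that a gene tree disagrees with its containing tree with probability (2/3)·exp(−T). It is an illustration under stated assumptions, not the paper's simulation: it assumes a diploid population (time scaled by 2N generations) and a single internal branch. The function names are hypothetical.

```python
import math


def coalescent_units(t_myr, n_eff, gens_per_year):
    """Branch length in units of 2N generations (diploid assumption)."""
    generations = t_myr * 1e6 * gens_per_year
    return generations / (2.0 * n_eff)


def discordance_probability(t_myr, n_eff, gens_per_year):
    """Probability of a discordant gene tree for three lineages.

    The two lineages fail to coalesce on the internal branch with
    probability exp(-T); two of the three resulting topologies are
    then discordant.
    """
    t_coal = coalescent_units(t_myr, n_eff, gens_per_year)
    return (2.0 / 3.0) * math.exp(-t_coal)


if __name__ == "__main__":
    # Parameters from the caption: N = 5e6 and 10 generations per year.
    for t_myr in (0.5, 1.0, 5.0):
        print(t_myr, discordance_probability(t_myr, 5e6, 10))
```

With the caption's values, one million years is 10^7 generations, which equals 2N, so one million years corresponds to one coalescent unit under these assumptions.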
