Genome Biol Evol. 2022 Mar 2;14(3):evac013.
doi: 10.1093/gbe/evac013.

Champagne: Automated Whole-Genome Phylogenomic Character Matrix Method Using Large Genomic Indels for Homoplasy-Free Inference

James K Schull et al. Genome Biol Evol.

Abstract

We present Champagne, a whole-genome method for generating character matrices for phylogenomic analysis using large genomic indel events. By rigorously picking orthologous genes and locating large insertion and deletion events, Champagne delivers a character matrix that considerably reduces homoplasy compared with morphological and nucleotide-based matrices, on both established phylogenies and difficult-to-resolve nodes in the mammalian tree. Champagne provides ample evidence in the form of genomic structural variation to support incomplete lineage sorting (ILS) and possible introgression in Paenungulata and human–chimp–gorilla, which were previously inferred primarily through matrices composed of aligned single-nucleotide characters. Champagne also offers further evidence for Myomorpha as sister to Sciuridae and Hystricomorpha in the rodent tree. Champagne harbors distinct theoretical advantages as an automated method that produces nearly homoplasy-free character matrices on the whole-genome scale.

Keywords: homoplasy-free characters; incomplete lineage sorting; phylogenetics; phylogenomics; rare genomic changes.

Figures

Fig. 1.
Champagne supports Hyracoidea as the sister group in the Paenungulata tree. (A) The maximum parsimony tree generated by PAUP* using Champagne’s character matrix for Paenungulata (rock hyrax, (elephant, manatee)), as well as the other two less parsimonious alternatives. The high number of Champagne-identified indels supporting each topology (and a moderate retention index, RI) likely reflects incomplete lineage sorting (ILS) at the root of this subtree, and the imbalance of evidence per topology could be suggestive of introgression. (B) A multiple sequence alignment for a 124-bp deletion shared by elephant and manatee, one of 406 that support our maximum parsimony topology. (C) A multiple sequence alignment for an 87-bp deletion shared by elephant and manatee that also supports our maximum parsimony topology. (D) A multiple sequence alignment for a 152-bp insertion shared by elephant and rock hyrax, supporting the alternative topology ((elephant, rock hyrax), manatee).
Fig. 2.
Champagne correctly reconstructs primate phylogeny, finding evidence for human–chimp–gorilla ILS. (A) At each node in the tree, we depict the number of indels identified by Champagne that support the corresponding clade. (B) Champagne finds 93, 67, and 35 indels supporting gorilla, human, and chimpanzee, respectively, as outgroup to the other two species, suggesting a prevalence of ILS and possible introgression at this node. (C) A multiple sequence alignment for an 87-bp deletion shared uniquely by human and chimpanzee.
Fig. 3.
Champagne places Myomorpha sister to Sciuridae and Hystricomorpha. (A) The maximum parsimony tree generated by PAUP* using Champagne’s character matrix for a subset of rodents (left), alongside two less parsimonious trees that reflect alternate branching relationships between Sciuridae, Myomorpha, and Hystricomorpha. Sixty-six indels support Myomorpha as the sister clade, whereas only 8 and 3 support the other alternatives. (B) A multiple sequence alignment for a 54-bp deletion shared by guinea pig, naked mole rat, marmot, and squirrel. (C) A multiple sequence alignment for a 125-bp insertion shared by guinea pig, marmot, and squirrel.
Fig. 4.
An overview of the Champagne approach for speciation topology inference. In step 1, we use pairwise alignment chains between the outgroup (also used as reference) and each ingroup species (used as query) to assign at most one orthologous chain with high confidence to each reference gene. The figure illustrates this procedure (based on Turakhia et al. [2020]) for a single outgroup–ingroup pair (human–pig) and a single reference gene. Each coding base-pair in the gene is assigned to the highest-scoring chain overlapping with the gene. If the highest-scoring overlapping chain also has the most base-pairs assigned, it is chosen as the best ortholog candidate (as shown). If gene-in-synteny and 1-to-1 mapping criteria are also satisfied (see Materials and Methods), the best candidate chain is assigned as the gene’s ortholog. In all remaining cases, no assignment is made. In step 2, intragenic orthologous regions in all query species are scanned for each reference gene in search of phylogenetically informative, shared indels within the ingroup (see Materials and Methods and fig. 5 for details). In our illustration, four informative indels (labeled A, B, C, and D) are found. In step 3, the informative indels are printed to a NEXUS file, which is the final output of Champagne. In this example, we use this matrix in step 4 to infer the most parsimonious species tree, here ((pig, cow), dog), using PAUP* (Swofford 2002). Indels A and B in step 2 provide supporting evidence for ((pig, cow), dog), as only pig and cow share both indels. The other two indels, C and D, support the ((cow, dog), pig) and ((pig, dog), cow) trees as most parsimonious, respectively. The low RI (0.5 out of a maximum of 1) in this example reflects the relatively large fraction of nonsupporting, homoplasy-like evidence in this topology assignment.
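The RI of 0.5 quoted for the four-indel illustration can be reproduced by hand. The sketch below is not Champagne's or PAUP*'s code; it assumes that each indel is coded 1 where it is shared and 0 elsewhere, that the human outgroup is 0 for all four characters, and that steps are counted with Fitch parsimony. The names `MATRIX`, `TREE`, `fitch_steps`, and `retention_index` are illustrative only.

```python
# Presence (1) / absence (0) of indels A, B, C, D from the Fig. 4 illustration.
# Assumption: human (outgroup and reference) is 0 for every indel.
MATRIX = {
    "human": [0, 0, 0, 0],
    "pig":   [1, 1, 0, 1],
    "cow":   [1, 1, 1, 0],
    "dog":   [0, 0, 1, 1],
}

# Rooted tree (human, ((pig, cow), dog)) written as nested tuples.
TREE = ("human", (("pig", "cow"), "dog"))


def fitch_steps(tree, states):
    """Return (state_set, steps) for one character using Fitch's algorithm."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_steps + right_steps + 1


def retention_index(tree, matrix):
    """RI = (g - s) / (g - m), summed over binary characters."""
    n_chars = len(next(iter(matrix.values())))
    s = m = g = 0
    for i in range(n_chars):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        ones = sum(column.values())
        zeros = len(column) - ones
        s += fitch_steps(tree, column)[1]  # observed steps on this tree
        m += 1 if ones and zeros else 0     # minimum steps on any tree
        g += min(ones, zeros)               # maximum steps on any tree
    return (g - s) / (g - m)


print(retention_index(TREE, MATRIX))
```

On ((pig, cow), dog), indels A and B each need one step while C and D each need two, so s = 6, m = 4, and g = 8, giving (8 − 6) / (8 − 4) = 0.5, in agreement with the caption.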
Fig. 5.
Champagne’s indel verification method. (A) Shared insertion between pig and cow detected by Champagne that is absent in dog. (1) We first identify the presence of this insertion by finding a single-sided human gap in the human–pig orthologous chains, at human coordinate X. (2) Next, finding no such single-sided gap in the dog chain near X, we mark the insertion as likely absent in dog. (3) Next, we navigate to coordinate X in the human–cow chains, and check for a large (similar-sized) gap within a 5-bp range of X. Finding such a gap, indicating an insertion, we mark the insertion as likely present in cow. (4) Finally, we perform a direct sequence comparison for sequence similarity. We extract a 30-bp “window” sequence from either side of the insertion coordinate X in human, either side of the corresponding insertion coordinate in dog, and either side of the insertion itself in cow and pig. We also extract the sequence of the insertion itself in cow and pig. We then align the reference window sequences against each other species’ window sequences. Similarly, we align pig’s insertion sequence against cow’s insertion sequence. For each species in which we marked the indel as present, if the minimum sequence similarity for the left window, right window, and insertion (if the insertion is present) is greater than our stipulated threshold, we mark the species as definitively “+.” For each species in which we marked the indel as absent, if the sequence similarities for the left window and right window are greater than our stipulated threshold, we mark the species as definitively “−.” In either case, if a comparison fails to meet the threshold, we mark the species as “?.” (B) Symmetrical process for finding shared deletions.
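The final decision in step (4) can be sketched as a small function. This is an illustration under stated assumptions, not Champagne's implementation: `difflib.SequenceMatcher` stands in for the pairwise alignment the authors perform, the threshold value 0.8 is hypothetical (the caption does not state it), and the function and parameter names are invented for this example.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical; the caption does not give the stipulated value


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def call_state(marked_present, ref_left, ref_right, left, right,
               insertion=None, other_insertion=None, threshold=THRESHOLD):
    """Return '+', '-' or '?' for one species at one candidate insertion.

    ref_left / ref_right: flanking windows in the reference (human).
    left / right: flanking windows in the species being called.
    insertion / other_insertion: inserted sequence in this species and in
    another species that also appears to carry it (only used when present).
    """
    scores = [similarity(ref_left, left), similarity(ref_right, right)]
    if marked_present and insertion is not None and other_insertion is not None:
        # The inserted sequences of the carriers must also match each other.
        scores.append(similarity(insertion, other_insertion))
    if min(scores) > threshold:
        return "+" if marked_present else "-"
    return "?"


# Toy 30-bp windows and a toy insertion, made up for illustration.
left = "ACGTACGTACGTACGTACGTACGTACGTAC"
right = "TTGACCATGGTTGACCATGGTTGACCATGG"
ins = "GATTACA" * 5

print(call_state(True, left, right, left, right, ins, ins))   # carrier
print(call_state(False, left, right, left, right))            # non-carrier
print(call_state(False, left, right, "N" * 30, right))        # poor flank
```

With these toy inputs the three calls give "+", "-", and "?": identical windows score 1.0, while the all-N flank shares nothing with the reference and falls below the threshold.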
Fig. 6.
A multiple-species alignment showing indels identified by Champagne in the pig, cow, and dog genomes, using human as reference species. (A) An illustration of the real pig, cow, and dog chains that align with a 14-Mb section of human chromosome 2. Indels identified by Champagne in this section of the reference genome are shown: “I” indicates shared insertions, and “D” indicates shared deletions. On this stretch, we find five indels that are shared by pig and cow, supporting the most parsimonious topology ((pig, cow), dog), and only one (shown with a dashed arc) that is shared by dog and cow, possibly due to ILS. (B) A multiple sequence alignment of an 81-bp deletion shared by pig and cow, but not dog (leftmost deletion in panel A).
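As a rough picture of what a character matrix for this stretch could look like, the sketch below writes the six indels (five shared by pig and cow, one shared by dog and cow) as a standard NEXUS DATA block that PAUP* can read. The exact layout of Champagne's own NEXUS output is not described here, so the `FORMAT` line, the 0/1 coding, the all-zero human row, and the helper name `to_nexus` are assumptions made for illustration.

```python
def to_nexus(matrix):
    """Format a taxon -> character-string mapping as a NEXUS DATA block."""
    n_chars = {len(row) for row in matrix.values()}
    if len(n_chars) != 1:
        raise ValueError("all taxa need the same number of characters")
    width = max(len(name) for name in matrix)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={n_chars.pop()};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    # One row per taxon, names padded so the character columns line up.
    lines += [f"    {name.ljust(width)}  {row}" for name, row in matrix.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Six indels from the Fig. 6 stretch; the character order is arbitrary.
# Assumption: 1 = shared indel state, 0 = reference-like state.
matrix = {
    "human": "000000",
    "pig":   "111110",
    "cow":   "111111",
    "dog":   "000001",
}
print(to_nexus(matrix))
```

A species marked “?” during verification would simply carry `?` in the corresponding column, which the `MISSING=?` setting tells the parsimony program to treat as unknown.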
