Bioinformatics. 2022 May 13;38(10):2719-2726.
doi: 10.1093/bioinformatics/btac186.

TopHap: rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity

Marcos A Caraballo-Ortiz et al. Bioinformatics.

Abstract

Motivation: Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts to sequence genomes and reconstruct the phylogeny of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains exemplify these difficulties, since there are only hundreds of phylogenetically informative sites but millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants, improving the signal-to-noise ratio for more accurate and faster phylogenetic inference of resolvable phylogenetic features.

Results: We present the TopHap approach, which determines spatiotemporally common haplotypes of common variants and builds their phylogeny in a fraction of the computational time required by traditional methods. We develop a bootstrap strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68 057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely differ by four mutations from the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major and recent variants of concern.

Availability and implementation: TopHap is available at https://github.com/SayakaMiura/TopHap.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Traditional phylogenetic approach versus the new TopHap approach for a dataset that contains many sequences with few variants. (a) The true tree shows three simulated mutant haplotypes. In this example, three mutations (α, β and γ) occurred sequentially and gave rise to haplotypes H1, H2 and H3. The size of the triangle at each tip is proportional to the number of genomes containing that haplotype. (b) Phylogenetic approaches use a multiple sequence alignment (MSA), simplified here to only three informative variants. Due to sequencing errors, a few spurious haplotypes (H4–H6) may be observed at low frequencies (0.3–1%). The inclusion of these spurious haplotypes misguides standard phylogeny methods (e.g. ML and MP) and produces incorrect evolutionary inferences. (c) The result of a typical ML approach suggests that the spurious haplotypes H6 and H5 were the first to arise. The bootstrap confidence limits for all the branching patterns are low (<50%) because each branch is only one mutation long, a situation where the bootstrap method is known to be powerless (see text). (d) The TopHap approach was able to infer the correct tree because it restricts phylogenetic analysis to haplotypes with >1% frequency.
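The frequency cutoff described in panel (d) can be illustrated with a minimal sketch. It is not taken from the TopHap code: the function name `common_haplotypes`, the `min_freq` parameter and the haplotype counts are all invented for illustration, with the three spurious haplotypes set to frequencies between 0.3% and 1% as in the caption.

```python
from collections import Counter

def common_haplotypes(haplotypes, min_freq=0.01):
    """Return {haplotype: frequency} for haplotypes with frequency above min_freq.

    Hypothetical helper for illustration; not part of the TopHap software.
    """
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {h: c / total for h, c in counts.items() if c / total > min_freq}

# Made-up sample of 1000 genomes over three sites ('0' = ancestral, '1' = mutant).
sample = (
    ["100"] * 500    # H1: carries alpha
    + ["110"] * 300  # H2: carries alpha and beta
    + ["111"] * 182  # H3: carries alpha, beta and gamma
    + ["010"] * 9    # H4: assumed spurious, 0.9%
    + ["001"] * 6    # H5: assumed spurious, 0.6%
    + ["011"] * 3    # H6: assumed spurious, 0.3%
)

kept = common_haplotypes(sample, min_freq=0.01)
print(sorted(kept))  # only the haplotypes above the 1% cutoff remain
```

Only the retained haplotypes would then be passed to a standard tree-building method, which is the step this sketch leaves out.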
Fig. 2.
Overview of the TopHap approach. Input to TopHap is an alignment of genome sequences (n sequences, m bases each). TopHap first identifies high-frequency variants (>maf) and produces a restricted alignment with n sequences and k bases. Next, high-frequency haplotypes (>hf) are identified, resulting in a reduced alignment of h haplotypes each with k bases. These haplotypes are subjected to standard phylogenetic inference. To compute bootstrap confidence limits, TopHap resamples n haplotypes with replacement to form a replicate n × k dataset, which is followed by the identification of high-frequency haplotypes (>hf) and the inference of their phylogeny. This process is repeated for the desired number of bootstrap replicates and a consensus phylogeny of haplotypes found in all replicates is produced. Spatiotemporal information can also be used to construct subsets in which variants and haplotypes are identified for each spatiotemporal slice separately (see Supplementary Fig. S1)
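The filtering and resampling steps in this overview can be sketched as follows. This is an illustration under stated assumptions, not the TopHap implementation: all function names and the toy alignment are hypothetical, the minor-allele frequency of a column is assumed to be the frequency of its second most common base, and tree inference and consensus building are omitted. The sketch only records how often each haplotype passes the `hf` cutoff across bootstrap replicates.

```python
import random
from collections import Counter

def common_variant_sites(seqs, maf):
    """Column indices whose minor-allele frequency exceeds maf.

    Assumption: the minor allele is the second most common base in the column.
    """
    n = len(seqs)
    sites = []
    for j in range(len(seqs[0])):
        counts = Counter(s[j] for s in seqs)
        if len(counts) < 2:
            continue  # invariant column
        minor_freq = counts.most_common(2)[1][1] / n
        if minor_freq > maf:
            sites.append(j)
    return sites

def restrict(seqs, sites):
    """Reduce the n x m alignment to n x k by keeping only the selected columns."""
    return ["".join(s[j] for j in sites) for s in seqs]

def frequent_haplotypes(haps, hf):
    """Set of haplotypes whose frequency exceeds hf."""
    n = len(haps)
    return {h for h, c in Counter(haps).items() if c / n > hf}

def bootstrap_haplotype_support(haps, hf, replicates, seed=0):
    """Fraction of replicates in which each haplotype passes the hf cutoff.

    Each replicate resamples n rows of the restricted alignment with replacement.
    """
    rng = random.Random(seed)
    n = len(haps)
    support = Counter()
    for _ in range(replicates):
        resample = rng.choices(haps, k=n)
        support.update(frequent_haplotypes(resample, hf))
    return {h: c / replicates for h, c in support.items()}

# Toy alignment (made up): n = 100 sequences, m = 5 bases each.
seqs = ["AAAAA"] * 60 + ["ACAAA"] * 30 + ["ACGAA"] * 9 + ["AAAAT"] * 1

sites = common_variant_sites(seqs, maf=0.05)   # high-frequency variant columns
haps = restrict(seqs, sites)                   # n x k restricted alignment
frequent = frequent_haplotypes(haps, hf=0.05)  # h high-frequency haplotypes
support = bootstrap_haplotype_support(haps, hf=0.05, replicates=200)
print(sites, sorted(frequent), support)
```

In this toy example the singleton change in the last column is dropped at the variant-filtering step, so the sequence carrying it collapses onto the most common haplotype. A haplotype close to the `hf` cutoff may pass in some replicates and fail in others, which is the kind of instability the bootstrap is meant to expose.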
Fig. 3.
The TopHap phylogeny of 68KG SARS-CoV-2 major haplotypes. Numbers near nodes are bootstrap confidence limits derived from bootstrap resampling of genomes. Mapped mutations are shown on branches. When the same mutations were included in Kumar et al. (2021), their mutation IDs (Greek symbols) are shown. The mutations and their genomic positions are given on the right side. The Nextstrain clade ID was annotated based on the diagnostic mutations of each clade and is provided at the far right. The PANGO lineage was annotated for each genome using the PANGOLIN software (Rambaut et al., 2020). We also annotated the TopHap haplotype of each genome by comparing its haplotype with the TopHap haplotypes. When an observed haplotype did not perfectly match any of the TopHap haplotypes, we did not assign one to the genome. Using these genome annotations, we paired each TopHap haplotype with the major PANGO lineage, and the percentage of genomes containing it is given in parentheses.
Fig. 4.
Relationship between the number of branches from the root to a tip and (a) the global mutant nucleotide frequency and (b) the first time the mutation was observed. Numbers are the tip identifiers from Figure 3. The same color code was used in (b). Days are counted from the first sample date (December 24, 2019).
Fig. 5.
The comparison of TopHap phylogeny with the (a) Nextstrain and (b) PANGO phylogenies. (a) Only clades included in the 68KG data are shown. (b) Only PANGO lineages that were included in the TopHap phylogeny were used. Corresponding PANGO IDs are found in Figure 3
Fig. 6.
The 1MG TopHap Phylogeny. (a) Numbers near nodes are bootstrap confidence limits derived from bootstrap resampling of genomes. Early mutations that were predicted in Kumar et al. (2021) are shown on branches using their mutation IDs (Greek symbols). Their mutations and genomic positions are given in Figure 3. The haplotypes with concerning mutations are indicated by using WHO IDs, and 20A EU2 and 20E (EU1) are Nextstrain clade IDs. These haplotypes were identified by annotating PANGO and Nextstrain lineage for each genome. We also annotated TopHap haplotype for each genome by comparing its haplotype with TopHap haplotypes. When an observed haplotype did not perfectly match any of the TopHap haplotypes, we did not assign any for the genome. Using these genome annotations, we paired each TopHap haplotype with the major PANGO and Nextstrain lineage, which contained the WHO annotation. We assigned WHO ID when at least one of the annotations indicated it. Evolutionary relationship of lineages with concerning mutations by (b) Nextstrain and (c) TopHap
Fig. 7.
The early history of SARS-CoV-2 variants. Five root positions are explored in which the haplotype with mutation x has been added to the TopHap phylogeny in Figure 3 (a and b), Kumar et al. (2021) mutational history (b), Bloom (2021) phylogeny (b and c), PANGO classification (d) and the Nextstrain classification (e). Haplotypes have eight positions that contain variants α1–α3, β1–β3, ν1–ν2 and x. Genomic positions are shown wherever a mutation occurs: a green highlighted box with a letter for forward mutations and a red highlighted box without a letter for backward mutations. Using the MP criterion, we placed the haplotype with the x variant into each phylogeny. TopHap had two equally parsimonious solutions (a and b), of which the ML placement predicted scenario (a). ML lnL and the number of MP substitutions are shown. WH-1 is the haplotype corresponding to the Wuhan-1 genome. The gray triangle represents all the other SARS-CoV-2 haplotypes of the ongoing infections in the world.

References

    1. Andersen K.G. et al. (2020) The proximal origin of SARS-CoV-2. Nat. Med., 26, 450–452.
    2. Berger S.A. et al. (2011) Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst. Biol., 60, 291–302.
    3. Bloom J.D. (2021) Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic. Mol. Biol. Evol., 38, 5211–5224.
    4. Bouckaert R.R. (2010) DensiTree: making sense of sets of phylogenetic trees. Bioinformatics, 26, 1372–1373.
    5. Felsenstein J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39, 783–791.