Genes (Basel). 2023 Sep 22;14(10):1837.
doi: 10.3390/genes14101837.

SNPtotree-Resolving the Phylogeny of SNPs on Non-Recombining DNA

Zehra Köksal et al. Genes (Basel). 2023.

Abstract

Genetic variants on non-recombining DNA and the hierarchical order in which they accumulate are commonly of interest. This variant hierarchy can be established and combined with information on the population and geographic origin of the individuals carrying the variants to find population structures and infer migration patterns. Further, individuals can be assigned to the characterized populations, which is relevant in forensic genetics, genetic genealogy, and epidemiologic studies. However, there is currently no straightforward method to obtain such a variant hierarchy. Here, we introduce the software SNPtotree v1.0, which uniquely determines the hierarchical order of variants on non-recombining DNA without error-prone manual sorting. The algorithm uses pairwise variant comparisons to infer their relationships and integrates the combined information into a phylogenetic tree. Variants that have contradictory pairwise relationships or ambiguous positions in the tree are removed by the software. When benchmarked using two human Y-chromosomal massively parallel sequencing datasets, SNPtotree outperforms traditional methods in the accuracy of phylogenetic trees for sequencing data with high amounts of missing information. The phylogenetic trees of variants created using SNPtotree can be used to establish and maintain publicly available phylogeny databases to further explore genetic epidemiology and genealogy, as well as population and forensic genetics.

Keywords: SNPs; evolutionary genetics; haploid markers; non-recombining DNA; phylogenetic tree; population genetics; software.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Workflow of the SNPtotree algorithm. An input file of the ancestral (A) or derived (D) allelic states of the polymorphic sites is required. Missing data should be indicated using an “X”. The header row represents the individuals’ labels (S1, S2, and S3), and the first column represents the variant names (m1, m2, m3, and m4). The algorithm consists of three steps: first (1), all pairs of variants are compared to each other to predict the pairwise relationships. Variants with contradictory relationships are removed. Second (2), variants that are not separable are predicted to be equal. During the process of finding equal variants, variants with ambiguous positions in the tree are removed. Finally (3), the hierarchical variant order is inferred, and the phylogenetic tree is generated. Additional output files (see Section 2) provide the statistical support values for each variant in the tree, and metadata specifies the sequences carrying the respective variants in the different branches of the csv output tree. Optional output files are marked with an asterisk.
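
The input layout described in the Figure 1 caption can be illustrated with a small parser. This is a minimal sketch, not SNPtotree's own reader: the comma delimiter, the function name `read_allele_table`, the header label `variant`, and the example values are all assumptions made for illustration.

```python
import csv
import io

# Allowed cell values: ancestral, derived, missing (as described in the caption).
VALID_STATES = {"A", "D", "X"}


def read_allele_table(text, delimiter=","):
    """Parse a variant-by-individual table into {variant: {individual: state}}.

    The first row holds the individuals' labels, the first column the
    variant names. The delimiter is an assumption, not a documented format.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    individuals = header[1:]
    table = {}
    for row in body:
        variant, states = row[0], row[1:]
        if len(states) != len(individuals):
            raise ValueError(f"{variant}: expected {len(individuals)} states")
        unexpected = set(states) - VALID_STATES
        if unexpected:
            raise ValueError(f"{variant}: unexpected states {sorted(unexpected)}")
        table[variant] = dict(zip(individuals, states))
    return table


# Made-up example values; they are not taken from the figure.
example = """variant,S1,S2,S3
m1,D,D,D
m2,D,D,A
m3,A,X,D
m4,D,A,A
"""

table = read_allele_table(example)
print(table["m3"])
```

Storing the table as a nested dictionary keeps each lookup of one variant's state in one individual a direct key access, which is convenient for pairwise comparisons of variants.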
Figure 2
To determine whether variants M1 and M3 are (A) upstream/downstream, (B) parallel, or (C) equal to each other, (D) their pairwise relationships are assessed in a two-way comparison. Among all sequences (S1–S12), only those sequences that have a derived allelic state for M1 (or M3 in the second comparison) are considered. All observed allelic states of the remaining variant M3 (or M1) are documented, and the resulting relationships are compared. The consensus relationship defines the final pairwise relationship between variants M1 and M3.
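
The two-way comparison in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the SNPtotree implementation: the function names, the label strings, the handling of missing data (sequences with an "X" for either variant are skipped), and the example table are hypothetical.

```python
def observed_states(table, given, other):
    """States of `other` in sequences where `given` is derived ("D").

    Sequences in which `other` is missing ("X") are ignored.
    """
    return {
        table[other][seq]
        for seq, state in table[given].items()
        if state == "D" and table[other][seq] != "X"
    }


def pairwise_relationship(table, a, b):
    """Classify variant `a` relative to variant `b` from both directions."""
    b_given_a = observed_states(table, a, b)  # first comparison
    a_given_b = observed_states(table, b, a)  # second comparison
    if not b_given_a or not a_given_b:
        return "unknown"        # no informative sequences in one direction
    if b_given_a == {"D"} and a_given_b == {"D"}:
        return "equal"          # the data cannot separate a and b
    if b_given_a == {"A", "D"} and a_given_b == {"D"}:
        return "upstream"       # b is derived only where a is derived
    if b_given_a == {"D"} and a_given_b == {"A", "D"}:
        return "downstream"     # a is derived only where b is derived
    if b_given_a == {"A"} and a_given_b == {"A"}:
        return "parallel"       # a and b are never derived together
    return "contradictory"      # the two directions disagree


# Hypothetical table: {variant: {sequence: state}}
table = {
    "M1": {"S1": "D", "S2": "D", "S3": "D", "S4": "A"},
    "M2": {"S1": "D", "S2": "A", "S3": "X", "S4": "A"},
    "M3": {"S1": "A", "S2": "A", "S3": "A", "S4": "D"},
}

print(pairwise_relationship(table, "M1", "M2"))  # upstream
print(pairwise_relationship(table, "M1", "M3"))  # parallel
```

A pair that falls through to "contradictory" corresponds to the kind of variant the caption of Figure 1 describes as being removed in the first step.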
Figure 3
Examples of how SNPtotree combines equal variants and removes variants with ambiguous positions in the tree caused by a lack of representative data, e.g., due to missing information. The relationships between the 15 variants (M1 to M15) are based on the dataset in Table 1. Variants reported in the same sequence but presented in parallel branches are highlighted in gray boxes. In the first example (1), M1 and M4 are always found in the same allelic state in the sequences where both variants have been typed. Since M1 and M4 have the same downstream variants (M2 and M3), SNPtotree will consider M1 and M4 as equal. Furthermore, SNPtotree universally joins all branch tip variants (i.e., those without any downstream variants) that share their immediate upstream variant into groups of “equal” variants. Thus, M2 and M3 are considered equal as well. In the second example (2), M5 and M8 share one of their downstream variants (M6), and M5 has a unique downstream variant (M7). To avoid double entries, the upstream variant with fewer downstream variants (M8) is removed if its relationships to the remaining variants (M5, M7) are unknown. In the third example (3), variants M9 and M13 share the downstream variant M11. Additionally, M9 and M13 have unique downstream variants (M10, M12, and M14). However, there is insufficient information connecting the two subtrees. To avoid the introduction of incorrect phylogenetic relationships, SNPtotree removes variants with several possible positions in the tree (M13). The final example (4) presents a rule for variants (M10 and M15) that were reported in the same state in all sequences but did not share any downstream variants. In this example, M10, M12, and M15 are downstream of M9 (Table 1). However, the relationship between M15 and the other variants is unknown because of missing data from M15 in some sequences. SNPtotree removes M15 to maintain maximum depth in the tree.
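
The rule for branch tip variants in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, assuming the tree is already available as a child-to-parent mapping; the name `group_tip_variants` and the `parent_of` structure are hypothetical and do not reflect SNPtotree's internal data model.

```python
from collections import defaultdict


def group_tip_variants(parent_of):
    """Group branch tip variants by their immediate upstream variant.

    `parent_of` maps each variant to its immediate upstream variant
    (None for a variant at the root). A tip is a variant that is not
    the upstream variant of any other variant.
    """
    has_downstream = set(parent_of.values())
    groups = defaultdict(list)
    for variant, parent in parent_of.items():
        if variant not in has_downstream:
            groups[parent].append(variant)
    return {parent: sorted(tips) for parent, tips in groups.items()}


# M2 and M3 are both directly downstream of M1, as in the first example.
parent_of = {"M1": None, "M2": "M1", "M3": "M1"}
print(group_tip_variants(parent_of))  # {'M1': ['M2', 'M3']}
```

Each list with more than one entry corresponds to a group of tip variants that would be reported together as “equal”.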
Figure 4
(A) True phylogeny of the 22 variants given in testdata 1 (taken from the ISOGG Y-DNA Haplogroup Tree 2019–2020). Clade names, which precede the SNP names, are highlighted in bold. Please note that a speciation event results in a split into at least two sister lineages. Testdata 1 only contained a small subset of clade Q lineages, and only these were presented here. (B) Phylogeny of clade Q SNPs and lineages resulting from the SNPtotree analysis. (C) Phylogeny of clade Q SNPs and lineages resulting from ML tree construction and manual sorting.
Figure 5
(A) True phylogeny of the 21 subclades determined by all known SNPs from testdata 2 within the human Y-chromosomal clade C1b1 (phylogenies taken from ISOGG Y-DNA Haplogroup Tree 2019–2020). The SNP names were omitted to present the tree in a simple way. (B) Phylogeny of clade C1b1 SNPs and subbranches resulting from SNPtotree analysis. (C) Phylogeny of clade C1b1 SNPs and subbranches resulting from ML tree construction. The nested clades with high BS values composed of sequences with low missing data were colored purple or green, corresponding to the clades in Supplementary Figure S3.
