Syst Biol. 2021 Jun 16;70(4):844-854.
doi: 10.1093/sysbio/syaa081.

A Cautionary Note on the Use of Genotype Callers in Phylogenomics

Pablo Duchen et al. Syst Biol.

Abstract

Next-generation-sequencing genotype callers are commonly used to call variants from newly sequenced species. However, given the current availability of genomic resources, it is still common practice to use only one reference genome for a given genus, or even one reference for an entire clade of a higher taxon. The problem with traditional genotype callers, such as the one from GATK, is that they are optimized for variant calling at the population level. When these callers are used at the phylogenetic level, the consequences for downstream analyses can be substantial. Here, we performed simulations to compare the performance of the genotype callers of GATK and ATLAS, and present their differences at various phylogenetic scales. We show that the genotype caller of GATK substantially underestimates the number of variants at the phylogenetic level, but not at the population level. We also found that the accuracy of heterozygote calls declines with increasing distance to the reference genome. We quantified this decline and found that it is very sharp in GATK, while ATLAS maintains high accuracy even for species moderately divergent from the reference. We further suggest that efforts should be made towards acquiring more reference genomes per species before pursuing large-scale phylogenomic studies. [ATLAS; efficiency of SNP calling; GATK; heterozygote calling; next-generation sequencing; reference genome; variant calling.]
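To make the notion of heterozygote-calling accuracy concrete, the following is a minimal Python sketch of one plausible way to score a caller against simulated truth. It is an illustration under stated assumptions, not the authors' code: the function names, the genotype representation (a pair of alleles per site), and the definition of accuracy (the fraction of truly heterozygous sites whose called genotype matches the true one) are all hypothetical choices made here.

```python
from typing import Dict, Sequence, Tuple

# Assumption: a diploid genotype is a pair of alleles, e.g. ("A", "G").
Genotype = Tuple[str, str]


def is_het(gt: Genotype) -> bool:
    """A genotype is heterozygous when its two alleles differ."""
    return gt[0] != gt[1]


def het_calling_summary(
    true_gts: Sequence[Genotype], called_gts: Sequence[Genotype]
) -> Dict[str, float]:
    """Compare called genotypes with simulated truth, site by site.

    Assumed definition (hypothetical, for illustration only):
    accuracy = correctly called true heterozygotes / all true heterozygotes.
    Allele order within a genotype is ignored.
    """
    if len(true_gts) != len(called_gts):
        raise ValueError("true and called genotypes must cover the same sites")

    true_het = sum(is_het(t) for t in true_gts)
    called_het = sum(is_het(c) for c in called_gts)
    correct_het = sum(
        1
        for t, c in zip(true_gts, called_gts)
        if is_het(t) and sorted(t) == sorted(c)
    )
    # With no true heterozygotes the accuracy is undefined.
    accuracy = correct_het / true_het if true_het else float("nan")
    return {
        "true_het": true_het,
        "called_het": called_het,
        "correct_het": correct_het,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Toy genotypes invented for this example; they are not data from the study.
    truth = [("A", "G"), ("C", "C"), ("T", "C"), ("G", "G")]
    calls = [("G", "A"), ("C", "C"), ("C", "C"), ("G", "T")]
    print(het_calling_summary(truth, calls))
```

Reporting the called and true heterozygote counts alongside the accuracy separates two failure modes: a caller can miss true heterozygotes, and it can also call heterozygotes where the truth is homozygous.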

Figures

Figure 1
Base-case example trees: a birth-death tree A (left), and a recent-burst tree B (right). For each tree we generated three arbitrary rescalings (Large, Medium, and Small), each representing a level of divergence found in phylogenetic studies. The chosen reference genome is shown with an arrow. The colors of the tip labels represent increasing distance to the reference (these colors are used again in Figs. 4 and 5). The command lines and parameters used to generate these trees are described in the Supplementary Material, Section A.1, available on Dryad.
Figure 2
Summary of the main steps taken in this study. The order of steps is given by the numbers in each box. Initial sequence simulation and indexing of the reference genome involve haploid sequences (steps 1, 2, 4, and 6). Simulation of reads, mapping, recalibration, and genotype calling involve diploid sequences (steps 3, 5, 7, 8, and 9). Arrows indicate that the output of one step is used as input for a later step. The programs used and a summary of the command lines are indicated in italics. A complete description of all command lines is given in the Supplementary Material, Section A, available on Dryad.
Figure 3
Number of called variants for the tips of tree A (first row), and tree B (second row). The tips on the x-axis are ordered according to their distance to the reference (the reference being always at the extreme left).
Figure 4
Accuracy of heterozygote calling for tree A at three different phylogenetic scales: Large (first column), Medium (middle column), and Small (last column). The phylogenetic distance from each tip of the phylogeny to the reference is plotted against the heterozygote-calling accuracy (first row), and against the called versus true heterozygotes (second row). The colors of each point or symbol correspond to the tip colors shown in Fig. 1.
Figure 5
Accuracy of heterozygote calling for tree B at three different phylogenetic scales: Large (first column), Medium (middle column), and Small (last column). The phylogenetic distance from each tip of the phylogeny to the reference is plotted against the heterozygote-calling accuracy (first row), and against the called versus true heterozygotes (second row). The colors of each point or symbol correspond to the tip colors shown in Fig. 1.
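
The relationship plotted in Figs. 4 and 5, accuracy against phylogenetic distance to the reference, can be summarized numerically by a slope. The sketch below fits an ordinary least-squares line in plain Python. The linear fit is an assumption chosen here for simplicity and is not necessarily how the authors quantified the decline; the function name and the example values are hypothetical and are not results from the study.

```python
from typing import Sequence


def least_squares_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the ordinary least-squares line of y on x.

    Here x would hold each tip's distance to the reference and y the
    heterozygote-calling accuracy at that tip (assumed inputs).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


if __name__ == "__main__":
    # Made-up values purely to exercise the function.
    distances = [0.0, 1.0, 2.0]
    accuracies = [1.0, 0.8, 0.6]
    print(least_squares_slope(distances, accuracies))
```

A more negative slope indicates a sharper loss of accuracy with distance, which gives a single number for comparing two callers on the same tree and scale.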
