Syst Biol. 2015 Nov;64(6):1032-47.
doi: 10.1093/sysbio/syv053. Epub 2015 Jul 29.

Short Tree, Long Tree, Right Tree, Wrong Tree: New Acquisition Bias Corrections for Inferring SNP Phylogenies

Adam D Leaché et al. Syst Biol. 2015 Nov.

Abstract

Single nucleotide polymorphisms (SNPs) are useful markers for phylogenetic studies owing in part to their ubiquity throughout the genome and ease of collection. Restriction site associated DNA sequencing (RADseq) methods are becoming increasingly popular for SNP data collection, but an assessment of the best practices for using these data in phylogenetics is lacking. We use computer simulations, and new double digest RADseq (ddRADseq) data for the lizard family Phrynosomatidae, to investigate the accuracy of RAD loci for phylogenetic inference. We compare the two primary ways RAD loci are used during phylogenetic analysis: the analysis of full sequences (i.e., SNPs together with invariant sites), and the analysis of SNPs on their own after excluding invariant sites. We find that using full sequences rather than just SNPs is preferable from the perspectives of branch length and topological accuracy, but not of computational time. We introduce two new acquisition bias corrections for dealing with alignments composed exclusively of SNPs, a conditional likelihood method and a reconstituted DNA approach. The conditional likelihood method conditions on the presence of variable characters only (the number of invariant sites that are unsampled but known to exist is not considered), while the reconstituted DNA approach requires the user to specify the exact number of unsampled invariant sites prior to the analysis. Under simulation, branch length biases increase with the amount of missing data for both acquisition bias correction methods, but branch length accuracy is much improved in the reconstituted DNA approach compared to the conditional likelihood approach. Phylogenetic analyses of the empirical data using concatenation or a coalescent-based species tree approach provide strong support for many of the accepted relationships among phrynosomatid lizards, suggesting that RAD loci contain useful phylogenetic signal across a range of divergence times despite the presence of missing data. Phylogenetic analysis of RAD loci requires careful attention to model assumptions, especially if downstream analyses depend on branch lengths.

Keywords: Conditional likelihood; Phrynosoma; Phrynosomatidae; SVDquartets; ddRADseq; maximum likelihood; reconstituted DNA.
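
The two acquisition bias corrections described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes a Jukes-Cantor (JC69) model, a three-taxon star tree with a single shared branch length, no missing data, and a grid search in place of a real optimizer. All function names, the true branch length, and the site count are hypothetical. Splitting the reconstituted invariant sites equally across the four bases is also an assumption made for illustration.

```python
import itertools
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability of observing base b after a branch of length t starting at a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def pattern_prob(pattern, t):
    """Probability of one site pattern on a 3-taxon star tree (every branch has length t)."""
    return sum(
        0.25 * math.prod(jc69_prob(root, leaf, t) for leaf in pattern)
        for root in BASES
    )


def prob_invariant(t):
    """Total probability that a site is invariant (AAA, CCC, GGG, or TTT)."""
    return sum(pattern_prob(b * 3, t) for b in BASES)


def loglik_uncorrected(counts, t):
    """Plain log-likelihood of the observed pattern counts."""
    return sum(n * math.log(pattern_prob(p, t)) for p, n in counts.items())


def loglik_conditional(counts, t):
    """Condition every SNP on being variable: divide each site likelihood by 1 - P(invariant)."""
    n_sites = sum(counts.values())
    return loglik_uncorrected(counts, t) - n_sites * math.log(1.0 - prob_invariant(t))


def loglik_reconstituted(counts, n_invariant, t):
    """Add back a user-specified number of invariant sites (split equally over bases: an assumption)."""
    full = dict(counts)
    for b in BASES:
        full[b * 3] = n_invariant / 4.0
    return loglik_uncorrected(full, t)


def argmax_t(loglik):
    """Crude grid search over branch lengths 0.005 .. 3.0 (illustration only)."""
    grid = [i / 200.0 for i in range(1, 601)]
    return max(grid, key=loglik)


# Hypothetical data: expected pattern counts for 10,000 sites at a true branch length of 0.1,
# with the invariant sites then discarded to mimic a SNP-only alignment.
TRUE_T = 0.1
N_TOTAL = 10000
snp_counts = {}
for combo in itertools.product(BASES, repeat=3):
    pat = "".join(combo)
    if len(set(pat)) > 1:  # keep variable patterns only
        snp_counts[pat] = N_TOTAL * pattern_prob(pat, TRUE_T)
n_invariant = N_TOTAL * prob_invariant(TRUE_T)

t_uncorrected = argmax_t(lambda t: loglik_uncorrected(snp_counts, t))
t_conditional = argmax_t(lambda t: loglik_conditional(snp_counts, t))
t_reconstituted = argmax_t(lambda t: loglik_reconstituted(snp_counts, n_invariant, t))

print("uncorrected:  ", t_uncorrected)
print("conditional:  ", t_conditional)
print("reconstituted:", t_reconstituted)
```

In this idealized setting both corrections recover the true branch length, while the uncorrected SNP-only analysis inflates it. The sketch contains no allelic dropout, so it does not reproduce the missing-data biases the abstract reports for either correction.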

Figures

Figure 1.
Phylogeny for Phrynosomatidae. Topology and clade names follow Leaché and McGuire (2006); Wiens et al. (2010, 2013); Nieto-Montes de Oca et al. (2014); Leaché et al. (2015); Leaché and Linkem (2015). Species numbers are shown in parentheses.
Figure 2.
Species tree topology used for the simulation of RAD loci with allelic dropout (ADO). The pattern of ADO is illustrated for the first 100 of 1000 loci (black=present, white=ADO). For this example, the data assembly required at least 8 out of 10 sequences per locus (min. ind. = 8) for the locus to be included in the alignment (i.e., the maximum amount of missing data at any locus is 20%).
Figure 3.
Tree lengths estimated using acquisition bias correction are sensitive to allelic dropout. Simulations (a) and empirical data for phrynosomatid lizards (b) show similar patterns. The tree length is overestimated by the conditional likelihood correction and underestimated by the reconstituted DNA correction. The true tree length for the simulation is 0.264. Simulations without locus rate variation (a) show similar patterns to those that include rate variation.
Figure 4.
Properties of simulated RAD loci with different amounts of missing data. Loci that contain more missing data tend to result in discordant topologies (a), increased branch length errors (b), and lower bootstrap support (c). Loci that contain less missing data provide higher bootstrap support for shorter branches (d).
Figure 5.
Comparisons of branch lengths estimated from the empirical phrynosomatid lizard data. In comparison to the analysis of full sequences (x-axis), branch lengths are overestimated when no acquisition bias correction is used (a), overestimated with the conditional likelihood correction (b), and underestimated with the reconstituted DNA correction (c). Results are shown for the s50 data matrix, which contains 1915 variable sites.
Figure 6.
Biases in branch lengths (BLs) on phylogenies for phrynosomatid lizards increase as the size of the data matrix increases. Branch colors reflect the relative BL difference between the analysis of full sequences and the conditional likelihood correction (a), and the reconstituted DNA correction (b). Positive values indicate longer branches under the acquisition correction model, and negative values indicate shorter branches under the acquisition correction model. Branches with dashed lines indicate discordant bipartitions.
Figure 7.
Relative Robinson-Foulds (RF) distances between phrynosomatid lizard topologies estimated with full sequences versus topologies estimated with SNPs with no acquisition bias correction (uncorrected), the conditional likelihood correction, and the reconstituted DNA correction.
Figure 8.
Comparison of bootstrap support values from analyses of phrynosomatid lizards using full sequences versus SNPs with no acquisition bias correction (a), the conditional likelihood correction (b), and the reconstituted DNA correction (c). Results are shown for the largest data matrix (s5). On average, analyses of SNPs tend toward slightly higher bootstrap values.
Figure 9.
Phylogeny of phrynosomatid lizards based on an ML analysis of full sequences (matrix s5: 1,256,221 base pairs, 25,709 loci, and 101,937 variable sites). Bootstrap values are shown on the branches.
Figure 10.
Species trees for Phrynosomatidae estimated using SVDquartets for data matrix s5 (a), s25 (b), and s50 (c). Bootstrap values (from 100 replicates) are shown on nodes.
Figure 11.
RAxML search times are faster for acquisition bias correction models, especially for larger data matrices (a), and the speed increase is a result of removing thousands of distinct alignment patterns from the data matrix that are produced by the missing data (b). Compute times exclude bootstrap calculations. All analyses were run on 16-core Intel E5-2650 CPUs with 32GB of RAM.
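
The relative RF (Robinson-Foulds) distance used in Figure 7 to compare topologies can be sketched as follows. This is a minimal illustration, not the software used in the study: it assumes fully resolved trees given as nested Python tuples, treats them as unrooted, and normalizes by the maximum RF distance for binary unrooted trees on n taxa, 2(n - 3). The function names and the five-taxon example trees are hypothetical.

```python
def leaf_set(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions of a tree, treated as unrooted."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # each split is stored as the side that excludes this taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found


def relative_rf(tree1, tree2):
    """RF distance divided by its maximum, 2 * (n - 3), for binary unrooted trees."""
    taxa = leaf_set(tree1)
    if taxa != leaf_set(tree2):
        raise ValueError("trees must share the same taxa")
    if len(taxa) < 4:
        raise ValueError("need at least four taxa")
    return len(splits(tree1) ^ splits(tree2)) / (2 * (len(taxa) - 3))


# Hypothetical five-taxon trees that disagree on one of their two internal branches.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))

print(relative_rf(tree_a, tree_b))  # 0.5
print(relative_rf(tree_a, tree_a))  # 0.0
```

A value of 0 means the two topologies share every bipartition, and a value of 1 means they share none.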
