Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Aug:82:104277.
doi: 10.1016/j.meegid.2020.104277. Epub 2020 Mar 6.

Evaluation of haplotype callers for next-generation sequencing of viruses

Affiliations

Evaluation of haplotype callers for next-generation sequencing of viruses

Anton Eliseev et al. Infect Genet Evol. 2020 Aug.

Abstract

Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed, in regard to highest precision, recall, and lowest UniFrac distance values, the other haplotype reconstruction tools followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.

Keywords: Fast-evolving viruses; HIV; Haplotype reconstruction; Intra-host diversity; Next-generation sequencing; Simulations.

PubMed Disclaimer

Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no competing interests.

Figures

Figure 1.
Figure 1.
Schematic diagram representing the process of reconstructing haplotypes from next-generation sequencing reads by reference-based and de novo methods. (a) A hypothetical virus population consisting of three haplotypes is sequenced by NGS techniques. (b) Reads originating from different haplotypes are identified by distinct colors. (cl) After sequencing, reads are aligned against reference genome (green) as a first step in all reference-based methods. (c2) Read alignment is used for building a graph and candidate haplotypes are reconstructed as maximal cliques in the graph. (c3) Read alignment is used for dividing reads into clusters and candidate haplotypes are reconstructed by concatenation of all reads from clusters. (c4) Alternatively, after sequencing, reads are de novo assembled using De Bruijn or overlap graphs and candidate haplotypes are reconstructed as paths by analyzing the graph structure. (d1) A method based on clique detections overestimates the number of reconstructed haplotypes with relative frequencies. (d2) A method based on clustering procedure underestimates the number of reconstructed haplotypes with relative frequencies. (d3) A de novo method reconstructs the correct number of haplotypes with frequencies, but one inferred haplotype is smaller than the true haplotype and the other two haplotypes are admixtures of the original haplotypes.
Figure 2.
Figure 2.
Simulated population parameters with the haplotype count in each parameter box. All five population replicates are displayed. The color darkens as the number of haplotypes increases. The red dashed box indicates expected HIV-1 mutation rates and effective population sizes.
Figure 3.
Figure 3.
Reference-based haplotype callers: variation of precision values with mutation rate (log-scaled) for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of precision over all valid outputs produced by each software tool for the five haplotype populations. If a tool did not produce any output for any pair of parameters, we included a gap in the corresponding plot.
Figure 4.
Figure 4.
Reference-based haplotype callers: variation of recall values with mutation rate (log-scaled) for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of recall over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.
Figure 5.
Figure 5.
Reference-based haplotype callers: variation of UniFrac distances (EMD) with mutation rate (log-scaled) for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of UniFrac distances over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.
Figure 6.
Figure 6.
Reference-based haplotype callers: Number of predicted haplotypes across levels of underlying genetic diversity. Diversity is measured by Watterson’s theta (θ = 2Neμ.) increasing from left to right. The number of haplotypes reconstructed changes with each tool and is therefore variable across the y-axes. Intra-host HIV-1 and HCV diversity levels are shaded in light blue and shaded light red regions, respectively. The purple shaded area represents the overlap between HIV-1 and HCV diversity levels. If a software tool did not complete haplotype reconstruction within the given time frame, we included a gap in the corresponding plot (see Fig. S1 for more information on dataset completions). For each data point, the color of the data point corresponds to the alignment rate, which changes color from blue (0% alignment rate) to yellow (100% alignment rate), and the size of the data point corresponds to the mutation rate, which increases in size as the mutation rate becomes higher. At lower diversity estimates (θ), the number of haplotypes reconstructed is relatively low for all haplotype callers. As the diversity estimates increase, the haplotype callers reconstruct more haplotypes until the level of genetic diversity becomes too high and the haplotype callers then have trouble reconstructing haplotypes at this high level of diversity (seen by the downward tread on the right side of each graph with low alignment rates). This trend is not unexpected, as high diversity samples have poor alignment rates thus affecting haplotype reconstructing. For ground truth, see Fig. S3.
Figure 7.
Figure 7.
Phylogenetics trees of “true” haplotypes from four simulated datasets. Phylogeny built with maximum-likelihood method (RAxML) and branch length is measured by the number of substitutions per site. (a) dataset with low genetic diversity chosen from our simulation set with parameters Ne = 10000 and μ = 1e – 7; (b) dataset with low Hamming distance simulated as described in Leviyang et al. (2017); (c) dataset with high genetic diversity chosen from our simulation set with parameters Ne = 7500 and μ = 5e – 6; (d) dataset with high Hamming distance simulated as described in Leviyang et al. (2017).
Figure 8.
Figure 8.
De novo haplotype callers: variation of precision values with mutation rate (log-scaled) for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of precision over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.
Figure 9.
Figure 9.
De novo haplotype callers: variation of recall values with the mutation rate (log-scaled) for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of recall over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.
Figure 10.
Figure 10.
De novo haplotype callers: variation of UniFrac distance (EMD) with mutation rate (log-scaled) for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of UniFrac distances over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.
Figure 11.
Figure 11.
Number of predicted haplotypes across levels of underlying genetic diversity for de novo haplotype callers. Diversity is measured by Watterson’s theta (θ = 2Neμ.) increasing from left to right. The number of haplotypes reconstructed changes with each tool and is therefore variable across the y-axes. Intra-host HIV-1 and HCV diversity levels are shaded in light blue and shaded light red regions, respectively. The purple shaded area represents the overlap between HIV-1 and HCV diversity levels. If a software tool did not complete haplotype reconstruction within the given time frame, we included a gap in the corresponding plot (see Fig. S1 for more information on dataset completions). For each data point, the color of the data point corresponds to the alignment rate, which changes color from blue (0% alignment rate) to yellow (100% alignment rate), and the size of the data point corresponds to the mutation rate, which increases in size as the mutation rate becomes higher. At lower diversity estimates (θ), the number of haplotypes reconstructed is relatively low for all haplotype callers. As the diversity estimates increase, the haplotype callers reconstruct more haplotypes until the level of genetic diversity becomes too high and the haplotype callers then have trouble reconstructing haplotypes at this high level of diversity (seen by the downward tread on the right side of each graph with low alignment rates). This trend is not unexpected, as high diversity samples have poor alignment rates thus affecting haplotype reconstructing. For ground truth, see Fig. S3. To address the possibility of false positives, especially for PEHaplo, see Fig. 9.

Similar articles

Cited by

References

    1. Ahn S, Ke Z, Vikalo H, 2018. Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics 34, i23–i31. 10.1093/bioinformatics/bty291 - DOI - PMC - PubMed
    1. Ahn S, Vikalo H, 2017. aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity. J. Comput. Biol 25, 637–648. 10.1089/cmb.2017.0249 - DOI - PubMed
    1. Alnafee K, 2016. Versatile and open software for comparing large genomes 5, 12–23. 10.1186/gb-2004-5-2-r12 - DOI - PMC - PubMed
    1. Arenas M, Posada D, 2014. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol. Biol. Evol 31, 1295–1301. 10.1093/molbev/msu078 - DOI - PMC - PubMed
    1. Astrovskaya I, Tork B, Mangul S, Westbrooks K, Mandoiu I, Balfe P, Zelikovsky A, 2011. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics 12 10.1186/1471-2105-12-S6-S1 - DOI - PMC - PubMed

Publication types

LinkOut - more resources