. 2020 Aug:82:104277.

doi: 10.1016/j.meegid.2020.104277. Epub 2020 Mar 6.

Evaluation of haplotype callers for next-generation sequencing of viruses

Anton Eliseev¹, Keylie M Gibson², Pavel Avdeyev³, Dmitry Novik¹, Matthew L Bendall⁴, Marcos Pérez-Losada⁵, Nikita Alexeev¹, Keith A Crandall⁶

Affiliations

¹ Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia.
² Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA. Electronic address: kmgibson@gwu.edu.
³ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA.
⁴ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.
⁵ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal.
⁶ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.

PMID: 32151775
PMCID: PMC7293574
DOI: 10.1016/j.meegid.2020.104277

Evaluation of haplotype callers for next-generation sequencing of viruses

Anton Eliseev et al. Infect Genet Evol. 2020 Aug.

. 2020 Aug:82:104277.

doi: 10.1016/j.meegid.2020.104277. Epub 2020 Mar 6.

Authors

Anton Eliseev¹, Keylie M Gibson², Pavel Avdeyev³, Dmitry Novik¹, Matthew L Bendall⁴, Marcos Pérez-Losada⁵, Nikita Alexeev¹, Keith A Crandall⁶

Affiliations

¹ Computer Technologies Laboratory, ITMO University, Saint-Petersburg, Russia.
² Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA. Electronic address: kmgibson@gwu.edu.
³ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Mathematics, George Washington University, Washington, DC, USA.
⁴ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.
⁵ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão, Portugal.
⁶ Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Washington, DC, USA; Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, USA.

PMID: 32151775
PMCID: PMC7293574
DOI: 10.1016/j.meegid.2020.104277

Abstract

Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. Previous benchmarks of viral haplotype reconstruction programs used simulation scenarios that are useful from a mathematical perspective but do not reflect viral evolution and epidemiology. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. We simulated coalescent-based populations that spanned known levels of viral genetic diversity, including mutation rates, sample size and effective population size, to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels (especially HIV-1). All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen intra-host HIV-1 infections), haplotype reconstruction quality was highly variable and, on average, poor. All haplotype reconstruction tools, except QuasiRecomb and ShoRAH, greatly underestimated intra-host diversity and the true number of haplotypes. PredictHaplo outperformed, in regard to highest precision, recall, and lowest UniFrac distance values, the other haplotype reconstruction tools followed by CliqueSNV, which, given more computational time, may have outperformed PredictHaplo. Here, we present an extensive comparison of available viral haplotype reconstruction tools and provide insights for future improvements in haplotype reconstruction tools using both short-read and long-read technologies.

Keywords: Fast-evolving viruses; HIV; Haplotype reconstruction; Intra-host diversity; Next-generation sequencing; Simulations.

PubMed Disclaimer

Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no competing interests.

Figures

**Figure 1.**
Schematic diagram representing the process of reconstructing haplotypes from next-generation sequencing reads by reference-based and *de novo* methods. (a) A hypothetical virus population consisting of three haplotypes is sequenced by NGS techniques. (b) Reads originating from different haplotypes are identified by distinct colors. (c_l) After sequencing, reads are aligned against reference genome (green) as a first step in all reference-based methods. (c₂) Read alignment is used for building a graph and candidate haplotypes are reconstructed as maximal cliques in the graph. (c₃) Read alignment is used for dividing reads into clusters and candidate haplotypes are reconstructed by concatenation of all reads from clusters. (c₄) Alternatively, after sequencing, reads are *de novo* assembled using De Bruijn or overlap graphs and candidate haplotypes are reconstructed as paths by analyzing the graph structure. (d₁) A method based on clique detections overestimates the number of reconstructed haplotypes with relative frequencies. (d₂) A method based on clustering procedure underestimates the number of reconstructed haplotypes with relative frequencies. (d₃) A *de novo* method reconstructs the correct number of haplotypes with frequencies, but one inferred haplotype is smaller than the true haplotype and the other two haplotypes are admixtures of the original haplotypes.

**Figure 2.**
Simulated population parameters with the haplotype count in each parameter box. All five population replicates are displayed. The color darkens as the number of haplotypes increases. The red dashed box indicates expected HIV-1 mutation rates and effective population sizes.

**Figure 3.**
Reference-based haplotype callers: variation of precision values with mutation rate (log-scaled) for all considered N_e. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and N_e, we report the mean estimates of precision over all valid outputs produced by each software tool for the five haplotype populations. If a tool did not produce any output for any pair of parameters, we included a gap in the corresponding plot.

**Figure 4.**
Reference-based haplotype callers: variation of recall values with mutation rate (log-scaled) for all considered N_e. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and N_e, we report the mean estimates of recall over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.

**Figure 5.**
Reference-based haplotype callers: variation of UniFrac distances (EMD) with mutation rate (log-scaled) for all considered N_e. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and N_e, we report the mean estimates of UniFrac distances over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.

**Figure 6.**
Reference-based haplotype callers: Number of predicted haplotypes across levels of underlying genetic diversity. Diversity is measured by Watterson’s theta (θ = 2N_eμ.) increasing from left to right. The number of haplotypes reconstructed changes with each tool and is therefore variable across the y-axes. Intra-host HIV-1 and HCV diversity levels are shaded in light blue and shaded light red regions, respectively. The purple shaded area represents the overlap between HIV-1 and HCV diversity levels. If a software tool did not complete haplotype reconstruction within the given time frame, we included a gap in the corresponding plot (see Fig. S1 for more information on dataset completions). For each data point, the color of the data point corresponds to the alignment rate, which changes color from blue (0% alignment rate) to yellow (100% alignment rate), and the size of the data point corresponds to the mutation rate, which increases in size as the mutation rate becomes higher. At lower diversity estimates (θ), the number of haplotypes reconstructed is relatively low for all haplotype callers. As the diversity estimates increase, the haplotype callers reconstruct more haplotypes until the level of genetic diversity becomes too high and the haplotype callers then have trouble reconstructing haplotypes at this high level of diversity (seen by the downward tread on the right side of each graph with low alignment rates). This trend is not unexpected, as high diversity samples have poor alignment rates thus affecting haplotype reconstructing. For ground truth, see Fig. S3.

**Figure 7.**
Phylogenetics trees of “true” haplotypes from four simulated datasets. Phylogeny built with maximum-likelihood method (RAxML) and branch length is measured by the number of substitutions per site. (a) dataset with low genetic diversity chosen from our simulation set with parameters N_e = 10000 and μ = 1e – 7; (b) dataset with low Hamming distance simulated as described in Leviyang et al. (2017); (c) dataset with high genetic diversity chosen from our simulation set with parameters N_e = 7500 and *μ =* 5e – 6; (d) dataset with high Hamming distance simulated as described in Leviyang et al. (2017).

**Figure 8.**
*De novo* haplotype callers: variation of precision values with mutation rate (log-scaled) for all considered N_e. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and N_e, we report the mean estimates of precision over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.

**Figure 9.**
*De novo* haplotype callers: variation of recall values with the mutation rate (log-scaled) for all considered N_e. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and N_e, we report the mean estimates of recall over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.

**Figure 10.**
*De novo* haplotype callers: variation of UniFrac distance (EMD) with mutation rate (log-scaled) for all considered N_e. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and N_e, we report the mean estimates of UniFrac distances over all valid outputs produced by each software tool for five haplotype populations. If a tool did not produce any output for some pair of parameters, we included a gap in the corresponding plot.

**Figure 11.**
Number of predicted haplotypes across levels of underlying genetic diversity for *de novo* haplotype callers. Diversity is measured by Watterson’s theta (θ = 2N_eμ.) increasing from left to right. The number of haplotypes reconstructed changes with each tool and is therefore variable across the y-axes. Intra-host HIV-1 and HCV diversity levels are shaded in light blue and shaded light red regions, respectively. The purple shaded area represents the overlap between HIV-1 and HCV diversity levels. If a software tool did not complete haplotype reconstruction within the given time frame, we included a gap in the corresponding plot (see Fig. S1 for more information on dataset completions). For each data point, the color of the data point corresponds to the alignment rate, which changes color from blue (0% alignment rate) to yellow (100% alignment rate), and the size of the data point corresponds to the mutation rate, which increases in size as the mutation rate becomes higher. At lower diversity estimates (θ), the number of haplotypes reconstructed is relatively low for all haplotype callers. As the diversity estimates increase, the haplotype callers reconstruct more haplotypes until the level of genetic diversity becomes too high and the haplotype callers then have trouble reconstructing haplotypes at this high level of diversity (seen by the downward tread on the right side of each graph with low alignment rates). This trend is not unexpected, as high diversity samples have poor alignment rates thus affecting haplotype reconstructing. For ground truth, see Fig. S3. To address the possibility of false positives, especially for PEHaplo, see Fig. 9.

See this image and copyright information in PMC

Cited by

VirStrain: a strain identification tool for RNA viruses.
Liao H, Cai D, Sun Y. Liao H, et al. Genome Biol. 2022 Jan 31;23(1):38. doi: 10.1186/s13059-022-02609-x. Genome Biol. 2022. PMID: 35101081 Free PMC article.
Novel variants underlying autosomal recessive neurodevelopmental disorders with intellectual disability in Iranian consanguineous families.
Moudi M, Vahidi Mehrjardi MY, Hozhabri H, Metanat Z, Kalantar SM, Taheri M, Ghasemi N, Dehghani M. Moudi M, et al. J Clin Lab Anal. 2022 Feb;36(2):e24241. doi: 10.1002/jcla.24241. Epub 2022 Jan 12. J Clin Lab Anal. 2022. PMID: 35019165 Free PMC article.
V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation.
Fuhrmann L, Jablonski KP, Topolsky I, Batavia AA, Borgsmüller N, Baykal PI, Carrara M, Chen C, Dondi A, Dragan M, Dreifuss D, John A, Langer B, Okoniewski M, du Plessis L, Schmitt U, Singer F, Stadler T, Beerenwinkel N. Fuhrmann L, et al. Gigascience. 2024 Jan 2;13:giae065. doi: 10.1093/gigascience/giae065. Gigascience. 2024. PMID: 39347649 Free PMC article.
VILOCA: sequencing quality-aware viral haplotype reconstruction and mutation calling for short-read and long-read data.
Fuhrmann L, Langer B, Topolsky I, Beerenwinkel N. Fuhrmann L, et al. NAR Genom Bioinform. 2024 Nov 28;6(4):lqae152. doi: 10.1093/nargab/lqae152. eCollection 2024 Dec. NAR Genom Bioinform. 2024. PMID: 39633724 Free PMC article.
Quantifying In-Host Quasispecies Evolution.
Gregori J, Ibañez-Lligoña M, Quer J. Gregori J, et al. Int J Mol Sci. 2023 Jan 9;24(2):1301. doi: 10.3390/ijms24021301. Int J Mol Sci. 2023. PMID: 36674827 Free PMC article.

See all "Cited by" articles

References

1. Ahn S, Ke Z, Vikalo H, 2018. Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics 34, i23–i31. 10.1093/bioinformatics/bty291 - DOI - PMC - PubMed
1. Ahn S, Vikalo H, 2017. aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity. J. Comput. Biol 25, 637–648. 10.1089/cmb.2017.0249 - DOI - PubMed
1. Alnafee K, 2016. Versatile and open software for comparing large genomes 5, 12–23. 10.1186/gb-2004-5-2-r12 - DOI - PMC - PubMed
1. Arenas M, Posada D, 2014. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol. Biol. Evol 31, 1295–1301. 10.1093/molbev/msu078 - DOI - PMC - PubMed
1. Astrovskaya I, Tork B, Mangul S, Westbrooks K, Mandoiu I, Balfe P, Zelikovsky A, 2011. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics 12 10.1186/1471-2105-12-S6-S1 - DOI - PMC - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Evaluation of haplotype callers for next-generation sequencing of viruses

Affiliations

Evaluation of haplotype callers for next-generation sequencing of viruses

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Abstract

Conflict of interest statement

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

Related information

Grants and funding

LinkOut - more resources

Full Text Sources