Bioinformatics. 2017 Aug 15;33(16):2455-2463.

doi: 10.1093/bioinformatics/btx187.

A penalized regression approach to haplotype reconstruction of viral populations arising in early HIV/SIV infection

Sivan Leviyang¹, Igor Griva², Sergio Ita³, Welkin E Johnson⁴

Affiliations

¹ Department of Mathematics and Statistics, Georgetown University, Washington DC, 20057, USA.
² Department of Mathematics, George Mason University, Fairfax, VA 22030, USA.
³ Department of Medicine, University of California - San Diego, La Jolla, CA 92093, USA.
⁴ Department of Biology, Boston College, Chestnut Hill, MA 02467, USA.

PMID: 28379346
PMCID: PMC5870767
DOI: 10.1093/bioinformatics/btx187

Sivan Leviyang et al. Bioinformatics. 2017.

Abstract

Motivation: Next generation sequencing (NGS) has been increasingly applied to characterize viral evolution during HIV and SIV infections. In particular, NGS datasets sampled during the initial months of infection are characterized by relatively low levels of diversity as well as convergent evolution at multiple loci dispersed across the viral genome. Consequently, fully characterizing viral evolution from NGS datasets requires haplotype reconstruction across large regions of the viral genome. Existing haplotype reconstruction algorithms have not been developed with the particular characteristics of early HIV/SIV infection in mind, raising the possibility that better performance could be achieved through a specifically designed algorithm.

Results: Here, we introduce a haplotype reconstruction algorithm, RegressHaplo, specifically designed for low diversity and convergent evolution regimes. The algorithm uses a penalized regression that balances a data fitting term with a penalty term that encourages solutions with few haplotypes. The regression covariates are a large set of potential haplotypes and fitting the regression is made computationally feasible by the low diversity setting. Using simulated and in vivo datasets, we compare RegressHaplo to PredictHaplo and QuRe, two existing haplotype reconstruction algorithms. RegressHaplo performs better than these algorithms on simulated datasets with relatively low diversity levels. We suggest RegressHaplo as a novel tool for the investigation of early infection HIV/SIV datasets and, more generally, low diversity viral NGS datasets.
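
The idea of regressing observed read frequencies on a large set of candidate haplotypes, with a penalty that favors few haplotypes, can be illustrated with a toy example. This is a minimal sketch, not the RegressHaplo implementation: the abstract does not specify the penalty, so a nonnegative L1 (lasso) penalty solved by proximal gradient descent is swapped in as a stand-in. The four binary positions, the adjacent-pair read features, the simulated frequencies, and the names `fit_sparse_frequencies`, `lam`, and `n_iter` are all assumptions made for illustration.

```python
import itertools
import numpy as np

# Hypothetical toy setting: 4 variable positions, alleles coded 0 (reference) or 1 (variant).
n_pos = 4
candidates = list(itertools.product([0, 1], repeat=n_pos))  # 16 potential haplotypes

# Each row of P is one observable: "a read spanning positions (i, i+1) shows alleles (a, b)".
# Column j of P marks which observables candidate haplotype j would produce.
features = [(i, a, b) for i in range(n_pos - 1) for a in (0, 1) for b in (0, 1)]
P = np.array([[float(h[i] == a and h[i + 1] == b) for h in candidates]
              for (i, a, b) in features])

# Simulated (made-up) population: three haplotypes with known frequencies.
truth = {(0, 0, 0, 0): 0.6, (1, 1, 0, 0): 0.3, (0, 0, 1, 1): 0.1}
pi_true = np.array([truth.get(h, 0.0) for h in candidates])
y = P @ pi_true  # noise-free observed frequencies for the sketch


def fit_sparse_frequencies(P, y, lam=0.01, n_iter=20000):
    """Minimize 0.5*||y - P pi||^2 + lam*sum(pi) subject to pi >= 0.

    Stand-in penalty (nonnegative lasso), solved by proximal gradient descent.
    """
    step = 1.0 / np.linalg.norm(P, 2) ** 2  # 1 / Lipschitz constant of the gradient
    pi = np.full(P.shape[1], 1.0 / P.shape[1])
    for _ in range(n_iter):
        grad = P.T @ (P @ pi - y)
        # Gradient step on the fit term, then shrink by lam and clip at zero.
        pi = np.maximum(0.0, pi - step * (grad + lam))
    return pi / pi.sum()  # renormalize so frequencies sum to one


pi_hat = fit_sparse_frequencies(P, y)
recovered = {h: round(float(f), 3) for h, f in zip(candidates, pi_hat) if f > 1e-6}
print(recovered)
```

In this toy case the fit term alone already rules out most candidates, and the penalty shrinks the remaining weights slightly before renormalization. With noisy reads the penalty weight would need tuning, which this sketch does not attempt.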

Contact: sr286@georgetown.edu.

Availability and implementation: https://github.com/SLeviyang/RegressHaplo.

Figures

**Fig. 1**
Precision versus recall of haplotype reconstruction in a low diversity setting. Datasets D1–D4 had identical levels of diversity, 0.6%, and were identical in all other parameters except for error rates of 1.5%, 1.0%, 0.5%, and 0%, respectively. Each datapoint and number above it represent the recall/precision and number of haplotypes (rounded), respectively, averaged over the 10 simulations in the dataset. Cross-bars on each datapoint give the precision and recall SEs. For D1–D3, results annotated with an F are PredictHaplo and QuRe reconstructions using RegressHaplo’s error correction step, see text for details. Cross-bars for these results have been suppressed for readability but were similar to results without RegressHaplo error correction. A haplotype was counted as recovering a simulated haplotype if the Hamming distance between the two was 2 or less. PH, PredictHaplo; QR, QuRe; RH, RegressHaplo
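
The matching rule in the Fig. 1 caption (a reconstructed haplotype recovers a simulated one when their Hamming distance is 2 or less) can be sketched as follows. The definitions used here are assumptions based on common usage, not taken from the paper: precision is the fraction of reconstructed haplotypes that match some simulated haplotype, and recall is the fraction of simulated haplotypes matched by some reconstruction. The function names and example sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def precision_recall(reconstructed, simulated, max_dist=2):
    """Precision and recall under a Hamming-distance matching threshold."""
    matched_recon = sum(
        any(hamming(r, s) <= max_dist for s in simulated) for r in reconstructed
    )
    matched_sim = sum(
        any(hamming(r, s) <= max_dist for r in reconstructed) for s in simulated
    )
    precision = matched_recon / len(reconstructed) if reconstructed else 0.0
    recall = matched_sim / len(simulated) if simulated else 0.0
    return precision, recall


# Made-up example: the first reconstruction is within distance 2 of two
# simulated haplotypes; the second reconstruction matches none.
simulated = ["ACGTACGTAC", "ACGTTCGTAC", "TTGTACGAAC"]
reconstructed = ["ACGTACGTAC", "GGGGGGGGGG"]
precision, recall = precision_recall(reconstructed, simulated)
print(precision, recall)
```

Under these assumed definitions, one reconstruction can count toward the recall of several nearby simulated haplotypes, which matters in low diversity settings where true haplotypes differ at only a few positions.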

**Fig. 2**
Precision versus recall of haplotype reconstruction in a high diversity setting. Datasets D5 and D6 had diversity levels of 1.6% and read error rates of 1.5% and 0%, respectively. Dataset D7 was identical to D6, except that a long conserved region was introduced, see text for details

**Fig. 3**
Precision versus recall of haplotype reconstruction for paired-end datasets. Datasets D8 and D9 had low diversity levels, identical to dataset D2, except that D8 and D9 were constructed with paired-end reads and D9 had a long conserved region inserted. The paired-end reads collectively covered 450 nucleotides versus 250 nucleotides covered by the single-end reads in D2. The panel shown for D2 is identical to the D2 panel in Figure 1

**Fig. 4**
Single position errors. For each variable position on the reference, we calculated the estimated and true frequencies of nucleotides and deletions according to the haplotype reconstructions and read pileups, respectively. Shown, for each dataset, is the 95% quantile (bar), maximum value (upper error bar), and 75% quantile (lower error bar) of the errors. We calculated error by summing the absolute value of the difference between the estimated and true frequencies. Each dataset is labeled as animal/week-diversity and the datasets are arranged from least (left) to most (right) diverse. For example, 156-11/0.3 represents the dataset of animal 156 at week 11 which had a diversity level of 0.3%
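
The per-position error described in the Fig. 4 caption can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: haplotypes are represented as aligned strings over `A`, `C`, `G`, `T` and `-` (deletion) mapped to estimated frequencies, and the read pileup is given as a per-position dictionary of allele frequencies. The function names and all numbers are hypothetical.

```python
from collections import defaultdict

ALLELES = "ACGT-"  # four nucleotides plus a deletion symbol


def haplotype_allele_freqs(haplotypes, position):
    """Allele frequencies at one position implied by a haplotype reconstruction."""
    freqs = defaultdict(float)
    for seq, freq in haplotypes.items():
        freqs[seq[position]] += freq
    return freqs


def single_position_error(haplotypes, pileup_freqs, position):
    """Sum over alleles of |estimated frequency - pileup frequency|."""
    est = haplotype_allele_freqs(haplotypes, position)
    true = pileup_freqs[position]
    return sum(abs(est.get(a, 0.0) - true.get(a, 0.0)) for a in ALLELES)


# Made-up reconstruction: aligned haplotype -> estimated frequency.
haplotypes = {"ACG": 0.7, "A-G": 0.2, "TCG": 0.1}
# Made-up pileup frequencies at two variable positions.
pileup_freqs = {
    0: {"A": 0.85, "T": 0.15},
    1: {"C": 0.8, "-": 0.2},
}

err0 = single_position_error(haplotypes, pileup_freqs, 0)
err1 = single_position_error(haplotypes, pileup_freqs, 1)
print(err0, err1)
```

Collecting these errors over all variable positions of a dataset would give the distribution whose quantiles and maximum the figure summarizes.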

**Fig. 5**
Paired position errors. For each dataset, we collected all pairs of variable positions that were simultaneously covered by at least 1000 reads. Shown are 95% quantile (bar), maximum value (upper error bar), and 75% quantile (lower error bar) of the errors for each dataset. See Figure 4 and the text for further details

**Fig. 6**
Single position errors for the dataset of animal 198 at week 6. Shown are the errors at each variable position. See Figure 4 for details on how errors were calculated. The QuRe and QuRe-base panels correspond to reconstruction with and without RegressHaplo error correction, respectively. The error shown in Figure 4 under 198/6-0.6 is the average of the position errors shown in this figure

Cited by

Reconstruction of Microbial Haplotypes by Integration of Statistical and Physical Linkage in Scaffolding.
Cao C, He J, Mak L, Perera D, Kwok D, Wang J, Li M, Mourier T, Gavriliuc S, Greenberg M, Morrissy AS, Sycuro LK, Yang G, Jeffares DC, Long Q. Mol Biol Evol. 2021 May 19;38(6):2660-2672. doi: 10.1093/molbev/msab037. PMID: 33547786.
Epidemiological data analysis of viral quasispecies in the next-generation sequencing era.
Knyazev S, Hughes L, Skums P, Zelikovsky A. Brief Bioinform. 2021 Jan 18;22(1):96-108. doi: 10.1093/bib/bbaa101. PMID: 32568371. Review.
V-pipe 3.0: a sustainable pipeline for within-sample viral genetic diversity estimation.
Fuhrmann L, Jablonski KP, Topolsky I, Batavia AA, Borgsmüller N, Baykal PI, Carrara M, Chen C, Dondi A, Dragan M, Dreifuss D, John A, Langer B, Okoniewski M, du Plessis L, Schmitt U, Singer F, Stadler T, Beerenwinkel N. Gigascience. 2024 Jan 2;13:giae065. doi: 10.1093/gigascience/giae065. PMID: 39347649.
The effects of genetic drift and genomic selection on differentiation and local adaptation of the introduced populations of Aedes albopictus in southern Russia.
Konorov EA, Yurchenko V, Patraman I, Lukashev A, Oyun N. PeerJ. 2021 Jul 21;9:e11776. doi: 10.7717/peerj.11776. eCollection 2021. PMID: 34327056.
An integrated software for virus community sequencing data analysis.
Wang M, Li J, Zhang X, Han Y, Yu D, Zhang D, Yuan Z, Yang Z, Huang J, Zhang X. BMC Genomics. 2020 May 15;21(1):363. doi: 10.1186/s12864-020-6744-4. PMID: 32414327.
