Bioinformatics. 2017 Aug 15;33(16):2455-2463.
doi: 10.1093/bioinformatics/btx187.

A penalized regression approach to haplotype reconstruction of viral populations arising in early HIV/SIV infection

Sivan Leviyang et al. Bioinformatics. 2017.

Abstract

Motivation: Next generation sequencing (NGS) has been increasingly applied to characterize viral evolution during HIV and SIV infections. In particular, NGS datasets sampled during the initial months of infection are characterized by relatively low levels of diversity as well as convergent evolution at multiple loci dispersed across the viral genome. Consequently, fully characterizing viral evolution from NGS datasets requires haplotype reconstruction across large regions of the viral genome. Existing haplotype reconstruction algorithms have not been developed with the particular characteristics of early HIV/SIV infection in mind, raising the possibility that better performance could be achieved through a specifically designed algorithm.

Results: Here, we introduce a haplotype reconstruction algorithm, RegressHaplo, specifically designed for low diversity and convergent evolution regimes. The algorithm uses a penalized regression that balances a data fitting term with a penalty term that encourages solutions with few haplotypes. The regression covariates are a large set of potential haplotypes and fitting the regression is made computationally feasible by the low diversity setting. Using simulated and in vivo datasets, we compare RegressHaplo to PredictHaplo and QuRe, two existing haplotype reconstruction algorithms. RegressHaplo performs better than these algorithms on simulated datasets with relatively low diversity levels. We suggest RegressHaplo as a novel tool for the investigation of early infection HIV/SIV datasets and, more generally, low diversity viral NGS datasets.
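The trade-off described in the Results can be illustrated with a toy objective. The sketch below is not RegressHaplo's actual formulation: the squared-error fit over per-position nucleotide frequencies, the count-of-haplotypes penalty, the weight `rho`, and all function names are assumptions chosen only to show how a penalty term can favor a solution with fewer haplotypes when two solutions fit the data equally well.

```python
NUCLEOTIDES = "ACGT"


def predicted_frequencies(solution, length):
    """Per-position nucleotide frequencies implied by a haplotype solution.

    `solution` maps haplotype strings to frequencies (hypothetical format).
    """
    freqs = [{n: 0.0 for n in NUCLEOTIDES} for _ in range(length)]
    for haplotype, weight in solution.items():
        for pos, nucleotide in enumerate(haplotype):
            freqs[pos][nucleotide] += weight
    return freqs


def fit_term(solution, observed):
    """Sum of squared differences between predicted and observed frequencies."""
    predicted = predicted_frequencies(solution, len(observed))
    return sum(
        (predicted[pos][n] - observed[pos].get(n, 0.0)) ** 2
        for pos in range(len(observed))
        for n in NUCLEOTIDES
    )


def objective(solution, observed, rho):
    """Data fit plus an assumed penalty: rho times the number of haplotypes used."""
    n_haplotypes = sum(1 for weight in solution.values() if weight > 0)
    return fit_term(solution, observed) + rho * n_haplotypes


# Toy data: two variable positions, each observed separately.
observed = [{"A": 0.7, "G": 0.3}, {"C": 0.7, "T": 0.3}]

# Two solutions that match the single-position frequencies equally well.
sparse = {"AC": 0.7, "GT": 0.3}
dense = {"AC": 0.49, "AT": 0.21, "GC": 0.21, "GT": 0.09}

for name, solution in (("sparse", sparse), ("dense", dense)):
    print(name, fit_term(solution, observed), objective(solution, observed, rho=0.1))
```

Both toy solutions reproduce the observed frequencies, so only the penalty separates them, and the two-haplotype solution gets the lower objective value.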

Contact: sr286@georgetown.edu.

Availability and implementation: https://github.com/SLeviyang/RegressHaplo.

Figures

Fig. 1
Precision versus recall of haplotype reconstruction in a low diversity setting. Datasets D1–D4 had identical levels of diversity, 0.6%, and were identical in all other parameters except for error rates of 1.5%, 1.0%, 0.5%, and 0%, respectively. Each data point and the number above it represent the recall/precision and number of haplotypes (rounded), respectively, averaged over the 10 simulations in the dataset. Cross-bars on each data point give the precision and recall SEs. For D1–D3, results annotated with an F are PredictHaplo and QuRe reconstructions using RegressHaplo’s error correction step; see text for details. Cross-bars for these results have been suppressed for readability but were similar to results without RegressHaplo error correction. A reconstructed haplotype was counted as recovering a simulated haplotype if the Hamming distance between the two was 2 or less. PH, PredictHaplo; QR, QuRe; RH, RegressHaplo.
Fig. 2
Precision versus recall of haplotype reconstruction in a high diversity setting. Datasets D5 and D6 had diversity levels of 1.6% and read error rates of 1.5% and 0%, respectively. Dataset D7 was identical to D6, except that a long conserved region was introduced; see text for details.
Fig. 3
Precision versus recall of haplotype reconstruction for paired-end datasets. Datasets D8 and D9 had low diversity levels, identical to dataset D2, except that D8 and D9 were constructed with paired-end reads and D9 had a long conserved region inserted. The paired-end reads collectively covered 450 nucleotides versus 250 nucleotides covered by the single-end reads in D2. The panel shown for D2 is identical to the D2 panel in Figure 1.
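The matching rule stated in the Figure 1 caption (Hamming distance of 2 or less) can be turned into a precision and recall calculation roughly as follows. This is a sketch under assumptions: recall is taken as the fraction of simulated haplotypes matched by at least one reconstructed haplotype, precision as the fraction of reconstructed haplotypes that match at least one simulated haplotype, and the sequences and function names are invented for illustration. The paper's exact definitions may differ.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def precision_recall(reconstructed, simulated, max_distance=2):
    """Precision and recall under a Hamming-distance matching threshold."""
    if not reconstructed or not simulated:
        return 0.0, 0.0
    # Simulated haplotypes that some reconstructed haplotype recovers.
    recovered = sum(
        1 for s in simulated
        if any(hamming(s, r) <= max_distance for r in reconstructed)
    )
    # Reconstructed haplotypes that lie close to some simulated haplotype.
    matching = sum(
        1 for r in reconstructed
        if any(hamming(r, s) <= max_distance for s in simulated)
    )
    return matching / len(reconstructed), recovered / len(simulated)


# Invented toy sequences, not data from the paper.
simulated = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"]
reconstructed = ["AAAAAAAT", "CCCCCCTT", "TTTTTTTT", "AAAAAATT"]

precision, recall = precision_recall(reconstructed, simulated)
print(precision, recall)
```

In this toy case three of the four reconstructed haplotypes lie within distance 2 of a simulated haplotype, and two of the three simulated haplotypes are recovered.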
Fig. 4
Single position errors. For each variable position on the reference, we calculated the estimated and true frequencies of nucleotides and deletions according to the haplotype reconstructions and read pileups, respectively. Shown, for each dataset, is the 95% quantile (bar), maximum value (upper error bar), and 75% quantile (lower error bar) of the errors. We calculated error by summing the absolute value of the difference between the estimated and true frequencies. Each dataset is labeled as animal/week-diversity and the datasets are arranged from least (left) to most (right) diverse. For example, 156-11/0.3 represents the dataset of animal 156 at week 11 which had a diversity level of 0.3%.
Fig. 5
Paired position errors. For each dataset, we collected all pairs of variable positions that were simultaneously covered by at least 1000 reads. Shown are 95% quantile (bar), maximum value (upper error bar), and 75% quantile (lower error bar) of the errors for each dataset. See Figure 4 and the text for further details.
Fig. 6
Single position errors for the dataset of animal 198 at week 6. Shown are the errors at each variable position. See Figure 4 for details on how errors were calculated. The QuRe and QuRe-base panels correspond to reconstruction with and without RegressHaplo error correction, respectively. The error shown in Figure 4 under 198/6-0.6 is the average of the position errors shown in this figure
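The single position error described in the Figure 4 caption (the sum of absolute differences between estimated and true frequencies) can be sketched as below. The input formats are assumptions made for illustration: a reconstruction is a dictionary mapping aligned haplotype strings to frequencies, a pileup column is a string of the symbols observed in reads at one position, and "-" denotes a deletion. The function names and toy values are hypothetical.

```python
from collections import Counter

SYMBOLS = "ACGT-"  # four nucleotides plus a deletion symbol (assumed encoding)


def haplotype_frequencies(haplotypes, position):
    """Estimated symbol frequencies at one position from a reconstruction."""
    freqs = dict.fromkeys(SYMBOLS, 0.0)
    for sequence, weight in haplotypes.items():
        freqs[sequence[position]] += weight
    return freqs


def pileup_frequencies(column):
    """'True' symbol frequencies at one position from a read pileup column."""
    if not column:
        raise ValueError("pileup column is empty")
    counts = Counter(column)
    return {s: counts[s] / len(column) for s in SYMBOLS}


def position_error(estimated, true):
    """Sum of absolute differences between estimated and true frequencies."""
    return sum(abs(estimated[s] - true[s]) for s in SYMBOLS)


# Invented toy values, not data from the paper.
haplotypes = {"AC": 0.5, "GC": 0.5}
column = "AAAG"  # reads covering position 0: three A and one G

error = position_error(
    haplotype_frequencies(haplotypes, 0),
    pileup_frequencies(column),
)
print(error)
```

Here the reconstruction puts A and G at 0.5 each while the pileup gives 0.75 and 0.25, so each of the two symbols contributes 0.25 to the error.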
