Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data

Niko Beerenwinkel¹, Huldrych F Günthard, Volker Roth, Karin J Metzner

PMID: 22973268
PMCID: PMC3438994
DOI: 10.3389/fmicb.2012.00329

Niko Beerenwinkel et al. Front Microbiol. 2012 Sep 11:3:329. doi: 10.3389/fmicb.2012.00329. eCollection 2012.

Affiliation

¹ Department of Biosystems Science and Engineering, ETH Zurich Basel, Switzerland.

Abstract

Many viruses, including the clinically relevant RNA viruses HIV (human immunodeficiency virus) and HCV (hepatitis C virus), exist in large populations and display high genetic heterogeneity within and between infected hosts. Assessing intra-patient viral genetic diversity is essential for understanding the evolutionary dynamics of viruses, for designing effective vaccines, and for the success of antiviral therapy. Next-generation sequencing (NGS) technologies allow the rapid and cost-effective acquisition of thousands to millions of short DNA sequences from a single sample. However, this approach entails several challenges in experimental design and computational data analysis. Here, we review the entire process of inferring viral diversity from sample collection to computing measures of genetic diversity. We discuss sample preparation, including reverse transcription and amplification, and the effect of experimental conditions on diversity estimates due to in vitro base substitutions, insertions, deletions, and recombination. The use of different NGS platforms and their sequencing error profiles are compared in the context of various applications of diversity estimation, ranging from the detection of single nucleotide variants (SNVs) to the reconstruction of whole-genome haplotypes. We describe the statistical and computational challenges arising from these technical artifacts, and we review existing approaches, including available software, for their solution. Finally, we discuss open problems, and highlight successful biomedical applications and potential future clinical use of NGS to estimate viral diversity.

Keywords: bioinformatics; error correction; haplotype inference; next-generation sequencing; quasispecies assembly; statistics; viral diversity; viral quasispecies.

Figures

**Figure 1**
**Flow chart of sample processing for next-generation sequencing (NGS) of virus samples**.

**Figure 2**
**Spatial scales of diversity estimation from NGS data.** In this example, it is assumed that the true virus population (top of figure) consists of three haplotypes of relative frequencies 60% (A, blue), 30% (B, orange), and 10% (C, green). Segregating sites are indicated by arrows. Twenty short reads (labeled 1 through 20) are generated by NGS from the virus population subject to sequencing errors (indicated in magenta). Reads are displayed in a MSA and in the color of their corresponding parental haplotype. Diversity estimation can be approached at single sites (SNV detection, solid-line rectangle), in windows of the MSA (local haplotype inference, dashed-line rectangle), or over the entire genomic region (global haplotype reconstruction, dotted-line rectangle).

**Figure 3**
**Local read clustering.** The local window of the MSA displayed in Figure 2 is considered (dashed-line rectangle), with colors defined as in Figure 2. Reads that are more similar to each other than to other reads are grouped together, which recovers the three original haplotypes A, B, and C of Figure 2 as indicated by the three different colors. Each cluster center sequence is a predicted haplotype (bold, underlined) and the size of its corresponding cluster is an estimate of the frequency of the haplotype (here, 4/9, 3/9, and 2/9, respectively).
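
The clustering idea in this caption can be illustrated with a minimal sketch. It is not the method of any particular tool reviewed in the article: the greedy assignment rule, the Hamming-distance threshold `max_dist`, and the function names are all assumptions chosen for illustration, and the reads are assumed to be already aligned and trimmed to the same window.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def consensus(reads):
    """Column-wise majority base over equal-length reads (the cluster center)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def cluster_reads(reads, max_dist=1):
    """Greedy clustering (an illustrative assumption, not a published algorithm).

    Each read joins the first cluster whose current consensus is within
    max_dist mismatches; otherwise it starts a new cluster. Returns a list of
    (predicted haplotype, estimated frequency) pairs.
    """
    clusters = []
    for read in reads:
        for members in clusters:
            if hamming(read, consensus(members)) <= max_dist:
                members.append(read)
                break
        else:
            clusters.append([read])
    total = len(reads)
    return [(consensus(members), len(members) / total) for members in clusters]
```

Applied to nine window reads drawn from three haplotypes in proportions 4, 3, and 2, with one read carrying a single sequencing error, this sketch returns three cluster centers with frequency estimates 4/9, 3/9, and 2/9. The result depends on read order and on the threshold, which is one reason practical tools use probabilistic clustering instead.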

**Figure 4**
**Read graph-based global haplotype reconstruction.** Shown is the read graph for the first 15 reads of the MSA shown in Figure 2. Each read is represented by its index and colored according to its parental haplotype (A, blue, first row; B, orange, second row; and C, green, third row). Reads are connected by a directed edge if they agree on their non-empty overlap. Each path from the begin node to the end node represents a potential global haplotype, but there are more paths in the graph than the original three haplotypes the reads have been derived from.
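
A small sketch may help show why the read graph contains more paths than true haplotypes. The representation below is an assumption made for illustration: each read is a `(start, sequence)` pair on a common coordinate system, an edge requires the second read to start inside the first and extend past it, and begin and end nodes are implied by reads that start at position 0 or end at the region length. Real read-graph methods handle contained reads and redundant edges more carefully.

```python
def agree_on_overlap(a, b):
    """True if read b starts inside read a, extends past it, and matches on the overlap."""
    (sa, xa), (sb, xb) = a, b
    ea, eb = sa + len(xa), sb + len(xb)
    if not (sa < sb < ea < eb):
        return False
    return xa[sb - sa:] == xb[:ea - sb]

def build_read_graph(reads):
    """Adjacency lists: read index -> indices of reads it can be extended by."""
    return {i: [j for j, b in enumerate(reads) if agree_on_overlap(a, b)]
            for i, a in enumerate(reads)}

def enumerate_haplotypes(reads, length):
    """Spell out the sequence of every path from a begin read to an end read."""
    graph = build_read_graph(reads)
    found = set()

    def extend(i, prefix):
        start, seq = reads[i]
        hap = prefix[:start] + seq  # the overlap agrees, so overwriting it is safe
        if start + len(seq) == length:
            found.add(hap)
            return
        for j in graph[i]:
            extend(j, hap)

    for i, (start, _) in enumerate(reads):
        if start == 0:
            extend(i, "")
    return found
```

With two haplotypes that differ only at their first and last positions, for example `ACGTACGT` and `TCGTACGA`, and four reads `(0, "ACGTA")`, `(0, "TCGTA")`, `(3, "TACGT")`, `(3, "TACGA")`, every left read agrees with every right read on the shared overlap, so the sketch yields four candidate haplotypes although only two are real. Selecting a plausible subset of paths is the actual reconstruction problem.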

**Figure 5**
**Probabilistic global haplotype reconstruction using a generative mixture model.** Each of the three haplotypes colored as in Figure 2 (A, blue; B, orange; and C, green) is represented as a chain of probability tables over the four nucleotides, where darker shading of a base indicates higher probability. The probabilities of traversing from the begin node to one of the haplotypes serve as an estimate for the haplotype frequencies. Each read is regarded as an independent observation from this statistical model.
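
To make the mixture-model picture concrete, the following sketch computes how strongly a single read supports each haplotype. It is an illustration under stated assumptions rather than the model of any specific software: each haplotype is a list of per-position probability tables (dictionaries over A, C, G, T), the helper `tables_from_sequence` and its uniform error rate `eps` are hypothetical, and the read is assumed to be aligned at a known start position.

```python
def tables_from_sequence(seq, eps=0.03):
    """Hypothetical helper: one probability table per position, with mass
    1 - eps on the haplotype base and eps / 3 on each other base."""
    return [{b: (1 - eps) if b == base else eps / 3 for b in "ACGT"}
            for base in seq]

def joint_probabilities(read, start, haplotypes, freqs):
    """For each haplotype k: freqs[k] * P(read | haplotype k), treating
    positions as independent given the haplotype."""
    joint = []
    for tables, pi in zip(haplotypes, freqs):
        p = pi
        for offset, base in enumerate(read):
            p *= tables[start + offset][base]
        joint.append(p)
    return joint

def responsibilities(read, start, haplotypes, freqs):
    """Posterior probability that the read was generated by each haplotype."""
    joint = joint_probabilities(read, start, haplotypes, freqs)
    total = sum(joint)  # likelihood of the read under the whole mixture
    return [p / total for p in joint]
```

For instance, with haplotypes `ACGT` and `ACCT` at frequencies 0.6 and 0.4, a read `CG` starting at position 1 is assigned almost entirely to the first haplotype. In a full inference procedure such responsibilities would be recomputed while the tables and frequencies are updated, for example by expectation-maximization or sampling.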
