Bioinformatics. 2014 Jun 15;30(12):i329-37.

doi: 10.1093/bioinformatics/btu295.

Accurate viral population assembly from ultra-deep sequencing data

Serghei Mangul¹, Nicholas C Wu¹, Nicholas Mancuso¹, Alex Zelikovsky¹, Ren Sun¹, Eleazar Eskin²

Affiliations

¹ Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA.
² Computer Science Department, Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA 90095, USA, Department of Computer Science, Georgia State University, Atlanta, GA, 30303 and Department of Human Genetics, University of California, Los Angeles, CA 90095, USA.

PMID: 24932001
PMCID: PMC4058922
DOI: 10.1093/bioinformatics/btu295

Serghei Mangul et al. Bioinformatics. 2014.

Abstract

Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors.

Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation-maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads.
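The abundance-estimation step mentioned in the abstract can be illustrated with a minimal expectation-maximization sketch. This is not VGA's implementation: the function name `em_abundances`, the input layout (one list of compatible genome indices per read), and the assumption that all genomes have equal length and are sampled uniformly are introduced here purely for illustration.

```python
def em_abundances(read_to_genomes, n_genomes, n_iter=200, tol=1e-12):
    """Estimate relative genome abundances from ambiguous read assignments.

    read_to_genomes: one entry per read, listing the indices of the
    assembled genomes that the read is compatible with (hypothetical layout).
    Assumes equal-length genomes and uniform read sampling.
    """
    theta = [1.0 / n_genomes] * n_genomes  # start from a uniform guess
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        # E-step: split each read among its compatible genomes in
        # proportion to the current abundance estimates.
        for genomes in read_to_genomes:
            total = sum(theta[g] for g in genomes)
            if total == 0.0:
                continue
            for g in genomes:
                counts[g] += theta[g] / total
        n_assigned = sum(counts)
        if n_assigned == 0.0:
            return theta
        # M-step: new abundances are the normalized fractional read counts.
        new_theta = [c / n_assigned for c in counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


# Six reads unique to genome 0, two unique to genome 1, two shared by both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 2
abundances = em_abundances(reads, n_genomes=2)
```

In this toy input the shared reads are gradually attributed to the genome that the unique reads already favor, which is the behavior one would expect from an EM-based abundance estimate.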

Availability: Our tool VGA is freely available at http://genetics.cs.ucla.edu/vga/

Figures

**Fig. 1.**
Overview of high-fidelity sequencing protocol. (A) DNA material from a viral population is cleaved into sequence fragments using any suitable restriction enzyme. (B) Individual barcode sequences are attached to the fragments. Each tagged fragment is amplified by the polymerase chain reaction (PCR). (C) Amplified fragments are then sequenced. (D) Reads are grouped according to the fragment of origin based on their individual barcode sequence. An error-correction protocol is applied for every read group, correcting the sequencing errors inside the group and producing corrected consensus reads. (E) Error-corrected reads are mapped to the population consensus. (F) SNVs are detected and assembled into individual viral genomes. The ordinary protocol lacks steps (B) and (D)
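Step (D) of the protocol can be sketched as a per-barcode majority vote. The caption does not specify the actual error-correction rule, so the majority vote, the `min_group_size` threshold, and the assumption that reads within a group start at the same position are all illustrative choices rather than details of the published protocol.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(tagged_reads, min_group_size=3):
    """Group reads by barcode and build one consensus read per group.

    tagged_reads: iterable of (barcode, sequence) pairs (hypothetical layout).
    Reads in a group are assumed to start at the same fragment position.
    Groups smaller than min_group_size are dropped because a majority
    vote over very few reads cannot separate errors from true bases.
    """
    groups = defaultdict(list)
    for barcode, seq in tagged_reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        if len(seqs) < min_group_size:
            continue
        length = min(len(s) for s in seqs)  # compare only the shared prefix
        consensus[barcode] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


tagged = [
    ("AAT", "ACGT"),
    ("AAT", "ACGT"),
    ("AAT", "ACTT"),  # one read carries a sequencing error at position 2
    ("GGC", "TTTT"),  # singleton group, too small to correct
]
corrected = consensus_by_barcode(tagged)
```

A true variant is present in every copy of the original fragment, so it survives the vote, while an error introduced during sequencing affects only a minority of reads in the group and is voted out.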

**Fig. 2.**
Overview of VGA. (A) The algorithm takes as input paired-end reads that have been mapped to the population consensus. (B) The first step in the assembly is to determine pairs of conflicting reads, that is, reads that carry different SNVs in their overlapping region. Pairs of conflicting reads are connected in the ‘conflict graph’. Each read has a node in the graph, and an edge is placed between each pair of conflicting reads. (C) The graph is colored with a minimal set of colors to distinguish between genome variants in the population. Colors of the graph correspond to independent sets of non-conflicting reads that are assembled into genome variants. In this example, the conflict graph can be minimally colored with four colors (red, green, violet and turquoise), each representing an individual viral genome. (D) Reads of the same color are then assembled into individual viral genomes. Only fully covered viral genomes are reported. (E) Reads are assigned to assembled viral genomes. A read may be shared across two or more viral genomes. VGA infers relative abundances of viral genomes using the expectation–maximization algorithm. (F) Long conserved regions are detected and phased based on expression profiles. In this example, the red and green viral genomes share a long conserved region (colored in black). There is no direct evidence of how the viral sub-genomes on either side of the conserved region should be connected. In this example, four possible phasings are valid. VGA uses the expression information of every sub-genome to resolve ambiguous phasing
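Steps (B) and (C) can be sketched as follows. Finding a minimal coloring is NP-hard in general, and the caption does not say how VGA solves it, so this sketch substitutes a simple greedy heuristic that is not guaranteed to be minimal. The read representation (a dict from reference position to observed base) and all function names are assumptions made for illustration.

```python
from itertools import combinations


def conflicts(read_a, read_b):
    """Two reads conflict if they disagree at any position both cover."""
    shared = read_a.keys() & read_b.keys()
    return any(read_a[pos] != read_b[pos] for pos in shared)


def build_conflict_graph(reads):
    """Return an adjacency map with an edge between each conflicting pair."""
    graph = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if conflicts(reads[i], reads[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_coloring(graph):
    """Greedy heuristic: visit nodes by decreasing degree and give each
    the smallest color not used by an already-colored neighbor."""
    colors = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Each read maps covered SNV positions to the observed base (toy data).
reads = [
    {10: "A", 20: "C"},
    {20: "C", 30: "G"},
    {10: "T", 20: "C"},
    {20: "C", 30: "T"},
]
graph = build_conflict_graph(reads)
coloring = greedy_coloring(graph)
```

Reads that receive the same color never conflict with one another, so each color class is a candidate set of reads for one genome variant. The pairwise comparison here is quadratic in the number of reads and is meant only to show the idea, not to scale to millions of reads.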

**Fig. 3.**
Genomic architecture of 44 real HCV viral genomes from a 1739-bp-long fragment of the E1E2 region. The length of the longest common region shared between any two viral genomes is represented by color

**Fig. 4.**
Accuracy of population size prediction. Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 5–10%. Viral genome abundances follow power-law and uniform distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from the HIV population

**Fig. 5.**
Assembly accuracy estimation. Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 3–20%. Viral genome abundances follow power-law and uniform distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from the HIV population

**Fig. 6.**
Assembly accuracy estimation. Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 3–20%. Viral genome abundances follow power-law and uniform distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from HIV population

**Fig. 7.**
Assembly accuracy estimation. Consensus error-corrected paired-end reads of various lengths were simulated from a mixture of 10 real viral clones from a 1.3-kb-long HIV-1 region. Assembly accuracy is measured by PPV and sensitivity. Results are shown for 50 000 reads; no improvement was observed when increasing the number of reads
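PPV (positive predictive value) is the fraction of assembled genomes that are correct, and sensitivity is the fraction of true genomes that were recovered. The sketch below counts an assembled genome as correct only if it matches a true genome exactly; that matching rule and the function name are assumptions, since the caption does not state the criterion used.

```python
def ppv_and_sensitivity(assembled, true_genomes):
    """Compute PPV = TP / (TP + FP) and sensitivity = TP / (TP + FN).

    An assembled genome counts as a true positive only on an exact
    sequence match with a true genome (an illustrative assumption).
    """
    assembled_set = set(assembled)
    true_set = set(true_genomes)
    true_positives = len(assembled_set & true_set)
    ppv = true_positives / len(assembled_set) if assembled_set else 0.0
    sensitivity = true_positives / len(true_set) if true_set else 0.0
    return ppv, sensitivity


# Toy sequences: three of four assembled genomes are among five true ones.
ppv, sensitivity = ppv_and_sensitivity(
    ["AAA", "AAC", "AAG", "TTT"],
    ["AAA", "AAC", "AAG", "CCC", "GGG"],
)
```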
