Bioinformatics. 2014 Jun 15;30(12):i329-37.
doi: 10.1093/bioinformatics/btu295.

Accurate viral population assembly from ultra-deep sequencing data


Serghei Mangul et al. Bioinformatics.

Abstract

Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors.

Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation-maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads.
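As a rough illustration of the abundance-estimation idea mentioned above, the following is a minimal expectation-maximization sketch. It is not VGA's implementation: the function name `em_abundances`, the read-to-genome compatibility lists, and the toy data are assumptions made for this example, and the model ignores genome length and any read-generation details.

```python
def em_abundances(compatible, n_genomes, n_iter=200, tol=1e-9):
    """Estimate relative genome abundances with a simple EM loop.

    compatible[i] lists the genome indices that read i is consistent with
    (hypothetical input format; every read must match at least one genome).
    """
    freqs = [1.0 / n_genomes] * n_genomes  # start from uniform abundances
    for _ in range(n_iter):
        # E-step: split each read among its compatible genomes in
        # proportion to the current abundance estimates.
        counts = [0.0] * n_genomes
        for genomes in compatible:
            total = sum(freqs[g] for g in genomes)
            for g in genomes:
                counts[g] += freqs[g] / total
        # M-step: new abundance is the expected share of reads per genome.
        new_freqs = [c / len(compatible) for c in counts]
        converged = max(abs(a - b) for a, b in zip(freqs, new_freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Toy data: 6 reads unique to genome 0, 2 unique to genome 1, 4 shared.
compatible = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
freqs = em_abundances(compatible, n_genomes=2)
print(freqs)  # approximately [0.75, 0.25]
```

In this toy case the shared reads end up split 3:1, matching the ratio of the uniquely assigned reads.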

Availability: Our tool VGA is freely available at http://genetics.cs.ucla.edu/vga/


Figures

Fig. 1.
Overview of high-fidelity sequencing protocol. (A) DNA material from a viral population is cleaved into sequence fragments using any suitable restriction enzyme. (B) Individual barcode sequences are attached to the fragments. Each tagged fragment is amplified by the polymerase chain reaction (PCR). (C) Amplified fragments are then sequenced. (D) Reads are grouped according to the fragment of origin based on their individual barcode sequence. An error-correction protocol is applied to every read group, correcting the sequencing errors inside the group and producing corrected consensus reads. (E) Error-corrected reads are mapped to the population consensus. (F) SNVs are detected and assembled into individual viral genomes. The ordinary protocol lacks steps (B) and (D).
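Step (D) of the protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' error-correction protocol: it assumes reads in a barcode group are equal-length and already aligned, uses a plain per-column majority vote, and the `min_group_size` threshold is a hypothetical parameter.

```python
from collections import Counter, defaultdict


def consensus_reads(tagged_reads, min_group_size=2):
    """Group reads by barcode and build one majority-vote consensus per group.

    tagged_reads: iterable of (barcode, sequence) pairs.
    min_group_size: hypothetical cutoff; smaller groups are discarded.
    """
    groups = defaultdict(list)
    for barcode, seq in tagged_reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        if len(seqs) < min_group_size:
            continue  # too few copies to out-vote a sequencing error
        # Majority vote, column by column, across the read group.
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    ("AAC", "ACGTAC"),
    ("AAC", "ACGTAC"),
    ("AAC", "ACCTAC"),  # one sequencing error at position 2
    ("GTT", "ACGAAC"),
    ("GTT", "ACGAAC"),
    ("CCA", "TTTTTT"),  # singleton group, dropped
]
result = consensus_reads(reads)
print(result)  # {'AAC': 'ACGTAC', 'GTT': 'ACGAAC'}
```

The erroneous base in the third read is out-voted by the two correct copies that carry the same barcode.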
Fig. 2.
Overview of VGA. (A) The algorithm takes as input paired-end reads that have been mapped to the population consensus. (B) The first step in the assembly is to determine pairs of conflicting reads that carry different SNVs in the overlapping region. Pairs of conflicting reads are connected in the ‘conflict graph’. Each read has a node in the graph, and an edge is placed between each pair of conflicting reads. (C) The graph is colored with a minimal set of colors to distinguish between genome variants in the population. Colors of the graph correspond to independent sets of non-conflicting reads that are assembled into genome variants. In this example, the conflict graph can be minimally colored with four colors (red, green, violet and turquoise), each representing an individual viral genome. (D) Reads of the same color are then assembled into individual viral genomes. Only fully covered viral genomes are reported. (E) Reads are assigned to assembled viral genomes. A read may be shared across two or more viral genomes. VGA infers relative abundances of viral genomes using the expectation–maximization algorithm. (F) Long conserved regions are detected and phased based on expression profiles. In this example, the red and green viral genomes share a long conserved region (colored in black). There is no direct evidence of how the viral sub-genomes across the conserved region should be connected. In this example, four possible phasings are valid. VGA uses the expression information of every sub-genome to resolve ambiguous phasing.
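Steps (B) and (C) can be sketched roughly as below. The read representation (a `start`/`end` interval plus an `snvs` dictionary of position to allele) is an assumption made for illustration, and the coloring uses a simple greedy heuristic rather than VGA's own procedure; greedy coloring does not guarantee a minimal number of colors in general.

```python
from itertools import combinations


def conflicts(r1, r2):
    """Two reads conflict if they disagree at any SNV position they both cover."""
    lo = max(r1["start"], r2["start"])
    hi = min(r1["end"], r2["end"])
    positions = {p for p in set(r1["snvs"]) | set(r2["snvs"]) if lo <= p < hi}
    # A missing key means the read carries the consensus allele there.
    return any(r1["snvs"].get(p) != r2["snvs"].get(p) for p in positions)


def greedy_coloring(reads):
    """Build the conflict graph and color it greedily, highest degree first."""
    n = len(reads)
    adjacency = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if conflicts(reads[i], reads[j]):
            adjacency[i].add(j)
            adjacency[j].add(i)

    colors = {}
    for node in sorted(adjacency, key=lambda v: -len(adjacency[v])):
        used = {colors[nb] for nb in adjacency[node] if nb in colors}
        color = 0
        while color in used:  # smallest color not used by a neighbor
            color += 1
        colors[node] = color
    return colors


# Toy reads; coordinates are half-open intervals on the consensus.
reads = [
    {"start": 0, "end": 100, "snvs": {50: "A"}},
    {"start": 40, "end": 140, "snvs": {50: "A", 120: "T"}},
    {"start": 0, "end": 100, "snvs": {50: "G"}},
    {"start": 40, "end": 140, "snvs": {50: "G"}},
    {"start": 110, "end": 200, "snvs": {120: "T"}},
]
colors = greedy_coloring(reads)
# Reads 0, 1 and 4 share one color; reads 2 and 3 share the other.
print(colors)
```

Each color class is a set of mutually non-conflicting reads, which is the property needed before reads of one color are assembled into a candidate genome.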
Fig. 3.
Genomic architecture of 44 real HCV viral genomes from a 1739-bp-long fragment of the E1E2 region. The length of the longest common region shared between any two viral genomes is represented by color.
Fig. 4.
Accuracy of population size prediction. Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 5–10%. Viral genome abundances follow power-law and uniform distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from the HIV population.
Fig. 5.
Assembly accuracy estimation. Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 3–20%. Viral genome abundances follow power-law and uniform distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from the HIV population.
Fig. 6.
Assembly accuracy estimation. Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 3–20%. Viral genome abundances follow power-law and uniform distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from the HIV population.
Fig. 7.
Assembly accuracy estimation. Consensus error-corrected paired-end reads of various lengths were simulated from a mixture of 10 real viral clones from a 1.3-kb-long HIV-1 region. Assembly accuracy is measured by PPV and sensitivity. Results are for 50 000 reads; no improvement was observed when increasing the number of reads.
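For reference, the two accuracy measures named in this caption can be computed as in the sketch below. It assumes the standard definitions (PPV = TP / (TP + FP), sensitivity = TP / (TP + FN)) and counts an assembled genome as a true positive only when it exactly matches a true genome; that matching criterion and the toy sequences are assumptions for illustration.

```python
def ppv_and_sensitivity(assembled, truth):
    """Return (PPV, sensitivity) for a set of assembled genomes.

    A true positive is an assembled genome that exactly matches a true one
    (exact matching is an assumption of this sketch).
    """
    assembled, truth = set(assembled), set(truth)
    tp = len(assembled & truth)
    ppv = tp / len(assembled) if assembled else 0.0
    sensitivity = tp / len(truth) if truth else 0.0
    return ppv, sensitivity


truth = {"ACGT", "ACGA", "TCGT", "TCGA"}
assembled = {"ACGT", "ACGA", "GGGG"}
ppv, sensitivity = ppv_and_sensitivity(assembled, truth)
print(ppv, sensitivity)  # PPV is 2/3, sensitivity is 0.5
```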
