Genome Res. 2017 May;27(5):835-848.
doi: 10.1101/gr.215038.116. Epub 2017 Apr 10.

De novo assembly of viral quasispecies using overlap graphs

Jasmijn A Baaijens et al. Genome Res. 2017 May.

Abstract

A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE makes use of either FM-index-based data structures or an ad hoc consensus reference sequence for constructing overlap graphs from patient sample data. In this overlap graph, nodes represent reads and/or contigs, while edges reflect that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm that is based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and on real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run with an ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE to two deep-coverage samples of patients infected by the Zika and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.
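
To make the notion of an overlap graph concrete, here is a minimal sketch in Python. It is not SAVAGE's implementation: it assumes error-free reads, exact suffix-prefix matching, and a brute-force all-pairs comparison, whereas the abstract describes FM-index-based structures and statistically grounded edge decisions. The function names and the toy reads are illustrative assumptions.

```python
from itertools import permutations


def longest_exact_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, max(min_len, 1) - 1, -1):
        if a[len(a) - length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_len):
    """Nodes are read indices; graph[i][j] is the overlap length for edge i -> j."""
    graph = {i: {} for i in range(len(reads))}
    # Brute-force all ordered pairs: fine for a toy example, quadratic in general.
    for i, j in permutations(range(len(reads)), 2):
        length = longest_exact_overlap(reads[i], reads[j], min_len)
        if length:
            graph[i][j] = length
    return graph


# Toy reads (hypothetical data): read 0 overlaps read 1, which overlaps read 2.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_len=3)
print(graph)
```

A real quasispecies assembler must additionally tolerate sequencing errors while keeping true strain-specific mutations apart, which is what the statistical edge criteria in the paper are for.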

Figures

Figure 1.
An overview of the workflow and algorithms of SAVAGE. (A) The three stages of SAVAGE. Each assembles sequences into longer sequences. For clarity, we assign different names to the sequences output by each stage: contigs, maximally extended contigs, and master contigs, respectively. (B) Principle of overlap graph construction and distinction among the reads between errors and shared mutations. (C) Each stage has two steps: first, the overlap graph construction; second, assembly. This panel summarizes the differences in each step between the three stages. During overlap graph-based assembly, steps 4 to 6 are repeated iteratively until there are no edges left in the overlap graph.
Figure 2.
Target genome fraction recovered per strain for all 20,000× benchmarks, stratified by strain frequency.
Figure 3.
Performance of SAVAGE-de-novo and SAVAGE-b-ref, depending on pairwise distance and mixture ratio. (A) Target genome fraction recovered (%) considering all maximally extended contigs ≥500 bp. (B) Overall mismatch rate (%) considering all maximally extended contigs ≥500 bp. (C) Relative error of estimated frequency for the minor strain (%). Frequency estimates were computed using Kallisto, and only assemblies containing exactly two maximally extended contigs longer than 4000 bp were evaluated.
Figure 4.
Edge criteria. For an overlap to become an edge in the overlap graph, it must satisfy three criteria. First, the overlap length l must be at least the minimal overlap length L. Second, the overlap quality score QS(R1, R2) must be at least the minimal score δ. For overlaps involving paired-end reads, we require both l1 ≥ L and l2 ≥ L, and, analogously, QS(R1a, R2a) ≥ δ and QS(R1b, R2b) ≥ δ. Finally, we only accept overlaps where the sequence orientations of a paired-end read agree: either both sequences in forward orientation, or both sequences in reverse orientation.
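
The three criteria above can be sketched as a small predicate. This is an illustrative sketch, not the tool's code: the `Overlap` record, its field names, and the numeric thresholds in the usage lines are assumptions, and the quality score is taken as a precomputed input because the caption does not define how QS is calculated.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Overlap:
    """One overlapping sequence pair (hypothetical record, not from the paper)."""
    length: int        # overlap length l
    score: float       # precomputed quality score QS
    orientation: str   # '+' for forward, '-' for reverse


def passes_edge_criteria(parts: Sequence[Overlap], min_len: int, min_score: float) -> bool:
    """Check the caption's criteria for a single-end (1 part) or paired-end (2 parts) overlap."""
    if not parts:
        return False
    # Criteria 1 and 2: every part needs l >= L and QS >= delta.
    for part in parts:
        if part.length < min_len or part.score < min_score:
            return False
    # Criterion 3: for paired-end reads, both mates must share one orientation.
    return len({part.orientation for part in parts}) == 1


# Usage with made-up thresholds L = 100 and delta = 0.9.
single = [Overlap(150, 0.97, "+")]
paired_ok = [Overlap(120, 0.95, "-"), Overlap(110, 0.93, "-")]
paired_short = [Overlap(120, 0.95, "+"), Overlap(80, 0.99, "+")]
paired_mixed = [Overlap(120, 0.95, "+"), Overlap(110, 0.93, "-")]
```

With these inputs, `single` and `paired_ok` pass, `paired_short` fails the length test on its second mate, and `paired_mixed` fails the orientation test.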
Figure 5.
Algorithmic details. (A) Read orientations: Given an edge uv with orientations (−,+), if u is labeled +, the induced label for v is −, whereas if u is labeled −, the induced label for v is +. This procedure leads to a vertex labeling in O(V) time. (B) Transitive edges: An edge uw is called nontransitive (shown in black) if there is no vertex v such that edges uv and vw both exist. It is called single transitive (shown in green) if, for every vertex v with edges uv and vw, at least one of those two edges is nontransitive. It is called double transitive (shown in red) if there is a vertex v with edges uv and vw that are both transitive. (C) Read clustering by cliques (top) or by pairs (bottom). (D) Error correction: When a consensus sequence is constructed from a cluster of reads, the extremities are removed.
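
The edge classes in panel B can be computed directly from their definitions. The sketch below assumes a directed graph given as (source, target) pairs and uses a simple two-pass scan; the function name, the label strings, and the five-node example are illustrative assumptions, not taken from the paper.

```python
def classify_edges(edges):
    """Label each directed edge as nontransitive, single transitive, or double transitive."""
    edge_set = set(edges)
    successors = {}
    for u, v in edge_set:
        successors.setdefault(u, set()).add(v)

    def intermediates(u, w):
        # All vertices v (other than u and w) with edges u -> v and v -> w.
        return [v for v in successors.get(u, ())
                if v != u and v != w and (v, w) in edge_set]

    # Pass 1: an edge is transitive if at least one intermediate vertex exists.
    is_transitive = {edge: bool(intermediates(*edge)) for edge in edge_set}

    # Pass 2: split the transitive edges into single and double transitive.
    labels = {}
    for u, w in edge_set:
        mids = intermediates(u, w)
        if not mids:
            labels[(u, w)] = "nontransitive"
        elif any(is_transitive[(u, v)] and is_transitive[(v, w)] for v in mids):
            labels[(u, w)] = "double transitive"
        else:
            labels[(u, w)] = "single transitive"
    return labels


# Hypothetical example: five vertices with an edge i -> j for every i < j.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
labels = classify_edges(edges)
```

In this example the edges between neighbours such as 0 -> 1 are nontransitive, 0 -> 2 and 0 -> 3 are single transitive, and only 0 -> 4 is double transitive, because both 0 -> 2 and 2 -> 4 are themselves transitive.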

References

    1. Altschul S, Gish W, Miller W, Myers E, Lipman D. 1990. Basic local alignment search tool. J Mol Biol 215: 403–410.
    2. Astrovskaya I, Tork B, Mangul S, Westbrooks K, Mandoiu I, Balfe P, Zelikovsky A. 2011. Inferring viral quasispecies from 454 pyrosequencing reads. BMC Bioinformatics 12: S1.
    3. Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov A, Lesin V, Nikolenko S, Pham S, Prijbelski A, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19: 455–477.
    4. Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, et al. 2013. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2: 10.
    5. Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34: 525–527.