BMC Bioinformatics. 2011;12 Suppl 6(Suppl 6):S1.
doi: 10.1186/1471-2105-12-S6-S1. Epub 2011 Jul 28.

Inferring viral quasispecies spectra from 454 pyrosequencing reads


Irina Astrovskaya et al. BMC Bioinformatics. 2011.

Abstract

Background: RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences.

Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 out of 40 quasispecies sequences. While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors due to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads. Indeed, the 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (they do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches, whereas ViSpA correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 of these 5 sequences were inferred without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html.
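
The viability criterion used above (no internal stop codons) can be illustrated with a minimal Python sketch. This is not ViSpA code: the function names, the `frame` parameter, and the choice to treat the last complete codon as a possible legitimate terminal stop are all assumptions made for illustration.

```python
# Hypothetical helper names; this is an illustrative sketch, not part of ViSpA.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def has_internal_stop(seq, frame=0):
    """Return True if any codon before the last complete one is a stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    # Split into complete codons starting at the assumed reading frame.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Assumption: the final complete codon may be a terminal stop, so skip it.
    return any(codon in STOP_CODONS for codon in codons[:-1])


def is_viable(seq, frame=0):
    """A sequence is called viable here if it has no internal stop codon."""
    return not has_internal_stop(seq, frame)
```

For example, `is_viable("ATGGCCTAA")` holds because the only stop codon is terminal, while `is_viable("ATGTAAGCC")` does not because `TAA` interrupts the reading frame.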

Conclusions: ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.


Figures

Figure 1
ViSpA’s flowchart.
Figure 2
Statistical validation on error-free reads from known HCV quasispecies. Left: PPV and sensitivity as a function of the number of quasispecies in the original population (40K reads with average read length 300). Right: the relative entropy as a function of the average read length (40K reads from 10 quasispecies).
Figure 3
Percentage of candidate sequences whose cumulative frequency is 85%, 90%, and 95%. The values on the x-axis correspond to the number of mismatches allowed during read graph construction: n_m means that up to n mismatches are allowed in superreads and up to m mismatches are allowed in edges.
Figure 4
The neighbor-joining phylogenetic tree for the 10 most frequent HCV quasispecies variants on a 5,205-bp-long fragment obtained by ViSpA and ShoRAH. Sequences are labeled with the software name and their rank among the 10 most frequent assembled sequences.
Figure 5
Percentage of runs in which the i-th most frequent sequence is reproduced among the 10 most frequent quasispecies assembled on the 10%-reduced set of reads. The i-th point on the x-axis corresponds to the i-th most frequent sequence assembled on 100% of the reads. No data are shown for sequences that are reproduced in fewer than 5% of runs.
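
The validation measures named in the Figure 2 caption (PPV, sensitivity, relative entropy) are not defined on this page. The following is a minimal Python sketch under assumed, standard definitions: PPV is the fraction of assembled sequences that exactly match a true quasispecies, sensitivity is the fraction of true quasispecies that were assembled, and relative entropy is the Kullback-Leibler divergence of the estimated frequencies from the true ones. The function names and the `eps` smoothing constant are hypothetical, and the paper may define these quantities differently.

```python
import math


def ppv_and_sensitivity(true_seqs, assembled_seqs):
    """Assumed definitions: exact sequence identity counts as a true positive."""
    true_set, assembled_set = set(true_seqs), set(assembled_seqs)
    tp = len(true_set & assembled_set)  # correctly assembled sequences
    ppv = tp / len(assembled_set) if assembled_set else 0.0
    sensitivity = tp / len(true_set) if true_set else 0.0
    return ppv, sensitivity


def relative_entropy(true_freq, est_freq, eps=1e-12):
    """Kullback-Leibler divergence sum(p * ln(p / q)) over true sequences.

    eps is a hypothetical floor that avoids division by zero when a true
    sequence is missing from the estimate.
    """
    total = 0.0
    for seq, p in true_freq.items():
        if p > 0:
            q = max(est_freq.get(seq, 0.0), eps)
            total += p * math.log(p / q)
    return total
```

Under these assumed definitions, assembling 29 of 40 true sequences, as reported in the abstract for error-free reads, would correspond to a sensitivity of 29/40 = 0.725. The PPV cannot be derived from the abstract because the total number of assembled candidates is not given there.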

