Comparative Study

BMC Bioinformatics. 2012 Jun 25;13 Suppl 10(Suppl 10):S6.

doi: 10.1186/1471-2105-13-S10-S6.

Efficient error correction for next-generation sequencing of viral amplicons

Pavel Skums¹, Zoya Dimitrova, David S Campo, Gilberto Vaughan, Livia Rossi, Joseph C Forbi, Jonny Yokosawa, Alex Zelikovsky, Yury Khudyakov

Affiliation

¹ Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, Atlanta, GA 30333, USA. kki8@cdc.gov

PMID: 22759430
PMCID: PMC3382444
DOI: 10.1186/1471-2105-13-S10-S6

Pavel Skums et al. BMC Bioinformatics. 2012.

Abstract

Background: Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.

Results: In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.

Conclusions: Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and the data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.
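
The abstract names two ideas, k-mer-based correction and a frequency threshold, without giving their details. The following is a minimal sketch of the general concepts only, not the authors' KEC or ET implementations: it counts k-mers across reads, flags positions whose k-mers are rare, and keeps distinct reads above a relative-frequency cutoff. The function names, the toy reads, and the values of `k`, `min_count`, and `min_fraction` are all assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read, counts, k, min_count):
    """Return start positions of k-mers seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


def frequent_haplotypes(reads, min_fraction):
    """Keep distinct reads whose relative frequency is at least min_fraction."""
    total = len(reads)
    return {h: n / total for h, n in Counter(reads).items()
            if n / total >= min_fraction}


# Toy data (hypothetical): 8 copies of one sequence, 2 copies of a variant.
reads = ["ACGTACGT"] * 8 + ["ACGTTCGT"] * 2
counts = kmer_counts(reads, k=4)

print(weak_positions("ACGTTCGT", counts, k=4, min_count=3))  # [1, 2, 3, 4]
print(frequent_haplotypes(reads, min_fraction=0.5))          # {'ACGTACGT': 0.8}
```

In this toy case the rare k-mers cluster around the single differing base, which is the intuition behind locating candidate error regions from k-mer frequencies. A fixed cutoff such as `min_count=3` is a simplification; the abstract stresses that error rates depend on homopolymers, position, and amplicon length, so a real threshold would need calibration.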

Figures

**Figure 1**
**Frequency distribution of error-region lengths in a sample of amplicon sequences (dataset M1, k = 25)**.

**Figure 2**
**Frequency of the true haplotype in single-clone samples**. Red bars show the percentage of all reads with the true haplotype and green bars show the frequency of the most common false haplotype.

**Figure 3**
**Minimum spanning tree of single-clone sample S6**. Each node is a unique haplotype. The diameter of the node is proportional to the square root of its frequency. The true haplotype is shown in red, haplotypes with indel errors only are shown in yellow, haplotypes with nucleotide substitutions only are shown in blue and haplotypes with both types of errors are shown in green.

**Figure 4**
**Error profile of single-clone samples**. Three types of errors are shown: nucleotide replacements, non-homopolymer indels and homopolymer indels.

**Figure 5**
**Homopolymer indels distribution according to size**. Average homopolymer statistics over all 14 samples. For notational convenience, single nucleotides are treated as homopolymers of length 1, so an indel in a homopolymer of length 1 is an insertion that creates a homopolymer of length 2. The blue bars (left y-axis) show the number of homopolymer indels per read. The red line (left y-axis) shows the fraction of expected homopolymers of that size that contain errors. The green line (right y-axis) shows the percentage of homopolymers of that size that can be found in the real sequence.

**Figure 6**
**Algorithm comparison: the number of missing true haplotypes**.

**Figure 7**
**Algorithm comparison: the number of false haplotypes**.

**Figure 8**
**Algorithm comparison: frequency of true haplotypes**.

**Figure 9**
**Algorithm comparison: the average Hamming distance between false haplotypes and their true targets**.
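
The metric named in Figure 9 can be illustrated with a short sketch. It assumes that haplotypes are already aligned to equal length and that the "true target" of a false haplotype is the nearest true haplotype by Hamming distance; the source does not state how targets are assigned, so that choice, the function names, and the toy sequences are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def mean_distance_to_nearest_true(false_haps, true_haps):
    """Average, over false haplotypes, of the distance to the closest true one."""
    nearest = [min(hamming(f, t) for t in true_haps) for f in false_haps]
    return sum(nearest) / len(nearest)


# Toy data (hypothetical sequences, not taken from the study).
true_haps = ["ACGTACGT", "ACGTTCGT"]
false_haps = ["ACGTACGA", "TCGTTCGA"]

print(mean_distance_to_nearest_true(false_haps, true_haps))  # 1.5
```

A lower average means that the false haplotypes remaining after correction sit closer to real sequences, so they distort downstream analysis less.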

Cited by

Error correction and statistical analyses for intra-host comparisons of feline immunodeficiency virus diversity from high-throughput sequencing data.
Liu Y, Chiaromonte F, Ross H, Malhotra R, Elleder D, Poss M. BMC Bioinformatics. 2015 Jun 30;16:202. doi: 10.1186/s12859-015-0607-z. PMID: 26123018. Free PMC article.
SeekDeep: single-base resolution de novo clustering for amplicon deep sequencing.
Hathaway NJ, Parobek CM, Juliano JJ, Bailey JA. Nucleic Acids Res. 2018 Feb 28;46(4):e21. doi: 10.1093/nar/gkx1201. PMID: 29202193. Free PMC article.
Deep-sequencing of the peach latent mosaic viroid reveals new aspects of population heterogeneity.
Glouzon JP, Bolduc F, Wang S, Najmanovich RJ, Perreault JP. PLoS One. 2014 Jan 30;9(1):e87297. doi: 10.1371/journal.pone.0087297. eCollection 2014. PMID: 24498066. Free PMC article.
HIV-1 tropism dynamics and phylogenetic analysis from longitudinal ultra-deep sequencing data of CCR5- and CXCR4-using variants.
Sede MM, Moretti FA, Laufer NL, Jones LR, Quarleri JF. PLoS One. 2014 Jul 17;9(7):e102857. doi: 10.1371/journal.pone.0102857. eCollection 2014. PMID: 25032817. Free PMC article.
Transmissibility of intra-host hepatitis C virus variants.
Campo DS, Zhang J, Ramachandran S, Khudyakov Y. BMC Genomics. 2017 Dec 6;18(Suppl 10):881. doi: 10.1186/s12864-017-4267-4. PMID: 29244001. Free PMC article.
