Comparative Study
BMC Bioinformatics. 2012 Jun 25;13 Suppl 10(Suppl 10):S6.
doi: 10.1186/1471-2105-13-S10-S6.

Efficient error correction for next-generation sequencing of viral amplicons

Pavel Skums et al. BMC Bioinformatics.

Abstract

Background: Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error-prone, so the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that errors are randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.

Results: In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.
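
The abstract does not spell out how KEC or ET work, so the following Python sketch is only a generic illustration of the two ideas the names suggest: counting k-mers across reads to locate stretches of rare k-mers, and discarding haplotypes whose observed frequency falls below a cutoff. The function names, the `min_count` parameter and the fixed `min_fraction` cutoff are assumptions made for illustration; they are not the authors' implementation, and the published ET threshold is calibrated empirically rather than fixed by hand.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmer_runs(read, counts, k, min_count):
    """Return half-open (start, end) runs of consecutive k-mer start
    positions whose k-mers occur fewer than min_count times.
    (Hypothetical helper: treats such runs as candidate error regions.)"""
    runs = []
    start = None
    n_kmers = len(read) - k + 1
    for i in range(n_kmers):
        rare = counts[read[i:i + k]] < min_count
        if rare and start is None:
            start = i
        elif not rare and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n_kmers))
    return runs


def frequency_filter(reads, min_fraction):
    """Keep haplotypes (identical reads) whose relative frequency is at
    least min_fraction. The fixed cutoff is an illustrative assumption."""
    haplotypes = Counter(reads)
    total = len(reads)
    return {h: c / total for h, c in haplotypes.items() if c / total >= min_fraction}


# Toy data: five copies of one sequence and one read with a substitution.
reads = ["ACGTACGTAC"] * 5 + ["ACGTTCGTAC"]
counts = kmer_counts(reads, k=4)
print(rare_kmer_runs("ACGTTCGTAC", counts, k=4, min_count=2))
print(frequency_filter(reads, min_fraction=0.2))
```

In this toy input, the substituted base makes every 4-mer that overlaps it unique, so the rare run brackets the error, while the frequency filter keeps only the dominant sequence.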

Conclusions: Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.

Figures

Figure 1. Frequency distribution of error-region lengths in a sample of amplicon sequences (dataset M1, k = 25).
Figure 2. Frequency of the true haplotype in single-clone samples. Red bars show the percentage of all reads with the true haplotype and green bars show the frequency of the most common false haplotype.
Figure 3. Minimum spanning tree of single-clone sample S6. Each node is a unique haplotype. The diameter of the node is proportional to the square root of its frequency. The true haplotype is shown in red, haplotypes with indel errors only are shown in yellow, haplotypes with nucleotide substitutions only are shown in blue and haplotypes with both types of errors are shown in green.
Figure 4. Error profile of single-clone samples. Three types of errors are shown: nucleotide replacements, non-homopolymer indels and indels in homopolymers.
Figure 5. Distribution of homopolymer indels by homopolymer size, averaged over all 14 samples. For notational convenience, single nucleotides are treated as homopolymers of length 1, so a homopolymer indel in a homopolymer of length 1 is an insertion that creates a homopolymer of length 2. The blue bars (left y-axis) show the number of homopolymer indels per read. The red line (left y-axis) shows the fraction of expected homopolymers of that size that contain errors. The green line (right y-axis) shows the percentage of homopolymers of that size that can be found in the real sequence.
Figure 6. Algorithm comparison: the number of missing true haplotypes.
Figure 7. Algorithm comparison: the number of false haplotypes.
Figure 8. Algorithm comparison: frequency of true haplotypes.
Figure 9. Algorithm comparison: the average Hamming distance between false haplotypes and their true targets.
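
The Hamming-distance metric named in the Figure 9 caption can be illustrated with a short Python sketch. It assumes that haplotypes are aligned strings of equal length and that the "true target" of a false haplotype is its nearest true haplotype; neither assumption is confirmed by the text here, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mean_distance_to_nearest_true(false_haps, true_haps):
    """Average, over false haplotypes, of the Hamming distance to the
    closest true haplotype (assumed here to be the 'true target')."""
    if not false_haps:
        return 0.0
    total = sum(min(hamming(f, t) for t in true_haps) for f in false_haps)
    return total / len(false_haps)


# Toy example with two true and two false haplotypes.
true_haps = ["ACGTAC", "TTGTAC"]
false_haps = ["ACGTAA", "TTGAAA"]
print(mean_distance_to_nearest_true(false_haps, true_haps))
```

Under this reading, a lower average means that the false haplotypes left after correction sit closer to real sequences.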
