Bioinformatics. 2011 Jul 1;27(13):i304-9.
doi: 10.1093/bioinformatics/btr251.

Systematic exploration of error sources in pyrosequencing flowgram data

Susanne Balzer et al. Bioinformatics.

Abstract

Motivation: 454 pyrosequencing, by Roche Diagnostics, has emerged as an alternative to Sanger sequencing when it comes to read lengths, performance and cost, but shows higher per-base error rates. Although there are several tools available for noise removal, targeting different application fields, data interpretation would benefit from a better understanding of the different error types.

Results: By exploring 454 raw data, we quantify to what extent different factors account for sequencing errors. In addition to the well-known homopolymer length inaccuracies, we have identified errors likely to originate from other stages of the sequencing process. We use our findings to extend the flowsim pipeline with functionalities to simulate these errors, and thus enable a more realistic simulation of 454 pyrosequencing data with flowsim.

Availability: The flowsim pipeline is freely available under the General Public License from http://biohaskell.org/Applications/FlowSim.

Contact: susanne.balzer@imr.no.
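
To illustrate the homopolymer length inaccuracies mentioned under Results, the following Python sketch shows how flow values could be turned into base calls by naive rounding. It is not the interval-based approach of the paper, nor the flowsim implementation. The flow order TACG, the function names and the example flow values are assumptions made for illustration only.

```python
# Minimal sketch: naive base calling from a pyrosequencing flowgram.
# Assumptions (not taken from the paper): flow order "TACG", rounding
# to the nearest integer, and the made-up example values below.

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order


def call_lengths(flow_values):
    """Round each flow value to the nearest homopolymer length (half rounds up)."""
    return [max(0, int(value + 0.5)) for value in flow_values]


def call_bases(flow_values, flow_order=FLOW_ORDER):
    """Translate flow values into a base sequence, one flow at a time."""
    bases = []
    for i, length in enumerate(call_lengths(flow_values)):
        nucleotide = flow_order[i % len(flow_order)]  # nucleotide flowed at step i
        bases.append(nucleotide * length)
    return "".join(bases)


def count_length_errors(flow_values, true_lengths):
    """Count flows whose called homopolymer length differs from the true one."""
    called = call_lengths(flow_values)
    return sum(1 for c, t in zip(called, true_lengths) if c != t)


# Hypothetical flowgram covering two flow cycles (8 flows).
example_flows = [1.02, 0.07, 2.46, 0.96, 0.12, 1.51, 0.03, 3.12]
# Hypothetical true homopolymer lengths for the same flows.
example_truth = [1, 0, 3, 1, 0, 1, 0, 3]

print(call_bases(example_flows))                          # TCCGAAGGG
print(count_length_errors(example_flows, example_truth))  # 2
```

In this made-up example, the flow values 2.46 and 1.51 lie close to the boundary between two homopolymer lengths, so rounding yields an undercall (CC instead of CCC) and an overcall (AA instead of A).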

Figures

Fig. 1. Empirical flow value distributions (D. labrax) and derived intervals.
Fig. 2. Bins for homopolymer lengths 0, 1 and 2, based on different flow value interval sizes from Table 1.
Fig. 3. Flow value histograms for G. morhua mate-pair reads (forward matches, N = 7 016 764). The y-axis is on a log10 scale. The 15 flow cycles correspond to the 42 positions of the linker sequence. The gray areas contain correct base calls. Subpeaks point toward putative PCR errors.
Fig. 4. Putative PCR and pyrosequencing error rates with respect to flow cycles (for underlying flow value intervals of size 5 and 95%).
