2010 Sep 15;26(18):i420-5.
doi: 10.1093/bioinformatics/btq365.

Characteristics of 454 pyrosequencing data--enabling realistic simulation with flowsim

Susanne Balzer et al. Bioinformatics.

Erratum in

  • Bioinformatics. 2011 Aug 1;27(15):2171

Abstract

Motivation: The commercial launch of 454 pyrosequencing in 2005 was a milestone in genome sequencing in terms of performance and cost. Throughout the three available releases, average read lengths have increased to approximately 500 base pairs and are thus approaching read lengths obtained from traditional Sanger sequencing. Study design of sequencing projects would benefit from being able to simulate experiments.

Results: We explore 454 raw data to investigate its characteristics and derive empirical distributions for the flow values generated by pyrosequencing. Based on our findings, we implement Flowsim, a simulator that generates realistic pyrosequencing data files of arbitrary size from a given set of input DNA sequences. We finally use our simulator to examine the impact of sequence lengths on the results of concrete whole-genome assemblies, and we suggest its use in planning of sequencing projects, benchmarking of assembly methods and other fields.

Availability: Flowsim is freely available under the General Public License from http://blog.malde.org/index.php/flowsim/.

Figures

Fig. 1.
(a) A 454 flowgram: cyclic flowing during one read. The light signal strengths (flow values) are directly translated into homopolymer runs. (b) Absolute frequencies of flow values (E. coli). Left: original data, no quality-trimming; right: quality-trimmed. The trimming algorithm enhances the separation of the homopolymer length distributions and levels out discrepancies between the nucleotides, so that the curves for the four nucleotides are nearly identical.
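To illustrate how flow values map to homopolymer runs as described in panel (a), here is a minimal sketch. It assumes a repeating nucleotide flow order of TACG and simple rounding of each flow value to the nearest integer. The function name `flowgram_to_sequence` and the example values are hypothetical, and real base callers may apply additional corrections.

```python
import math

# Assumption: nucleotides are flowed cyclically in the order T, A, C, G.
FLOW_ORDER = "TACG"


def flowgram_to_sequence(flow_values, flow_order=FLOW_ORDER):
    """Translate flow values into bases by rounding each value to a run length."""
    bases = []
    for i, value in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up; treat negative noise as a run of length zero.
        run_length = max(0, math.floor(value + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical flow values for two flow cycles (8 flows).
example_flows = [1.02, 0.05, 0.97, 2.10, 0.03, 3.04, 0.12, 0.98]
print(flowgram_to_sequence(example_flows))
```

A flow value near 2 for G yields "GG", and a value near 0 means the flowed nucleotide was not incorporated.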
Fig. 2.
(a) Absolute frequencies of flow values by flow cycle. A total of 200 flow cycles of a Titanium run correspond to 200×4 = 800 flows. The first two flow cycles contain the TCAG tag and are omitted here. Towards the end of a run, flow values tend to lie further away from their ideal values (integers), but there are fewer of them because many values from later flow cycles have been trimmed away. (b) Standard deviation of flow values (measured as the difference from their closest integer), by flow cycle. Standard deviation increases almost linearly. Only flow values <5.5 were included.
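The quantity plotted in panel (b) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: flowgrams are taken to be plain lists of floats, four flows make up one cycle, the first two cycles are skipped, and the population standard deviation is used. The name `deviation_sd_by_cycle` and the constants are hypothetical.

```python
import math
from collections import defaultdict
from statistics import pstdev

FLOWS_PER_CYCLE = 4    # one flow per nucleotide, so 200 cycles give 800 flows
MAX_FLOW_VALUE = 5.5   # only flow values below this threshold are included


def deviation_sd_by_cycle(flowgrams, skip_cycles=2):
    """Standard deviation of (flow value - closest integer), per flow cycle."""
    residuals = defaultdict(list)
    for flowgram in flowgrams:
        for i, value in enumerate(flowgram):
            cycle = i // FLOWS_PER_CYCLE
            if cycle < skip_cycles or value >= MAX_FLOW_VALUE:
                continue
            closest_integer = math.floor(value + 0.5)
            residuals[cycle].append(value - closest_integer)
    # Population standard deviation is an assumption of this sketch.
    return {cycle: pstdev(values) for cycle, values in sorted(residuals.items())}
```

Plotting the returned values against the cycle index would show whether the spread grows roughly linearly over a run.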
Fig. 3.
Empirical distributions (smoothed average of E. coli and D. labrax) on logarithmic scale. In gray: fitted (log-) normal distributions.
Fig. 4.
De novo and reference-based N50 for E. coli. Both real and simulated 454 data were assembled using Newbler v2.3.
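For readers unfamiliar with the N50 statistic, a minimal sketch of the common definition follows: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. The function name `n50` and the example lengths are hypothetical. How the reference-based variant is normalized is not specified in this record, so only the assembly-based form is shown.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```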
