Genome Biol. 2009;10(8):R83.
doi: 10.1186/gb-2009-10-8-r83. Epub 2009 Aug 14.

Improved base calling for the Illumina Genome Analyzer using machine learning strategies

Martin Kircher et al. Genome Biol. 2009.

Abstract

The Illumina Genome Analyzer generates millions of short sequencing reads. We present Ibis (Improved base identification system), an accurate, fast and easy-to-use base caller that significantly reduces the error rate and increases the output of usable reads. Ibis is faster and more robust with respect to chemistry and technology than other publicly available packages. Ibis is freely available under the GPL from http://bioinf.eva.mpg.de/Ibis/.

Figures

Figure 1
Intensity values for one tile of a 51-cycle PhiX 174 RF1 run before and after correction by Bustard. On this tile 115,288 clusters were identified by the image analysis software Firecrest. Shown are the 95th percentile for the signal intensities in each channel and cycle. The raw intensities are shown with dashed lines, the intensities after transformation by Bustard are shown with solid lines. Intensities for A, C, and G decline over the run while the intensities for T stay nearly constant. Both effects can be explained by degradation of the fluorophores or non-reversible termination of sequences over the run as well as the accumulation of T fluorophores on the synthesized strand. Intensities for the first cycle are lower than in other cycles due to dimming and bleaching caused by longer handling times before imaging of the first cycle. Corrected intensities for the last and first cycle do not follow the normal trend, since full phasing correction cannot be applied.
Figure 2
Analysis of mismatches. Mismatches seen for (a) Bustard raw reads and (b) Ibis raw reads of a lane with 11,478,043 PhiX 174 RF1 raw reads sequenced with 51 cycles and mapped to the corresponding reference genome allowing up to 5 mismatches (including N characters). For Bustard, 9,110,666 (79.4%) raw reads can be mapped; for Ibis, 9,695,354 (84.5%). The sequencing error, measured as the mismatch rate, increases with cycle number. For Bustard, starting around cycle 25, guanine is mistaken for thymine. In later cycles adenine and cytosine are also mistaken for thymine, due to increasing T accumulation. The error rate of the last base is especially high due to incomplete phasing correction. These patterns of specific base mismatches are not observed when Ibis is used.
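The per-cycle mismatch rate and the base-specific substitution counts described in this caption can be illustrated with a minimal Python sketch. It assumes the reads have already been mapped without gaps, so each read pairs position by position with an equal-length reference segment; the function name `mismatch_profile` and the input layout are hypothetical and are not taken from Ibis or from the paper.

```python
from collections import Counter


def mismatch_profile(pairs):
    """Tabulate mismatches per sequencing cycle.

    `pairs` is an iterable of (read, reference) strings of equal length,
    i.e. gap-free alignments (an assumption of this sketch).
    Returns (rates, substitutions): `rates` maps cycle -> mismatch rate,
    `substitutions` counts (cycle, reference_base, called_base) triples.
    """
    totals = Counter()
    mismatches = Counter()
    substitutions = Counter()
    for read, reference in pairs:
        # Cycle numbers start at 1, matching the convention in the caption.
        for cycle, (called, true) in enumerate(zip(read, reference), start=1):
            totals[cycle] += 1
            if called != true:
                mismatches[cycle] += 1
                substitutions[(cycle, true, called)] += 1
    rates = {cycle: mismatches[cycle] / totals[cycle] for cycle in sorted(totals)}
    return rates, substitutions


# Toy input: the second read calls T where the reference has G at cycle 3.
rates, substitutions = mismatch_profile([("ACGT", "ACGT"), ("ACTT", "ACGT")])

# Mapped fractions recomputed from the read counts given in the caption.
bustard_fraction = 9_110_666 / 11_478_043
ibis_fraction = 9_695_354 / 11_478_043
```

A rising `rates` value at late cycles together with large counts for triples such as `(cycle, "G", "T")` would correspond to the G-to-T pattern the caption reports for Bustard.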
Figure 3
Fraction of mapped reads and corresponding number of mismatches for the three tested lanes. (a) The result for one lane of human shotgun sequence analyzed on a 26-cycle Genome Analyzer I run (A1); (b) the PhiX control lane of the same 26-cycle Genome Analyzer I run (A2); (c) the PhiX control lane of a 51-cycle Genome Analyzer II run (B). The raw sequences of all three lanes were mapped to the corresponding reference genome (hg18/NCBI Build 36.1 and PhiX 174 RF1) with up to five mismatches but no gaps using SOAP v1.11. For A1, further analyses were restricted to sequences mapping with at most two mismatches, to reduce the number of false-positive placements expected when mapping short reads to a large genome sequence.
Figure 4
Comparison of quality scores for the 51-cycle PhiX control lane data. Quality scores reported by Bustard, Alta-Cyclic and Ibis are compared on the PHRED scale. For all three base callers, we considered only quality scores reported with 100,000 or more observations. Measured as deviation from the optimal line, Bustard has a root mean square deviation of 84.9, Alta-Cyclic of 19.3 and Ibis of 0.9.
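As a rough illustration of how such a deviation could be computed, the sketch below converts observed error rates to the PHRED scale, Q = -10 log10(p), and takes the root mean square difference between reported and observed quality over all sufficiently populated score bins. The bin layout, the function names, and the unweighted average are assumptions made for this sketch; the caption does not specify the exact procedure used in the paper.

```python
import math


def phred(error_probability):
    """Convert an error probability to a PHRED-scaled quality score."""
    return -10.0 * math.log10(error_probability)


def quality_rmsd(bins, min_observations=100_000):
    """Root mean square deviation of reported from observed quality.

    `bins` is an iterable of (reported_q, n_observations, n_errors).
    Bins with fewer than `min_observations` bases are skipped, mirroring
    the cutoff in the caption. Bins without errors are also skipped,
    because their observed PHRED score would be infinite (sketch choice).
    """
    squared = []
    for reported_q, n_observations, n_errors in bins:
        if n_observations < min_observations or n_errors == 0:
            continue
        observed_q = phred(n_errors / n_observations)
        squared.append((reported_q - observed_q) ** 2)
    if not squared:
        raise ValueError("no bin passed the observation cutoff")
    return math.sqrt(sum(squared) / len(squared))


# Perfectly calibrated toy data: Q20 bases err 1 in 100, Q30 bases 1 in 1,000.
calibrated = quality_rmsd([(20, 1_000_000, 10_000), (30, 1_000_000, 1_000)])

# Miscalibrated toy data: reported Q10, but the observed error rate is 1 in 100.
miscalibrated = quality_rmsd([(10, 1_000_000, 10_000)])
```

On this definition a perfectly calibrated base caller lies on the optimal line and has a deviation of zero, so smaller values indicate quality scores that track the observed error rate more closely.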
