Genome Res. 2009 Oct;19(10):1884-95.
doi: 10.1101/gr.095299.109. Epub 2009 Aug 6.

BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing

Wei-Chun Kao et al. Genome Res. 2009 Oct.

Abstract

Extracting sequence information from raw fluorescence images is the foundation of several high-throughput sequencing platforms. Some of the main challenges associated with this technology include reducing the error rate, assigning accurate base-specific quality scores, and reducing the cost of sequencing by increasing the throughput per run. To demonstrate how computational advances can help meet these challenges, a novel model-based base-calling algorithm, BayesCall, is introduced for the Illumina sequencing platform. Built on the tools of statistical learning, BayesCall is flexible enough to incorporate various features of the sequencing process; in particular, it can easily incorporate time-dependent parameters and model residual effects. This approach significantly improves accuracy over Illumina's base-caller Bustard, particularly in the later cycles of a sequencing run. For 76-cycle data on a standard viral sample, phiX174, BayesCall reduces the average per-base error rate by approximately 51% relative to Bustard. The probability of observing each base can be readily computed in BayesCall, and this probability can be transformed into a useful base-specific quality score with high discrimination ability. A detailed study of BayesCall's performance is presented here.
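
As a rough illustration of turning a per-base probability into a quality score, the sketch below applies the common Phred-style scaling Q = -10 log10(1 - p), where p is the probability that the called base is correct. The Phred scaling, the function names `phred_quality` and `call_base`, and the cap of 60 are assumptions made for illustration; the abstract does not state which transformation BayesCall uses.

```python
import math


def phred_quality(p_correct, max_q=60.0):
    """Phred-style score from the probability that the called base is correct.

    Hypothetical helper: the Phred scaling and the cap max_q are assumptions,
    not details taken from the abstract.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    p_error = 1.0 - p_correct
    if p_error <= 0.0:
        # A probability of exactly 1 would give an infinite score; cap it.
        return max_q
    return min(max_q, -10.0 * math.log10(p_error))


def call_base(posteriors):
    """Return the most probable base and its quality score.

    posteriors: dict mapping each base ('A', 'C', 'G', 'T') to its probability.
    """
    base = max(posteriors, key=posteriors.get)
    return base, phred_quality(posteriors[base])
```

Under this scaling, a base called with probability 0.99 receives a score of about 20, and one called with probability 0.9 a score of about 10.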

Figures

Figure 1.
The graphical model corresponding to our base-calling algorithm for cluster k. The observed random variables are the intensities It,k. Base-calling is done by finding the MAP estimates of St,k. In this example, the window size is 3, with l = r = 1. See Methods for a detailed description.
Figure 2.
Convergence of simulated annealing with 1000, 5000, 10,000, and 20,000 total iterations. The temperature parameter in the ith iteration of simulated annealing is taken as (ni + 1)/n, with n being the total number of iterations. Although a larger value of n attains a slightly higher likelihood, the inferred MAP estimate of Sk changes little.
Figure 3.
Estimated cycle-dependent parameters for four different tiles of the 76-cycle phiX174 data set. These plots illustrate that parameters change over time and that the residual effect associated with αt tends to grow with time. (A) dt, (B) αt.
Figure 4.
Observed intensities and the decomposition of μt,k for a particular cluster k in the 76-cycle phiX174 data. (A) Observed intensities It,k. (B) The contribution of Λt,kXtSkQtω to μt,k. (C) The contribution of the residual effect αt (1 − dt) It−1,k.
Figure 5.
Base-calling results for the particular cluster discussed in Figure 4. Our method BayesCall called all 76 bases correctly for this particular cluster. In contrast, Bustard made 14 base-calling errors, with 13 of them incorrectly called as T. Alta-Cyclic made three errors, with all of them incorrectly called as T. (*) Base-calling errors. In general, both Bustard and Alta-Cyclic tend to suffer more from the “anomalous T” effect than does our method. (See text for details.)
Figure 6.
Average per-cycle error rates and histograms for the number of errors per read. BayesCall had substantially lower error rates in later cycles compared with Bustard, and the difference tended to increase with cycles. Further, BayesCall produced substantially more perfect reads than did Bustard. Alta-Cyclic was run only on the 76-cycle phiX174 data set. BayesCall had a lower average error rate than Alta-Cyclic for all cycles. Note that although Alta-Cyclic is more accurate than Bustard in later cycles, the opposite is true for earlier cycles. (A) Results for the 36-cycle data set. (B) Results for the 76-cycle data set.
Figure 7.
Heat plot of joint errors in Bustard and BayesCall for the 76-cycle phiX174 data. This plot depicts the joint error matrix shown in Table 3. The (x,y) entry in the plot corresponds to log2 of the number of reads with x errors in Bustard and y errors in BayesCall. This plot clearly illustrates that BayesCall generally produces sequence reads with substantially fewer errors than those produced by Bustard.
Figure 8.
Discrimination ability D(ɛ) of quality scores at error tolerance ɛ. We define D(ɛ) as the number of correctly called bases at error tolerance ɛ. BayesCall maintains a high discrimination ability that exceeds both Bustard's and Alta-Cyclic's. (A) Results for the 36-cycle phiX174 data set. (B) Results for the 76-cycle phiX174 data set.
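
The caption for Figure 8 defines D(ɛ) only briefly, so the following is a minimal sketch of one plausible reading, not the authors' exact procedure. It assumes that called bases are ranked from highest to lowest quality score and that D(ɛ) counts the correct calls in the largest top-ranked set whose empirical error rate does not exceed ɛ. The function name `discrimination_ability` and its arguments are hypothetical.

```python
def discrimination_ability(quality_scores, is_correct, eps):
    """Count correct calls in the largest top-quality set with error rate <= eps.

    quality_scores: one score per called base (higher means more confident).
    is_correct:     one boolean per called base (True if the call was right).
    eps:            error tolerance, a fraction in [0, 1].

    Assumed reading of D(eps); the source caption gives no further detail.
    """
    # Rank bases from most to least confident.
    ranked = sorted(zip(quality_scores, is_correct),
                    key=lambda pair: pair[0], reverse=True)
    errors = 0
    best = 0
    for n, (_, correct) in enumerate(ranked, start=1):
        if not correct:
            errors += 1
        # Record the correct-call count whenever the prefix meets the tolerance.
        if errors / n <= eps:
            best = n - errors
    return best
```

For example, with scores [40, 35, 30, 20, 10, 5] and correctness [True, True, True, False, True, False], a tolerance of 0 admits only the first three calls, while a tolerance of 0.25 admits the first five, four of which are correct. A base-caller whose quality scores rank errors last will therefore show a larger D(ɛ) at every tolerance.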
