Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Apr 15;29(8):1035-43.
doi: 10.1093/bioinformatics/btt087. Epub 2013 Feb 21.

EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments

Affiliations

EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments

Ning Leng et al. Bioinformatics. .

Erratum in

  • Bioinformatics. 2013 Aug 15;29(16):2073

Abstract

Motivation: Messenger RNA expression is important in normal development and differentiation, as well as in manifestation of disease. RNA-seq experiments allow for the identification of differentially expressed (DE) genes and their corresponding isoforms on a genome-wide scale. However, statistical methods are required to ensure that accurate identifications are made. A number of methods exist for identifying DE genes, but far fewer are available for identifying DE isoforms. When isoform DE is of interest, investigators often apply gene-level (count-based) methods directly to estimates of isoform counts. Doing so is not recommended. In short, estimating isoform expression is relatively straightforward for some groups of isoforms, but more challenging for others. This results in estimation uncertainty that varies across isoform groups. Count-based methods were not designed to accommodate this varying uncertainty, and consequently, application of them for isoform inference results in reduced power for some classes of isoforms and increased false discoveries for others.

Results: Taking advantage of the merits of empirical Bayesian methods, we have developed EBSeq for identifying DE isoforms in an RNA-seq experiment comparing two or more biological conditions. Results demonstrate substantially improved power and performance of EBSeq for identifying DE isoforms. EBSeq also proves to be a robust approach for identifying DE genes.

Availability and implementation: An R package containing examples and sample datasets is available at http://www.biostat.wisc.edu/kendzior/EBSEQ/.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Panel (a) shows two hypothetical genes g and formula image. Gene g has one isoform, denoted by formula image; gene formula image has two (formula image). The problem of estimating expression for isoforms of formula image is complicated by the fact that reads mapping to exon 2 must be unambiguously assigned to each isoform. This results in increased uncertainty, on average, in expression estimates for isoforms sharing a parent. Panel (b) shows hypothetical expression of the isoforms from gene formula image in each of two conditions (assuming differences in library size have been accommodated). If one focuses on the longest isoform (isoform 1) and uses all reads mapping to its constituent isoforms to estimate its expression, the isoform is called equivalently expressed, as there are 30 (6 + 22 + 2) reads mapped in condition 1 and 30 (10 + 16 + 4) mapped in condition 2. However, if the expression of other isoforms is considered, it becomes clear that isoform 1 contains almost twice as many reads in condition 2 as in condition 1 (23 versus 13, respectively). Panel (c) demonstrates how estimation uncertainty changes as isoform complexity increases. We quantified isoform complexity here by Ig where the formula image group represents isoforms from genes with k isoforms (here isoforms from genes with more than three isoforms are included in the formula image group; alternative definitions of complexity are discussed in the text). Shown top right are splines fit to the empirical variance as a function of the mean for all isoforms as well as isoforms within groups defined by Ig for the two-group human embryonic stem cell RNA-seq experiment described in Section 2; bottom right considers isoforms with average expression (expected count) in [100, 500]. The range was chosen as it approximates the 50th and 80th percentiles of expression across all isoforms. Shown are box plots of the variances of these isoforms collectively, and within Ig group. Median variance within each group is shown right
Fig. 2.
Fig. 2.
Panel (a) shows ROC curves (TPR versus FPR). The curves are obtained from averaging over 100 Sim I simulations. Cuffdiff2 deems some isoforms unacceptable prior to analysis; isoforms deemed acceptable by Cuffdiff2 are denoted ‘OK’; and results are reported here for both (see Section 2 for more details). Panel (b) shows the operating characteristics of EBSeq and Cuffdiff2 as a function of ϕ, described in Sim II. The solid and dashed lines indicate power (TPR) and FDR, respectively, at 5% target FDR. Note that BitSeq provides a PPLR for rank ordering isoforms, but does not detail how to use the PPLR scores to control FDR for two-sided test. Consequently, BitSeq is evaluated when ranking isoforms in Panel (a), but not when FDR-controlled lists are considered in Panel (b). Panel (c) shows the CDF of ϕ in four empirical datasets, detailed in Section 2 and the Supplementary Material, as well as the CDF averaged across 100 Sim I simulations
Fig. 3.
Fig. 3.
Panel (a) shows the fold changes (log2 scale) from RNA-seq versus PCR for the 57 genes identified by EBSeq but not Cuffdiff2 and the 17 genes identified by Cuffdiff2 but not EBSeq out of the 716 gold-standard genes from the MAQC dataset. Panel (b) shows box plots of the absolute value of RNA-seq fold changes (log2 scale) for the same 57 and 17 genes, as well as the 2131 and 457 genes identified exclusively by EBSeq and Cuffdiff2, respectively, in the full set of genes. Panel (c) shows ROC curves for baySeq, DESeq, edgeR, Cuffdiff2 and EBSeq. Cuffdiff2 and EBSeq are applied to gene expression estimated via Cufflinks. Results from EBSeq applied to gene expression counts derived from HTSeq are similar (data not shown). The ROC curves based on another threshold in Bullard et al. are shown in panel (d)
Fig. 4.
Fig. 4.
Results are shown for the human embryonic stem cell case study, which compares ESCs with iPSCs. Panel (a) shows a Venn diagram of the genes identified as DE by DESeq, edgeR or EBSeq using HTSeq for quantification. Panel (b) shows a Venn diagram of the genes identified by Cuffdiff2 and EBSeq using Cufflinks-processed data. Panel (c) shows box plots of Dixon’s Q-statistics in three groups of genes—the 114 identified by DESeq, edgeR and EBSeq; the 161 identified by EBSeq but not DESeq or edgeR; and the 196 identified by edgeR but not DESeq or EBSeq. Panel (d) shows the FCRatios and Dixon’s Q-statistics of the genes identified exclusively by each method, but not the other four methods (in this panel, five methods are compared). Note that baySeq, Cuffdiff2, DESeq, edgeR and EBseq (via HTSeq) identify 34, 54, 127, 377 and 334 DE genes, respectively, at 5% FDR. Panel (e) shows box plots of each gene’s 75th percentile of expression for the three groups of genes defined in panel (c). Panel (f) shows the CDF of the 75th percentile of expression among the 34, 54, 127, 377 and 334 DE genes identified by each method

References

    1. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. - PMC - PubMed
    1. Anders S, et al. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012;22:2008–2017. - PMC - PubMed
    1. Bock C, et al. Reference maps of human ES and IPS cell variation enable high-throughput characterization of pluripotent cell lines. Cell. 2011;144:439–452. - PMC - PubMed
    1. Bullard JH, et al. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics. 2010;11:94. - PMC - PubMed
    1. Consortium M. The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 2006;24:1151–1161. - PMC - PubMed

Publication types