Comparative Study
BMC Bioinformatics. 2005 Jun 29;6:165.
doi: 10.1186/1471-2105-6-165.

Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach

Jun Lu et al. BMC Bioinformatics. 2005.

Abstract

Background: In testing for differential gene expression across multiple serial analysis of gene expression (SAGE) libraries, it is critical to account for both between-library and within-library variation. Several methods have been proposed, including the t test, the tw test, and an overdispersed logistic regression approach. The merits of these tests, however, have not been fully evaluated, and it remains unclear whether further improvements can be made.

Results: In this article, we introduce an overdispersed log-linear model approach to analyzing SAGE data; we evaluate and compare its performance with three other tests: the two-sample t test, the tw test, and a test based on overdispersed logistic linear regression. Analyses of simulated and real datasets show that both the log-linear and logistic overdispersion methods generally perform better than the t and tw tests; the log-linear method is further found to perform better than the logistic method, showing equal or higher statistical power over a range of parameter values and with different data distributions.

Conclusion: Overdispersed log-linear models provide an attractive and reliable framework for analyzing SAGE experiments involving multiple libraries. For convenience, an implementation of this method is available through a user-friendly web interface at http://www.cbcb.duke.edu/sage.
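
To make the idea of a two-group log-linear test with overdispersion concrete, the following is a minimal sketch for a single tag. It is not the paper's estimator: it assumes a quasi-Poisson stand-in with a multiplicative dispersion factor (Var(y) = phi * E[y], floored at 1), which may differ from the variance model the authors use. The function name, the example counts, and the library sizes are all hypothetical.

```python
import math


def quasi_poisson_two_group(counts_a, sizes_a, counts_b, sizes_b):
    """Two-group log-linear fit for one tag with a dispersion estimate.

    Assumed model (illustration only, not necessarily the paper's):
        log E[y_i] = log(n_i) + b0 + b1 * x_i,   Var(y_i) = phi * E[y_i]
    where y_i is the tag count in library i, n_i is the library size,
    and x_i is 0 for group A and 1 for group B.

    Returns (b1, phi_hat, t_stat, df).
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs a nonzero total tag count")
    df = len(counts_a) + len(counts_b) - 2
    if df < 1:
        raise ValueError("need at least three libraries in total")

    # With only a group indicator, the fitted proportions are pooled ratios.
    p_a = total_a / sum(sizes_a)
    p_b = total_b / sum(sizes_b)
    b1 = math.log(p_b / p_a)  # log fold change, group B vs. group A

    # Pearson chi-square over all libraries, using fitted means n_i * p_group.
    chi2 = 0.0
    for counts, sizes, p in ((counts_a, sizes_a, p_a), (counts_b, sizes_b, p_b)):
        for y, n in zip(counts, sizes):
            mu = n * p
            chi2 += (y - mu) ** 2 / mu

    # Moment estimate of the dispersion, floored at 1 (no underdispersion).
    phi_hat = max(chi2 / df, 1.0)

    # Poisson variance of b1 is 1/total_a + 1/total_b; inflate it by phi_hat.
    se = math.sqrt(phi_hat * (1.0 / total_a + 1.0 / total_b))
    return b1, phi_hat, b1 / se, df


# Hypothetical tag counts in three libraries per group, 50,000 tags each.
b1, phi_hat, t_stat, df = quasi_poisson_two_group(
    [4, 16, 10], [50000] * 3, [10, 34, 16], [50000] * 3
)
print(f"log fold change={b1:.3f}, dispersion={phi_hat:.2f}, t={t_stat:.2f}, df={df}")
```

The t statistic would be compared against a t distribution with `df` degrees of freedom; the p-value step is left out because the Python standard library has no t distribution.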

Figures

Figure 1
Comparisons based on simulated data from the beta-binomial distribution. This figure shows the receiver operating characteristic (ROC) curves of the four tests applied to datasets generated from the beta-binomial distribution with various magnitudes of overdispersion (φ) (shown at the top of each graph). For a specific φ, 10,000 observations (tags) are simulated; 5,000 are generated under the assumption that pA = pB and the remaining 5,000 under pB = 2pA, where pA and pB are the mean proportions of the two groups and pA = 0.0002 (i.e. 10 out of 50,000). For figures generated under other conditions, see Additional file 1.
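
The simulation design in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes φ is parameterized so that the library-level proportion has variance φ·p(1 − p), that is φ = 1/(a + b + 1) for a Beta(a, b) distribution, and the value of φ and the number of libraries per group are hypothetical because the caption does not give them. Only pA = 0.0002, pB = 2pA, and the library size of 50,000 come from the caption.

```python
import random


def beta_params(mean, phi):
    """Beta(a, b) parameters with the given mean and Var = phi * mean * (1 - mean).

    Assumes phi = 1 / (a + b + 1); the paper's parameterization may differ.
    """
    if not (0.0 < mean < 1.0 and 0.0 < phi < 1.0):
        raise ValueError("mean and phi must lie strictly between 0 and 1")
    scale = 1.0 / phi - 1.0  # equals a + b
    return mean * scale, (1.0 - mean) * scale


def draw_binomial(rng, n, p):
    """Binomial(n, p) draw as a sum of n Bernoulli trials (standard library only)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_tag(rng, mean, phi, library_size, n_libraries):
    """Beta-binomial counts for one tag across the libraries of one group."""
    a, b = beta_params(mean, phi)
    return [
        draw_binomial(rng, library_size, rng.betavariate(a, b))
        for _ in range(n_libraries)
    ]


rng = random.Random(0)
p_a = 0.0002          # from the caption: 10 out of 50,000
library_size = 50000  # from the caption
phi = 1e-4            # hypothetical overdispersion value
n_libraries = 5       # hypothetical number of libraries per group

# One tag under the null (pA = pB) and one under the alternative (pB = 2 pA).
null_tag = (
    simulate_tag(rng, p_a, phi, library_size, n_libraries),
    simulate_tag(rng, p_a, phi, library_size, n_libraries),
)
alt_tag = (
    simulate_tag(rng, p_a, phi, library_size, n_libraries),
    simulate_tag(rng, 2 * p_a, phi, library_size, n_libraries),
)
print("null:", null_tag)
print("alternative:", alt_tag)
```

Repeating the two draws 5,000 times each would give the 10,000 simulated tags described in the caption; a faster binomial sampler would be preferable at that scale.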
Figure 2
Comparisons based on simulated data from the negative binomial distribution. The ROC curves of the four tests are based on datasets generated from the negative binomial distribution with various magnitudes of overdispersion (φ). The data are simulated by the same strategy as used in Figure 1, except that pB = 4pA. Note that the overdispersion parameter here is not directly comparable with that in Figure 1 (the parameter φ for the negative binomial is not directly related to that for the beta-binomial). For figures generated under other conditions, see Additional file 2.
Figure 3
Comparing p-values from the logit-t test and those from the log-t test. Of the top 100 tags (ranked according to p-values) identified by the logit-t test and by the log-t test, 82 are common to both, leaving 18 tags from each test that are not within the top 100 identified by the other. The p-values from both tests for these 36 remaining tags are plotted here. The circles represent the 18 tags in the top 100 by the logit-t test only, and the triangles the 18 in the top 100 by the log-t test only. While all the tags identified by the logit-t test also have reasonably low p-values according to the log-t test, the tags identified by the log-t test show a much wider range of p-values according to the logit-t test.
Figure 4
Plot of standardized residuals against estimated proportions. Standardized Pearson's residuals (y-axis) plotted vs. the proportion estimates (x-axis) for the two groups. The standardized Pearson's residuals are asymptotically distributed as a standard normal. The model fits of two tags (among the list of genes in Table 5) are shown here; the left is from the fit using the overdispersed logistic model and the right from the overdispersed log-linear model. A lower variance of residuals in the group (normal) with lower mean proportion is an indication of poor model fit.
Figure 5
The distribution of overdispersion estimates (φ̂). The estimates are from the overdispersed log-linear model fit to the pancreas data. Tags with an overdispersion estimate of 0 are not shown in the figure.
