Comment

Genome Biol. 2006;7(3):401.

doi: 10.1186/gb-2006-7-3-401. Epub 2006 Mar 22.

A reanalysis of a published Affymetrix GeneChip control dataset

Alan R Dabney, John D Storey

PMID: 16563185
PMCID: PMC1557755
DOI: 10.1186/gb-2006-7-3-401

Abstract

A response to "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset" by SE Choe, M Boutros, AM Michelson, GM Church and MS Halfon. Genome Biology 2005, 6:R16.


Figures

**Figure 1**
QQ plots of null p-values corresponding to null genes. A plot of the observed versus expected quantiles of the null genes' p-values is shown for each of the 10 best datasets. The observed trends indicate that the null genes' p-values are substantially smaller than they should be.
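As a rough illustration of the comparison this caption describes, the sketch below pairs sorted p-values with the quantiles expected under a uniform distribution. The function names, the plotting positions `(i - 0.5) / n`, and the simulated inputs are assumptions made for this example; they are not taken from the paper.

```python
import random

def uniform_qq_points(p_values):
    """Return (expected, observed) quantile pairs for a uniform QQ plot."""
    observed = sorted(p_values)
    n = len(observed)
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    expected = [(i - 0.5) / n for i in range(1, n + 1)]
    return list(zip(expected, observed))

def fraction_below_diagonal(points):
    """Share of points whose observed quantile is smaller than the expected one."""
    return sum(obs < exp for exp, obs in points) / len(points)

rng = random.Random(0)
# Hypothetical inputs: well-behaved null p-values versus p-values skewed toward zero.
uniform_p = [rng.random() for _ in range(2000)]
too_small_p = [rng.random() ** 2 for _ in range(2000)]

print(fraction_below_diagonal(uniform_qq_points(uniform_p)))
print(fraction_below_diagonal(uniform_qq_points(too_small_p)))
```

When null p-values are too small, nearly every point falls below the diagonal, which is the pattern the caption reports for the null genes.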

**Figure 2**
Histograms of null p-values from simulation representing the experimental design of Choe *et al*. [1]. The null p-values generated from the simulation as described in the text are shown. The dashed line represents the expected height of the bars assuming the null p-values are uniformly distributed. The null p-values are not uniformly distributed when only technical replicates are used.

**Figure 3**
Histograms of null p-values from simulation based on independent samples. The null p-values using three independently sampled individuals as described in the text are shown. The dashed line represents the expected height of the bars assuming the null p-values are uniformly distributed. The null p-values are uniformly distributed when biological replicates are used.
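The contrast between Figures 2 and 3 can be sketched with a toy simulation. This is not the authors' simulation, which is described in the full text; the two-level noise model, the standard deviations, and all names below are hypothetical. Instead of a full histogram, the sketch reports how often a null gene is rejected at a nominal 5% level, which corresponds to the leftmost region of the histogram.

```python
import random
import statistics

T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t with 4 degrees of freedom

def pooled_t(x, y):
    """Equal-variance two-sample t-statistic for equal-sized groups."""
    n = len(x)
    pooled_var = (statistics.variance(x) + statistics.variance(y)) / 2
    return (statistics.mean(x) - statistics.mean(y)) / (pooled_var * 2 / n) ** 0.5

def null_rejection_rate(shared_pool, n_genes=4000, sample_sd=0.3, tech_sd=0.3, seed=1):
    """Fraction of null genes with |t| above the 5% critical value (3 arrays per group)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_genes):
        groups = []
        for _group in range(2):
            if shared_pool:
                # One sample-level draw per group: the three arrays are technical
                # replicates, so sample-level variation is invisible within a group.
                pool = rng.gauss(0, sample_sd)
                levels = [pool] * 3
            else:
                # Independent samples: each array gets its own sample-level draw.
                levels = [rng.gauss(0, sample_sd) for _ in range(3)]
            groups.append([level + rng.gauss(0, tech_sd) for level in levels])
        if abs(pooled_t(*groups)) > T_CRIT_DF4:
            rejections += 1
    return rejections / n_genes

print("technical replicates only:", null_rejection_rate(shared_pool=True))
print("independent samples:      ", null_rejection_rate(shared_pool=False))
```

Under these assumptions the independent-sample design rejects close to 5% of null genes, while the technical-replicate design rejects far more, because the within-group variance underestimates the variation between groups.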

**Figure 4**
A plot of the true versus estimated q-values from simulations described in the text. The solid gray line shows the results averaged over 30 simulations when using a design similar to that of Choe *et al*. [1]. The solid black line is the analogous comparison when using three independent individuals. The dashed line represents equality; conservatively estimated q-values should fall beneath this line. The Choe *et al*. [1] design produces anti-conservative q-value estimates due to the incorrect underlying p-values, while the more statistically sound design produces conservative q-value estimates. The Monte Carlo variation of the q-value estimates is small enough that these conclusions are not affected.
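To make the notion of an estimated q-value concrete, here is a minimal sketch using the Benjamini-Hochberg step-up adjustment [3]. This is a simplification and not necessarily the estimator used in the paper: it fixes the proportion of true nulls at 1, which makes it a conservative special case of the q-value estimate in [4, 5]. The helper `false_discovery_proportion` and the example inputs are hypothetical; the helper shows how a realized false discovery proportion could be computed when the null genes are known, as they are in a control dataset.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg step-up adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q

def false_discovery_proportion(q_values, is_null, cutoff):
    """Realized share of true nulls among genes called significant at the cutoff."""
    called = [null for q, null in zip(q_values, is_null) if q <= cutoff]
    return sum(called) / len(called) if called else 0.0

# Hypothetical example: four genes, two of which are truly null.
p = [0.01, 0.04, 0.03, 0.5]
is_null = [False, True, False, True]
q = bh_q_values(p)
print(q)
print(false_discovery_proportion(q, is_null, cutoff=0.06))
```

If the underlying p-values are too small, the estimated q-values inherit that bias and the realized proportion can exceed the nominal cutoff, which is the anti-conservative behavior the caption describes.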

**Figure 5**
A detailed description of the Choe *et al*. [1] experiment. Individual PCR products **(a)** were pooled together **(b)** and converted to labeled cRNA **(c)**. Note that all mixing and labeling within each pool was performed at this stage, before splitting the pools into C and S samples. Therefore, relative concentrations of individual cRNA species are identical for all cRNAs in a given pool. **(d)** The labeled pools were then divided into the C and S samples. Poly(C) RNA (20 μg) was added to the C sample at this step to equalize the amount of nucleic acid present in each hybridization. **(e)** Each sample contained enough labeled cRNA for three hybridizations. Relative concentrations for each pool are shown in **(f)**. Note that the 1× ("null gene") pools 13-19 were combined together at step (b), before labeling at step (c), creating a single 1× pool before labeling and splitting. The '1×' concentration of RNA used for this pool was approximately 6× greater than the 1× concentrations of the other pools to reflect the greater number of individual RNAs (that is, so that the 1× concentrations of all RNAs were approximately equal).

**Figure 6**
Sample quantile plots for the p-values of the observed test statistics for the "null genes". The x-axes correspond to the expected quantiles for a uniform distribution and the y-axes correspond to the observed (sample) quantiles. **(a)** Sample quantile plots for the t-test p-values associated with the 150 preprocessing combinations described by Choe *et al*. [1]. Black lines correspond to the 10 best datasets and are consistent with the curves presented in Figure 1 of this correspondence. The red lines correspond to re-loessed datasets that were obtained using the same combinations of preprocessing steps as the original 10 sets with the exception that the invariant subsets consisted only of the 'present null' (present with fold change = 1) probe sets (versus both the 'present null' and 'empty null' probe sets used in [1]). The distribution of the p-values thus depends upon the choice of the invariant subset. **(b)** Sample quantile curves for dataset 10a. Solid lines correspond to the two-sided p-values and the dashed and dotted lines correspond to the p-values associated with the one-sided tests. Dabney and Storey's model does not account for the discrepancy in the one-sided p-values observed for this dataset, which is not manifest in the re-loessed data (red lines). Similar results are seen with datasets 10b, c, d and 9a, b, c, d. **(c)** As in (b) but showing sample quantile curves for dataset 10e; dataset 9e is similar. The p-value discrepancies are much less pronounced for these two datasets. Used with permission from [15].

**Figure 7**
Smoothed estimates of the quartiles as a function of signal intensity for the p-values and observed t-test statistics for **(a, c)** dataset 10a and **(b, d)** dataset 10e. Distributions of the null p-values and test statistics vary with intensity, although less so for the re-loessed datasets (red lines). Even though the medians for the t-statistics seem to be properly centered after re-loessing the data, the quartiles (and hence the variation) are greatly inflated and appear to be intensity dependent. The x-axes show the rankit of the log of the product of the expression means. The y-axes show the observed two-sided p-values (a, b) or the observed t-statistics (c, d). Solid and dashed gray lines indicate the theoretical medians and quartiles, respectively. Black curves correspond to the original datasets and red curves correspond to the re-loessed datasets. Used with permission from [15].

**Figure 8**
Smoothed estimates of the average rank of expression values and squared deviations (with respect to the appropriate group mean) of the three control replicates for the **(a, b)** original and **(c, d)** re-loessed datasets. The x-axes correspond to the rankit of the log of the product of the expression means. The y-axes correspond to the observed ranks and were calculated across all six samples. If the control (C) and spike-in (S) expression values are interchangeable, the average rank of the control values should be 3.5. (c) Re-loessing adequately re-centers the control expression values relative to the spike-in expression values. (d) Despite re-loessing, however, the ranks of the squared deviations for the control replicates remain below those of the spiked-in replicates, suggesting that the expression values for the control replicates are less variable than those for the spiked-in replicates. This difference appears to be intensity dependent. Used with permission from [15].


Comment on

Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset.
Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS. Genome Biol. 2005;6(2):R16. doi: 10.1186/gb-2005-6-2-r16. Epub 2005 Jan 28. PMID: 15693945.

References

1. Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 2005;6:R16. doi: 10.1186/gb-2005-6-2-r16.
2. Soric B. Statistical discoveries and effect-size estimation. J Am Stat Ass. 1989;84:608–610.
3. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Roy Stat Soc, Ser B. 1995;57:289–300.
4. Storey JD. A direct approach to false discovery rates. J Roy Stat Soc, Ser B. 2002;64:479–498. doi: 10.1111/1467-9868.00346.
5. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc Natl Acad Sci USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.

Grants and funding

R01 HG002913/HG/NHGRI NIH HHS/United States
