Proc Natl Acad Sci U S A. 2010 May 25;107(21):9546-51.

doi: 10.1073/pnas.0914005107. Epub 2010 May 11.

Independent filtering increases detection power for high-throughput experiments

Richard Bourgon¹, Robert Gentleman, Wolfgang Huber

PMID: 20460310
PMCID: PMC2906865
DOI: 10.1073/pnas.0914005107

Richard Bourgon et al. Proc Natl Acad Sci U S A. 2010.

Affiliation

¹ European Bioinformatics Institute, Cambridge CB10 1SD, UK.

Abstract

With high-dimensional data, variable-by-variable statistical testing is often used to select variables whose behavior differs across conditions. Such an approach requires adjustment for multiple testing, which can result in low statistical power. A two-stage approach that first filters variables by a criterion independent of the test statistic, and then only tests variables which pass the filter, can provide higher power. We show that use of some filter/test statistic pairs presented in the literature may, however, lead to loss of type I error control. We describe other pairs which avoid this problem. In an application to microarray data, we found that gene-by-gene filtering by overall variance followed by a t-test increased the number of discoveries by 50%. We also show that this particular statistic pair induces a lower bound on fold-change among the set of discoveries. Independent filtering (using filter/test pairs that are independent under the null hypothesis but correlated under the alternative) is a general approach that can substantially increase the efficiency of experiments.
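The two-stage procedure described in the abstract can be sketched on simulated data. This is a minimal illustration, not the authors' code: the simulation settings (5,000 genes, 5 samples per group, 10% differentially expressed, unit variance, a mean shift of 2), the function names `bh_adjust`, `count_discoveries`, and `run_demo`, and the choice of the Benjamini-Hochberg adjustment are all assumptions made for the example. It compares no filtering, overall variance filtering, and a random filter that drops half of the genes.

```python
import numpy as np
from scipy import stats


def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment method)."""
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj


def count_discoveries(x, n1, keep, alpha=0.1):
    """Stage two: standard t-test on the genes that passed the filter."""
    sub = x[keep]
    _, p = stats.ttest_ind(sub[:, :n1], sub[:, n1:], axis=1)
    return int(np.sum(bh_adjust(p) <= alpha))


def run_demo(m=5000, n1=5, n2=5, frac_de=0.1, delta=2.0, theta=0.5, seed=0):
    """Simulate expression data and compare three stage-one filters."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(m, n1 + n2))
    n_de = int(frac_de * m)
    x[:n_de, n1:] += delta  # hypothetical effect in the second group

    # Overall variance ignores the group labels, so it is independent of
    # the t-statistic for genes with no true difference.
    s2 = x.var(axis=1, ddof=1)
    keep_all = np.ones(m, dtype=bool)
    keep_var = s2 > np.quantile(s2, theta)  # remove a fraction theta
    keep_rand = np.zeros(m, dtype=bool)
    keep_rand[rng.choice(m, size=m // 2, replace=False)] = True

    return {
        "none": count_discoveries(x, n1, keep_all),
        "variance": count_discoveries(x, n1, keep_var),
        "random": count_discoveries(x, n1, keep_rand),
    }


results = run_demo()
print(results)
```

Under these assumed settings the variance filter should yield more discoveries than no filtering, while the random filter should yield fewer, which mirrors the qualitative pattern reported for the ALL data. The exact counts depend on the simulation parameters and seed.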

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
Power assessment of filtering applied to the ALL data (12,625 genes). R, the number of genes called differentially expressed between the two cytogenetic groups, was computed for different stage-one filters, filtering stringencies, and FDR-adjusted p-value cutoffs. In all cases, a standard t-statistic (T) was used in stage two, and adjustment for multiple testing was by the method of ref. . Similar results were obtained with other adjustment procedures. Filter cutoffs were selected so that a fraction θ of genes were removed. A random filter, which arbitrarily selected and removed one half of the genes, was also considered. (A) Filtering on overall variance (S²). At all FDR cutoffs, increasingly stringent filtering increased total discoveries, even though fewer genes were tested. This effect was not, however, due to the reduction in the number of hypotheses alone: filtering half of the genes at random reduced total discoveries by approximately one half, as expected. (B) Filtering on overall mean, on the other hand, produced a small increase in rejections at low stringency, but then substantially reduced rejections, and thus power, at higher stringencies. (C) Effect of increasing filtering stringency for fixed adjusted p-value cutoff α = 0.1. At higher stringencies, both filters eventually reduced rejections. For the ALL data, this effect occurred much more quickly for the overall mean filter. With the overall variance filter, the number of rejections increased by up to 50%. (D) Filtering on overall mean (θ = 0.5 is shown) removed many significant |T_i| (e.g., |T_i| > 4), while filtering on overall variance retained them.

**Fig. 2.**
(A) The null distribution of the test statistic is affected by filtering on the maximum of within-class averages. In this example, all genes have a known common variance, the filter statistic is the maximum of within-class means, and the test statistic is a z-score. The unconditional distribution of the test statistic for nondifferentially expressed genes is a standard normal. Its conditional null distribution, given that the filter statistic (U^I) exceeds a certain threshold (u^∗), however, has much heavier tails. Using the unconditional null distribution to compute p-values after filtering would therefore be inappropriate. See *SI Text* for full details. (B and C) Overall variance filtering and the *limma* moderated t-statistic. Data for 5,000 nondifferentially expressed genes were generated according to the *limma* Bayesian model (n₁ = n₂ = 2, d₀ = 3, ). (B) Filtering on overall variance (θ = 0.5) preferentially eliminated genes with small s_i, causing gene-level standard deviation estimates for genes passing the filter (histogram) to be shifted relative to the unconditional distribution used to generate the data (*dashed curve*). The *limma* inverse χ² model was unable to provide a good fit (*solid curve*) to the s_i passing the filter. (C) The fitting problems lead to a posterior degrees-of-freedom estimate of ∞. As a consequence, p-values were computed using an inappropriate null distribution, producing too many true-null p-values close to zero, i.e., loss of type I error rate control. An analogous analysis comparing biological replicates from the ALL study—so that real array data were used but no gene was expected to exhibit significant differential expression—yielded qualitatively similar results.
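
The effect described in panel A can be reproduced with a small simulation. This is a sketch under simplifying assumptions that are not taken from the paper: every gene has true mean 0 and known variance 1 in both classes, each class has 3 samples, and the filter threshold is one standard error of a class mean. The function name `tail_probabilities` and its parameters are hypothetical.

```python
import numpy as np


def tail_probabilities(n_per_class=3, u_star_in_se=1.0, z_cut=2.0,
                       n_genes=200_000, seed=0):
    """Compare P(|Z| > z_cut) for null genes before and after filtering
    on the maximum of the two within-class means."""
    rng = np.random.default_rng(seed)
    se = 1.0 / np.sqrt(n_per_class)  # standard error of a class mean
    mean1 = rng.normal(0.0, se, n_genes)
    mean2 = rng.normal(0.0, se, n_genes)

    # z-score for the difference in means with known unit variance.
    z = (mean1 - mean2) / np.sqrt(2.0 / n_per_class)

    # Filter statistic: maximum of the within-class means.
    passed = np.maximum(mean1, mean2) > u_star_in_se * se

    unconditional = np.mean(np.abs(z) > z_cut)
    conditional = np.mean(np.abs(z[passed]) > z_cut)
    return unconditional, conditional


uncond, cond = tail_probabilities()
print(f"P(|Z| > 2) for all null genes:        {uncond:.4f}")
print(f"P(|Z| > 2) for null genes that pass:  {cond:.4f}")
```

The conditional tail probability comes out well above the standard normal value, so p-values computed from the unconditional null would be too small for genes that pass this filter.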

**Fig. 3.**
Overall variance (or equivalently, overall standard deviation) filtering example, using the ALL data, comparing 3 BCR/ABL and 3 control subjects. (A) Volcano plot contrasting log-fold change (D_i) with p-value, as obtained from a standard t-test. The impact of filtering is shown: overall variance filtering is equivalent to requiring a minimum fold change, where the bound increases as the p-value decreases. For n₁ = n₂ = 3, the induced fold change bound was essentially constant for p_i < 10⁻² (*dashed line*). As a consequence, the two-stage approach (removing the 50% of genes with lowest overall variance and then applying a standard t-test to what remains) was approximately equivalent to applying a t-test to the full dataset but only rejecting null hypotheses when p_i < 0.01 and the fold change exceeded 1.35× (0.43 on the log₂ scale). (B) The rate at which the induced fold-change bound converges to its limit depends on sample size. For small samples, this bound, D^∗(p), is essentially a constant multiple of the cutoff on overall standard deviation (u^∗) for all p-values of practical interest; for larger sample sizes, however, genes producing more significant p-values are also subject to a more stringent bound.
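
The link between the overall standard deviation cutoff and the fold-change bound can be checked numerically. The sketch below does not reproduce the paper's derivation; it relies on the standard two-group sum-of-squares decomposition, (n − 1)S² = (n − 2)s_p² + (n₁n₂/n)D², which together with the definition of the pooled t-statistic gives |D| = S · sqrt(n(n − 1)/(n₁n₂)) · |T| / sqrt(T² + n − 2). Requiring S ≥ u^∗ then bounds |D| from below. The bound is written as a function of |T| rather than of the p-value, and the function names are hypothetical.

```python
import numpy as np


def fold_change_bound(t_abs, u_star, n1, n2):
    """Lower bound on |D| implied by S >= u_star at a given |T|."""
    n = n1 + n2
    limit = u_star * np.sqrt(n * (n - 1) / (n1 * n2))  # value as |T| grows
    return limit * t_abs / np.sqrt(t_abs**2 + n - 2)


def check_identity(n1=3, n2=3, n_genes=1000, seed=0):
    """Confirm on random data that |D| equals the expression in S and T."""
    rng = np.random.default_rng(seed)
    n = n1 + n2
    x = rng.normal(size=(n_genes, n))
    g1, g2 = x[:, :n1], x[:, n1:]

    d = g1.mean(axis=1) - g2.mean(axis=1)          # difference in means
    s = x.std(axis=1, ddof=1)                      # overall standard deviation
    sp2 = ((n1 - 1) * g1.var(axis=1, ddof=1)
           + (n2 - 1) * g2.var(axis=1, ddof=1)) / (n - 2)
    t = d / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))   # pooled t-statistic

    # With u_star replaced by the observed S, the bound holds with equality.
    return np.allclose(np.abs(d), fold_change_bound(np.abs(t), s, n1, n2))


print(check_identity())
for t_abs in (2.0, 4.0, 8.0, 16.0):
    print(t_abs, fold_change_bound(t_abs, u_star=1.0, n1=3, n2=3))
```

For n₁ = n₂ = 3 the limiting multiple of u^∗ is sqrt(30/9), and the bound approaches it quickly as |T| grows, which is consistent with the nearly constant bound described for small samples. For larger n the term n − 2 in the denominator matters over a wider range of |T|, so the bound keeps tightening as significance increases.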

Comment in

Filtering data from high-throughput experiments based on measurement reliability.
Talloen W, Hochreiter S, Bijnens L, Kasim A, Shkedy Z, Amaratunga D, Göhlmann H. Proc Natl Acad Sci U S A. 2010 Nov 16;107(46):E173-4; author reply E175. doi: 10.1073/pnas.1010604107. Epub 2010 Nov 8. PMID: 21059952.

References

1. Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol. 2000;7:819–837.
2. Lönnstedt I, Speed TP. Replicated microarray data. Stat Sinica. 2002;12:31–46.
3. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98:5116–5121.
4. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article 3.
5. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–2887.
