BMC Bioinformatics. 2018 Sep 14;19(1):323.

doi: 10.1186/s12859-018-2356-2.

Using controls to limit false discovery in the era of big data

Matthew M Parks¹, Benjamin J Raphael², Charles E Lawrence³,⁴

Affiliations

¹ Department of Physiology and Biophysics, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA.
² Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ, 08540, USA.
³ Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Providence, RI, 02912, USA. charles_lawrence@brown.edu.
⁴ Division of Applied Mathematics, Brown University, 182 George Street, Providence, RI, 02912, USA. charles_lawrence@brown.edu.

PMID: 30217148
PMCID: PMC6137876
DOI: 10.1186/s12859-018-2356-2

Matthew M Parks et al. BMC Bioinformatics. 2018.

Abstract

Background: Procedures for controlling the false discovery rate (FDR) are widely applied as a solution to the multiple comparisons problem of high-dimensional statistics. Current FDR-controlling procedures require accurately calculated p-values and rely on extrapolation into the unknown and unobserved tails of the null distribution. Both of these intermediate steps are challenging and can compromise the reliability of the results.

Results: We present a general method for controlling the FDR that capitalizes on the large amount of control data often found in big data studies to avoid these frequently problematic intermediate steps. The method utilizes control data to empirically construct the distribution of the test statistic under the null hypothesis and directly compares this distribution to the empirical distribution of the test data. By not relying on p-values, our control data-based empirical FDR procedure more closely follows the foundational principles of the scientific method: that inference is drawn by comparing test data to control data. The method is demonstrated through application to a problem in structural genomics.

Conclusions: The method described here provides a general statistical framework for controlling the FDR that is specifically tailored for the big data setting. By relying on empirically constructed distributions and control data, it forgoes potentially problematic modeling steps and extrapolation into the unknown tails of the null distribution. This procedure is broadly applicable insofar as controlled experiments or internal negative controls are available, as is increasingly common in the big data setting.
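As a rough illustration of the comparison described in the Results, the sketch below estimates the FDR at each candidate threshold as the ratio of the control tail fraction to the test tail fraction, and picks the smallest threshold whose estimate is at most a target level. This is a minimal sketch under stated assumptions, not the authors' published procedure: the function names `empirical_fdr_curve` and `select_threshold`, the one-sided (right-tail) rejection region, the null proportion `pi0` defaulting to 1, and the synthetic data in the demo are all illustrative choices that do not come from the abstract.

```python
import numpy as np


def empirical_fdr_curve(test, control, pi0=1.0):
    """Estimated FDR for rejecting every test value >= t, for each observed t.

    Assumption: larger statistics are more extreme (right-tailed test).
    pi0 is the assumed proportion of true nulls; 1.0 is the conservative choice.
    """
    test_sorted = np.sort(np.asarray(test, dtype=float))
    control_sorted = np.sort(np.asarray(control, dtype=float))
    thresholds = test_sorted  # candidate thresholds are the observed test values
    n_test, n_ctrl = test_sorted.size, control_sorted.size

    # side="left" returns the index of the first element >= t,
    # so n - index counts the values that are >= t.
    test_tail = (n_test - np.searchsorted(test_sorted, thresholds, side="left")) / n_test
    ctrl_tail = (n_ctrl - np.searchsorted(control_sorted, thresholds, side="left")) / n_ctrl

    # test_tail is never zero because each threshold is itself a test value.
    fdr_hat = np.minimum(1.0, pi0 * ctrl_tail / test_tail)
    return thresholds, fdr_hat


def select_threshold(test, control, alpha=0.05, pi0=1.0):
    """Smallest threshold whose estimated FDR is <= alpha, or None if there is none."""
    thresholds, fdr_hat = empirical_fdr_curve(test, control, pi0)
    ok = np.flatnonzero(fdr_hat <= alpha)
    if ok.size == 0:
        return None
    return float(thresholds[ok[0]])


if __name__ == "__main__":
    # Synthetic data for illustration only: controls are pure null,
    # the test set is 90% null plus 10% shifted signal.
    rng = np.random.default_rng(0)
    control = rng.normal(0.0, 1.0, size=50_000)
    test = np.concatenate([rng.normal(0.0, 1.0, 9_000), rng.normal(4.0, 1.0, 1_000)])
    t = select_threshold(test, control, alpha=0.05)
    if t is None:
        print("no threshold meets the target FDR")
    else:
        print("threshold:", t, "discoveries:", int(np.sum(test >= t)))
```

Note that when no control value exceeds a threshold, the estimate is exactly zero, which is optimistic; the resolution of the estimate is limited by the number of controls, which is why a large control set matters for this kind of comparison.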

Keywords: Big data; False discovery rate (FDR); High dimensional inference; Hypothesis testing.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Empirical probability density functions f and f_c for the observed read depth ratios for the test and control data, respectively. Both density functions were obtained by kernel density estimation with a Normal kernel. The vertical black line indicates y = 1.

**Fig. 2**
Probability density functions for the test distribution, mode-shifted control distribution, and 1-, 2-, 3-, and 4-component Gaussian mixtures fitted to the central region of the test data. The vertical dotted black line indicates the mode of the test data. The vertical solid black lines indicate the boundaries of the half-height region.
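The kernel density estimates mentioned in the Fig. 1 caption can be sketched as follows. This is a minimal illustration of kernel density estimation with a Normal kernel, not a reconstruction of the authors' code: the captions do not state a bandwidth, so Silverman's rule of thumb is assumed here, and the function names and the synthetic "read depth ratio" samples are hypothetical.

```python
import numpy as np


def silverman_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth (an assumed choice, not from the paper)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * spread * x.size ** (-0.2)


def gaussian_kde_on_grid(samples, grid, bandwidth=None):
    """Evaluate a Normal-kernel density estimate of `samples` at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    h = silverman_bandwidth(samples) if bandwidth is None else float(bandwidth)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over samples and rescale by the bandwidth.
    return kernel.mean(axis=1) / h


if __name__ == "__main__":
    # Synthetic ratios centered near 1, for illustration only.
    rng = np.random.default_rng(1)
    control_ratios = rng.normal(1.0, 0.10, size=2_000)
    test_ratios = rng.normal(1.0, 0.12, size=2_000)
    grid = np.linspace(0.0, 2.0, 401)
    f_c = gaussian_kde_on_grid(control_ratios, grid)
    f = gaussian_kde_on_grid(test_ratios, grid)
    print("control density peaks at", grid[np.argmax(f_c)])
    print("test density peaks at", grid[np.argmax(f)])
```

The pairwise grid-by-sample array keeps the sketch short but uses memory proportional to the product of the two sizes, so very large data sets would need chunking or a library implementation.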
