Data splitting as a countermeasure against hypothesis fishing: with a case study of predictors for low back pain
- PMID: 18288577
- PMCID: PMC2270357
- DOI: 10.1007/s10654-008-9230-x
Abstract
There is growing concern in the scientific community that many published scientific findings may represent spurious patterns that are not reproducible in independent data sets. A reason for this is that significance levels or confidence intervals are often applied to secondary variables or sub-samples within the trial, in addition to the primary hypotheses (multiple hypotheses). This problem is likely to be extensive for population-based surveys, in which epidemiological hypotheses are derived after seeing the data set (hypothesis fishing). We recommend a data-splitting procedure to counteract this methodological problem, in which one part of the data set is used for identifying hypotheses, and the other is used for hypothesis testing. The procedure is similar to two-stage analysis of microarray data. We illustrate the process using a real data set related to predictors of low back pain at 14-year follow-up in a population initially free of low back pain. "Widespreadness" of pain (pain reported in several other places than the low back) was a statistically significant predictor, while smoking was not, despite its strong association with low back pain in the first half of the data set. We argue that the application of data splitting, in which an independent party handles the data set, will achieve for epidemiological surveys what pre-registration has done for clinical studies.
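The data-splitting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis. The data are synthetic, the names (`chi2_p_value`, `split_indices`, `data_splitting`, `strong_predictor`, `noise_*`) are hypothetical, and three choices are assumptions made only for illustration: a 50/50 random split, a 2x2 chi-square screen, and a Bonferroni correction in the confirmation half.

```python
import math
import random


def chi2_p_value(exposed, outcome):
    """P-value of a 2x2 chi-square test (1 df, no continuity correction)."""
    n = len(exposed)
    a = sum(1 for e, o in zip(exposed, outcome) if e and o)
    b = sum(1 for e, o in zip(exposed, outcome) if e and not o)
    c = sum(1 for e, o in zip(exposed, outcome) if not e and o)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a margin is empty, so there is no evidence of association
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def split_indices(n, rng, fraction=0.5):
    """Randomly partition range(n) into an exploration and a confirmation part."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * fraction)
    return idx[:cut], idx[cut:]


def data_splitting(predictors, outcome, rng, screen_alpha=0.05, test_alpha=0.05):
    """Identify hypotheses on one half, then test only those on the other half."""
    explore, confirm = split_indices(len(outcome), rng)

    def p_on(values, idx):
        return chi2_p_value([values[i] for i in idx], [outcome[i] for i in idx])

    # Stage 1 (hypothesis fishing is allowed here): screen every predictor.
    candidates = [name for name, values in predictors.items()
                  if p_on(values, explore) < screen_alpha]

    # Stage 2: test only the pre-selected hypotheses on untouched data,
    # Bonferroni-corrected for the number of candidates (an assumption here).
    threshold = test_alpha / max(len(candidates), 1)
    confirmed = [name for name in candidates
                 if p_on(predictors[name], confirm) < threshold]
    return candidates, confirmed


# Synthetic cohort: one genuine binary predictor and 20 pure-noise predictors.
rng = random.Random(0)
n = 2000
strong = [rng.random() < 0.5 for _ in range(n)]
outcome = [rng.random() < (0.5 if s else 0.2) for s in strong]
predictors = {"strong_predictor": strong}
for k in range(20):
    predictors[f"noise_{k}"] = [rng.random() < 0.5 for _ in range(n)]

candidates, confirmed = data_splitting(predictors, outcome, rng)
print("identified in exploration half:", candidates)
print("confirmed in held-out half:", confirmed)
```

A noise predictor can pass the screening stage by chance, much as smoking looked strongly associated in the first half of the real data set. It then has to survive an independent test on the held-out half. The price is that each stage sees only part of the sample.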
