The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research

Valentin Amrhein et al. PeerJ. 2017 Jul 7;5:e3544. doi: 10.7717/peerj.3544. eCollection 2017.

Abstract

The widespread use of 'statistical significance' as a license for claiming a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values say little about the reliability of research, because they are themselves hardly replicable even when an alternative hypothesis is true. Significance (p ≤ 0.05) is likewise hardly replicable: even at a good statistical power of 80%, two studies of a true effect will be 'conflicting', meaning that one is significant and the other is not, in one third of cases. A replication therefore cannot be interpreted as having failed merely because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions about the replicability and practical importance of a finding can be drawn only from cumulative evidence across multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that, with anything but ideal statistical power, significant effect sizes will be biased upwards; interpreting inflated significant results while ignoring nonsignificant results therefore leads to wrong conclusions. Yet current incentives to hunt for significance encourage selective reporting and publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis; they cannot be interpreted as supporting it, and concluding from them that 'there is no effect' is a mistake. Information on the possible true effect sizes that are compatible with the data must be obtained from the point estimate (e.g., a sample average) and from the interval estimate (e.g., a confidence interval). We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should instead be made more stringent, that sample sizes could decrease, or that p-values should be abandoned altogether. We conclude that, whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
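The one-third figure follows from simple binomial arithmetic: if each study independently reaches significance with probability 0.8, then exactly one of the two does so with probability 2 × 0.8 × 0.2 = 0.32. A minimal Monte Carlo sketch in Python (not from the paper; the sample size, effect size, and alpha level are illustrative assumptions) reproduces this:

    # Sketch: at roughly 80% power, two studies of the same true effect 'conflict'
    # (one significant, one not) about a third of the time. Illustrative only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, alpha = 30, 0.05                # assumed sample size and significance threshold
    d = 2.80 / np.sqrt(n)              # effect giving ~80% power: (z_0.975 + z_0.80) / sqrt(n)

    def significant(sample):
        # Two-sided one-sample t-test against 0 at the chosen threshold.
        return stats.ttest_1samp(sample, 0.0).pvalue <= alpha

    trials = 20_000
    conflicts = sum(
        significant(rng.normal(d, 1, n)) != significant(rng.normal(d, 1, n))
        for _ in range(trials)
    )
    print(f"simulated conflict rate: {conflicts / trials:.3f}")  # analytically 2*0.8*0.2 = 0.32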

Keywords: Graded evidence; Nonsignificant; P-value; Publication bias; Replicability; Reproducibility; Significant; Threshold; Truth inflation; Winner’s curse.


Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. Averages and 95% confidence intervals from five simulated studies.
P-values are from one-sample t-tests, and sample sizes are n = 30 each (adapted from Korner-Nievergelt & Hüppop, 2016). Results A and E are relatively incompatible with the null hypothesis that the true effect size (the population average) is zero. Note that the p-values in A versus D, or B versus C, are very different, although the estimates have the same precision and are thus equally reliable. Note also that the p-value in A is smaller than in E although variation is larger in A, because the point estimate in A is farther from the null value. If we define effect sizes between 1 and 2 as scientifically or practically important, result A is strong evidence that the effect is important, and result E is clear evidence that the effect is not important, because the small effect size was estimated with high precision. Result B is relatively clear evidence that the effect is not strongly negative and could be important, given that a value close to the center of a 95% confidence interval is about seven times as likely to be the true population parameter as is a value near a limit of the interval (Cumming, 2014). Result C is only very weak evidence against the null hypothesis, and because plausibility for the parameter is greatest near the point estimate, we may say that the true population average could be relatively close to zero. However, result C also shows why a large p-value cannot be used to ‘confirm’ or ‘support’ the null hypothesis. First, the point estimate is larger than zero, so the null hypothesis of a zero effect is not the hypothesis most compatible with the data. Second, the confidence interval shows possible population averages that would be consistent with the data and that could be strongly negative, or positive and even practically important. Because of this large uncertainty, which covers qualitatively very different parameter values, we should refrain from drawing conclusions about practical consequences based on result C. In contrast, result D is only weak evidence against the null hypothesis, but precision is sufficient to infer that possible parameter values are not far from the null and that the effect is not practically important. Result C is thus a case in which the large p-value and the wide confidence interval say roughly the same thing: inference is difficult. Results B and D can be meaningfully interpreted even though their p-values are relatively large.
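The following Python sketch (illustrative only, not the authors' code; the study means and standard deviations below are assumptions, not the values underlying the figure) shows how such a panel of simulated studies can be generated and summarized by point estimate, 95% confidence interval, and p-value:

    # Simulate five studies of n = 30 each and report the mean, 95% CI, and
    # p-value from a one-sample t-test against zero, as in a Figure 1-style display.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n = 30
    true_means = [1.5, 0.5, 0.3, 0.1, 0.2]   # hypothetical population means
    true_sds = [1.5, 2.0, 2.0, 0.5, 0.4]     # hypothetical population SDs

    for label, mu, sd in zip("ABCDE", true_means, true_sds):
        x = rng.normal(mu, sd, n)
        mean, sem = x.mean(), stats.sem(x)
        lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
        p = stats.ttest_1samp(x, 0.0).pvalue
        print(f"Study {label}: mean = {mean:5.2f}, 95% CI [{lo:5.2f}, {hi:5.2f}], p = {p:.4f}")

Comparing the printed p-values with the printed intervals makes the caption's point concrete: studies with similar precision can yield very different p-values, and a wide interval warns against reading a large p-value as support for the null.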

