Review. Nat Neurosci. 2020 Jul;23(7):788-799. doi: 10.1038/s41593-020-0660-4. Epub 2020 Jun 29.

Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence


Christian Keysers et al. Nat Neurosci. 2020 Jul.

Erratum in

Abstract

Most neuroscientists would agree that for brain research to progress, we have to know which experimental manipulations have no effect as much as we must identify those that do have an effect. The dominant statistical approaches used in neuroscience rely on P values and can establish the latter but not the former. This makes non-significant findings difficult to interpret: do they support the null hypothesis or are they simply not informative? Here we show how Bayesian hypothesis testing can be used in neuroscience studies to establish both whether there is evidence of absence and whether there is absence of evidence. Through simple tutorial-style examples of Bayesian t-tests and ANOVA using the open-source project JASP, this article aims to empower neuroscientists to use this approach to provide compelling and rigorous evidence for the absence of an effect.


Conflict of interest statement

Competing Interests: EJ Wagenmakers declares that he coordinates the development of the open-source software package JASP (jasp-stats.org), a non-commercial, publicly funded effort to make Bayesian statistics accessible to a broader group of researchers and students. CK and VG declare no competing interests.

Figures

Extended Data Figure 1. The relationship between BF, p, and effect-size values.
(a) This log-log plot shows the BF+0 values corresponding to familiar critical p values for a one-tailed one-sample t-test at different sample sizes (n). The curves show the BF+0 values obtained in a Bayesian t-test based on the critical t-value that yields p=0.05 (yellow), p=0.01 (green), p=0.005 and p=0.001 (both black). The yellow dashed horizontal line indicates BF+0=3, the bound for moderate evidence considered by Jeffreys to be similar to p=0.05; the green dashed line indicates BF+0=10, the bound for strong evidence considered similar to p=0.01. The two black dashed lines mark BF+0=1, i.e., the line of no evidence, and BF+0=1/3, the bound for moderate evidence of absence. The background gradient reminds the reader that the BF reference values of 3 and 10 should not be considered hard bounds. Instead, the BF should be interpreted as a continuous value, with values diverging more from 1 supporting stronger conclusions. This panel makes two points. First, there is no simple equivalence between p and BF that holds over all sample sizes. This is because in a frequentist t-test, the observed effect size (d) sufficient to generate a specific p value decreases with √n more rapidly than for the BF. As a result, at large n, very small effect sizes generate 'significant' t-tests: at n=1000, the critical t-value for a one-tailed p=0.05 is 1.65, corresponding to d = 1.65/√n ≈ 0.05. For the BF, such a minuscule effect is about 4 times more likely under H0 than H+ (BF+0=0.26). Hence, for small sample sizes p and BF support similar conclusions (e.g., p=0.05 at n=4 corresponds to BF+0>3, supporting the same conclusion of evidence for an effect), but for large sample sizes the frequentist and Bayesian conclusions can diverge in the presence of very small effect sizes (e.g., p=0.05 at n=1000 corresponds to BF+0<1/3). Considering confidence or credible intervals of the effect size in addition to p or BF values helps interpret such cases.
Second, the fact that the dashed lines are above the curve of the same colour for all n>4 shows that BF+0=3 and BF+0=10 indeed protect against Type I errors in a frequentist sense, at least at p=0.05 or p=0.01, respectively. In other words, if BF+0>3 then p<0.05, and if BF+0>10 then p<0.01, but how much lower than 0.05 or 0.01 the exact p-value is depends on n. (b) BF+0 (left) and p (right) values as a function of measured effect and sample sizes. These panels illustrate the measured effect sizes necessary to provide evidence for an effect at different sample sizes in a one-sample one-tailed t-test using the BF vs. traditional p values. Each curve connects the results at different sample sizes for the specified value of d. The logarithmic BF and p scales are aligned so as to place BF=3 next to p=0.05, and BF=10 next to p=0.01.
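The claim that the critical effect size shrinks as 1/√n can be checked directly. The sketch below (assuming scipy is available; the function name critical_d is ours, not from the paper) computes the critical one-tailed t-value and the corresponding effect size d = t/√n for several sample sizes:

```python
import numpy as np
from scipy import stats

def critical_d(n, alpha=0.05):
    """Smallest effect size d reaching one-tailed significance at level alpha
    in a one-sample t-test with n observations (d = t_crit / sqrt(n))."""
    t_crit = stats.t.isf(alpha, df=n - 1)  # critical t for one-tailed alpha
    return t_crit / np.sqrt(n), t_crit

for n in (4, 30, 100, 1000):
    d, t_crit = critical_d(n)
    print(f"n={n:5d}  t_crit={t_crit:.3f}  d={d:.3f}")
```

At n=1000 this reproduces the caption's t≈1.65 and d≈0.05, a 'significant' effect that a Bayes factor would nonetheless count as evidence for H0.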
Extended Data Figure 2. Evidence for or against a factor in a Bayesian ANOVA.
A Bayesian ANOVA is a form of model comparison. This figure illustrates how the Bayes factor can provide evidence for a simpler model by concentrating its predictions on a single parameter value. This example ANOVA determines whether or not the data D depend on the value of the factor Group by comparing the Null Model D=0*Group (left) against the Group Model D=β*Group, with a Cauchy prior on β (right). The top row illustrates the prior probability attributed to the different values of β under the two competing models. Note how both models include β = 0 as a possibility, but given that the probability values must integrate to 1 over the entire β space, for the Null Model p(β = 0) = 1, while for the Group Model the probability is distributed across all plausible alternative values. The middle row shows the predicted t-values based on these priors, where t represents the difference between the data from the two groups as in Figure 2. Note how these predictions are more peaked for the Null than for the Group Model. The bottom row compares the predicted probability of finding particular t-values under the two models, and shows how values close to zero (i.e., small or no difference between the groups) are predicted more often by the Null than by the Group Model, while the opposite is true for large t-values. If conducting the experiment reveals a measured t-value close to zero, the Bayes factor for including the factor Group would be substantially below 1, providing evidence for the absence of an effect of Group, while the inverse would be true for high t-values.
Box 1 Figure
A probability wheel representation of a Bayes factor of 10 in favor of H0. The circle has area 1.
Figure 1. P-value of a t-test and BF+0 as a function of effect size and sample size.
(A) Each histogram shows the distribution of p-values obtained from 1000 one-tailed one-sample t-tests based on n random numbers drawn from a normal distribution with mean μ and sd=1. To differentiate levels of significance, the first bin was split into multiple bins based on standard critical values. Note how, when there is an effect in the data (i.e., μ>0, all but the leftmost column), increasing sample size (downwards) or effect size (rightwards) leads to a leftwards shift of the distribution: more evidence for an effect leads to lower p-values. In this case, p-values<0.05 are considered hits and are shown in green, while p>0.05 are considered misses and shown in red. However, somewhat counterintuitively, the converse does not hold: in the absence of an effect (μ=0, leftmost column), increasing sample size does not lead to a rightward shift (increase) of the p-values. Instead, the distribution is completely flat, with all p-values equally likely (note that the distribution seems to thin out below 0.05, but this is because we subdivided the leftmost bin into several bins to resolve levels of significance). In this case, p<0.05 are false alarms, shown in red, and p>0.05 are correct rejections, shown in green. P-values are thus not a symmetrical instrument: cases with much evidence for H1 (high effect size and sample size) give us near certainty of finding a very low p-value, whereas cases with much evidence for H0 (e.g., μ=0 with n=100) do not make p-values close to 1 highly likely; instead, any p-value remains as likely as any other. (B) Distribution of BF+0 values (using r=√2/2 for the effect size prior Cauchy width) obtained from 1000 t-tests based on n random numbers drawn from a normal distribution N(μ,1) with mean μ and sd=1. Each histogram has the same bounds, specified below the graphs, representing conventional limits for moderate and strong evidence.
When an effect is absent (μ=0, leftmost column), evidence of absence (green bars and percentages, BF+0<1/3) increases with increasing sample size, and the false alarm rate is well controlled. When an effect is present (μ>0), evidence for a positive effect (BF+0>3, green bars and green percentages) increases with sample size and effect size, and misses (BF+0<3, red bars and red percentages) are rare (μ=0.5) or absent (μ=1.2 or 2). When percentages are not shown, they are 0% (red) or 100% (green). Data can be found at https://osf.io/md9kp/.
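The simulation behind panel A is easy to reproduce. This is our own minimal sketch (assuming numpy and scipy; the sample sizes and seed are arbitrary choices): it draws N(μ,1) samples and collects one-tailed p-values, showing the flat distribution under μ=0 and the pile-up near 0 when an effect exists.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_tailed_p(mu, n, n_sims=1000):
    """One-tailed p-values from n_sims one-sample t-tests on N(mu, 1) samples."""
    x = rng.normal(mu, 1.0, size=(n_sims, n))
    t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))
    return stats.t.sf(t, df=n - 1)  # survival function: P(T > t)

p_null = one_tailed_p(0.0, 100)  # no effect: p is uniform on [0, 1]
p_alt = one_tailed_p(0.5, 100)   # real effect: p concentrates near 0
```

Under μ=0 roughly 5% of p-values fall below 0.05 no matter how large n grows; only under μ>0 does the distribution shift toward zero.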
Figure 2. Hypothesis testing under the Bayesian framework.
(A) Two competing qualitative hypotheses are expressed in terms of a test parameter such as the population effect size δ. H+ represents a directional hypothesis of a positive effect size. (B) The two rival hypotheses are formulated in terms of specific probability distributions expressing the plausibility or probability of each effect size value. (C) Each effect size distribution is transformed into expected t-values. For H0, this is simply the standard t-distribution used in frequentist t-tests. For H+, for each hypothesized effect size, a non-central t-distribution with that effect size is multiplied by the hypothesized probability of that effect size in B. All of these weighted non-central t-distributions are then summed together to get the distribution in C. (D) After the data are obtained, the observed t-value (t1) can be interrogated in each distribution. Note that, in frequentist statistics, the p-value is derived from the H0 distribution alone, as the area where t > t1. (E) The likelihood of t1 under H0 and H+ is then compared to calculate the BF+0. Here we illustrate three examples of observed t-values. At an observed value of t1, the blue distribution is 4 times higher than the red; hence BF+0=4, and we have (moderate) evidence for H+. At an observed value of t2, where the two distributions are equal, BF+0=1 and we have absence of evidence. At an observed value of t3, the red distribution is 4 times higher than the blue; hence BF0+=4 and we have moderate evidence for H0. Here we illustrated one-tailed hypotheses, as these respect the directional nature of the underlying theory and yield more diagnostic predictions. More agnostic two-tailed hypotheses are calculated using the same principles, but the truncated blue distribution in B is then replaced with an untruncated, symmetric distribution, as shown by the dotted line in Figure 6B. Data can be found at https://osf.io/md9kp/.
Figure 3. Illustration of the data for the two simulated scenarios.
Muscimol1 data were simulated using μ=70 and σ=20 for all conditions (imposing a floor of 0 and a ceiling of 100), except ShockObs (in blue) under Muscimol, which was simulated using μ=40. Muscimol2 data were simulated using the same parameters except for CS (in orange) under Muscimol, which had μ=65 and σ=40. Based on these data we should find evidence for H+: Saline>Muscimol in all cases for ShockObs. For CS (orange), Muscimol1 should reveal evidence for H0 (evidence of absence), given that data were drawn from the same μ=70, σ=20 distributions. For Muscimol2, CS was drawn from different distributions for saline and muscimol, but with n=20 it might be hard to adjudicate the difference, and we might thus expect absence of evidence. Plots are violin plots, with the gray bar showing the middle two quartiles. Data can be found at https://osf.io/md9kp/.
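The simulation recipe in the caption can be reproduced in a few lines of numpy (a sketch; the seed and variable names are ours): draw each condition from a normal distribution and impose the 0-100 floor and ceiling.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_condition(mu, sigma=20, n=20):
    """Draw n scores from N(mu, sigma), clipped to the 0-100 range."""
    return np.clip(rng.normal(mu, sigma, n), 0.0, 100.0)

# Muscimol1 scenario: ShockObs drops under muscimol, CS is unchanged
shockobs = {"Saline": simulate_condition(70), "Muscimol": simulate_condition(40)}
cs = {"Saline": simulate_condition(70), "Muscimol": simulate_condition(70)}
```

Because both CS conditions come from the same distribution, a Bayesian t-test on these data should accumulate evidence for H0 as n grows.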
Figure 4. Screenshot from the Bayesian Independent Samples T-Test in JASP.
The top right shows the Bayes factor for the two variables, followed by the inferential plot showing the credible interval of the effect size and the sequential analysis. The inferential plots shown on the right will be discussed in sections 4 and 5. Data can be found at https://osf.io/md9kp/.
Figure 5. Screenshot of the Bayesian repeated measures ANOVA of Muscimol1.
The muscimol1.jasp analysis file can be downloaded at https://osf.io/md9kp/.
Figure 6. Further outputs for the Bayesian t-test on Muscimol1.csv.
(A) Clicking the option Bayes Factor Robustness Check plots, for each variable (ShockObs on the left and CS on the right), the BF as a function of the effect size prior. The user prior (gray) is by default set at Cauchy scale 0.707, as recommended. The wide and ultrawide priors are flatter priors that are sometimes used, especially when the goal is parameter estimation. As can be seen, there is extreme evidence for H1 in ShockObs across all but the smallest priors (i.e., the gray, black and white dots all have BF+0>160), and there is moderate evidence for H0 for all but the smallest priors for CS (most BF0+>4.5). The interpretation of the data thus does not depend on the choice of prior scale within a reasonable range. (B) Priors and posteriors for ShockObs and CS, together with the median and credible interval of the effect size. Results are shown for a one-tailed prior (top row), often more suited for hypothesis testing, and a two-tailed prior (bottom row), more suited for parameter estimation. (C) Accumulation of evidence with increasing sample size using the 'Sequential analysis' option. Data can be found at https://osf.io/md9kp/.

References

    1. Benjamin DJ, et al. Redefine Statistical Significance. Nat Hum Behav. 2018;2:6–10.
    2. Dienes Z. Using Bayes to Get the Most out of Non-Significant Results. Front Psychol. 2014;5:781.
    3. Gallistel CR. The Importance of Proving the Null. Psychol Rev. 2009;116:439–453.
    4. Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G. Bayesian t Tests for Accepting and Rejecting the Null Hypothesis. Psychon Bull Rev. 2009;16:225–237.
    5. Love J, et al. JASP: Graphical statistical software for common statistical designs. J Stat Softw. 2019;88:1–17.
