Meta-Analysis
Hum Brain Mapp. 2022 Sep;43(13):3987-3997.
doi: 10.1002/hbm.25898. Epub 2022 May 10.

Evaluation of thresholding methods for activation likelihood estimation meta-analysis via large-scale simulations

Lennart Frahm et al. Hum Brain Mapp. 2022 Sep.

Abstract

In recent neuroimaging studies, threshold-free cluster enhancement (TFCE) gained popularity as a sophisticated thresholding method for statistical inference. It was shown to feature higher sensitivity than the frequently used approach of controlling the cluster-level family-wise error (cFWE) and it does not require setting a cluster-forming threshold at voxel level. Here, we examined the applicability of TFCE to a widely used method for coordinate-based neuroimaging meta-analysis, Activation Likelihood Estimation (ALE), by means of large-scale simulations. We created over 200,000 artificial meta-analysis datasets by independently varying the total number of experiments included and the amount of spatial convergence across experiments. Next, we applied ALE to all datasets and compared the performance of TFCE to both voxel-level and cluster-level FWE correction approaches. All three multiple-comparison correction methods yielded valid results, with only about 5% of the significant clusters being based on spurious convergence, which corresponds to the nominal level the methods were controlling for. On average, TFCE's sensitivity was comparable to that of cFWE correction, but it was slightly worse for a subset of parameter combinations, even after TFCE parameter optimization. cFWE yielded the largest significant clusters, closely followed by TFCE, while voxel-level FWE correction yielded substantially smaller clusters, showcasing its high spatial specificity. Given that TFCE does not outperform the standard cFWE correction but is computationally much more expensive, we conclude that employing TFCE for ALE cannot be recommended to the general user.
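The simulation design described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the foci count, the gray-matter bounding box, the jitter, and the assumption that the 11 convergence levels per dataset size run from 0 to 10 activating experiments are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_dataset(n_total, n_activating, n_foci=8,
                     target=(0.0, 0.0, 0.0), jitter_sd=2.0):
    """Create one artificial meta-analysis dataset (illustrative only).

    Every experiment reports random noise foci; the first n_activating
    experiments additionally place one focus near the target location.
    """
    experiments = []
    for i in range(n_total):
        # Noise foci sampled uniformly from a stand-in gray-matter box.
        foci = rng.uniform(-60.0, 60.0, size=(n_foci, 3))
        if i < n_activating:
            foci[0] = np.asarray(target) + rng.normal(0.0, jitter_sd, size=3)
        experiments.append(foci)
    return experiments

# 31 dataset sizes (15-45 experiments) x 11 assumed convergence levels
# reproduces the 341 parameter combinations reported in the paper.
grid = [(n, k) for n in range(15, 46) for k in range(11)]
```

With 500 iterations per combination, as in the study, this grid yields 341 x 500 = 170,500 simulated meta-analyses.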

Keywords: FWE; family-wise error; multiple comparison correction; neuroimaging meta-analysis; significance thresholding; threshold-free cluster enhancement; cluster extent.


Conflict of interest statement

The authors declare no potential conflict of interest.

Figures

FIGURE 1
Simulation of an experiment. Two independent draws from the filtered BrainMap database were used to determine the sample size and the number of foci reported by the experiment. Next, we sampled the corresponding number of coordinates from a lenient gray‐matter mask. Last, the first coordinate was replaced by the true coordinate multiplied by a displacement factor; this last step was applied only if the experiment was one activating the target location.
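The per-experiment procedure in this caption can be sketched as follows. The two empirical distributions are stand-in ranges here (the study draws both values from the filtered BrainMap database), and the displacement-factor range is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two empirical BrainMap distributions (assumed ranges).
SAMPLE_SIZES = np.arange(10, 51)   # subjects per experiment
FOCI_COUNTS = np.arange(1, 21)     # reported foci per experiment

def simulate_experiment(activates_target, true_coord, mask_coords):
    """Simulate one experiment following the steps in the caption."""
    n_subjects = int(rng.choice(SAMPLE_SIZES))   # draw 1: sample size
    n_foci = int(rng.choice(FOCI_COUNTS))        # draw 2: number of foci
    # Sample that many coordinates from the gray-matter mask.
    foci = mask_coords[rng.choice(len(mask_coords), size=n_foci)].astype(float)
    if activates_target:
        # Replace the first coordinate with the displaced true coordinate
        # (factor range assumed for illustration).
        foci[0] = np.asarray(true_coord) * rng.uniform(0.9, 1.1)
    return n_subjects, foci
```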
FIGURE 2
Behavior of ALE scores and the corresponding p‐values under the different levels of the two simulation parameters (number of experiments and number of experiments activating the target location) and their 341 combinations. The total number of experiments included in the ALE analysis is color‐coded in a spectral sequence from 15 experiments (purple) to 45 experiments (red). (a) Average ALE score (over 500 iterations) at the ground‐truth location. ALE scores increased linearly as a function of the number of experiments activating the target location, but also with the total number of experiments, due to the increased chance of (positive) interference by noise foci. (b) Average p‐value (over 500 iterations) at the ground‐truth location. p‐values decreased with a higher number of experiments activating the target location but increased with the total number of experiments because of a rightward shift of the null distribution. (c) ALE scores versus p‐values at the ground‐truth location for all 170,500 simulations. The more experiments were included in an ALE analysis, the more convergence (i.e., a higher ALE score) was needed to obtain the same p‐value.
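The behavior in panels (a) and (c) follows directly from how ALE combines per-experiment modeled-activation (MA) values at each voxel: the ALE score is the union probability 1 - prod(1 - MA_i). A minimal single-voxel sketch:

```python
import numpy as np

def ale_score(ma_values):
    """Union of per-experiment modeled-activation (MA) probabilities
    at a single voxel: ALE = 1 - prod(1 - MA_i)."""
    ma = np.asarray(ma_values, dtype=float)
    return 1.0 - np.prod(1.0 - ma)

# Adding an experiment with any nonzero MA value raises the score, which
# is why the null distribution (and hence the p-value for a fixed score)
# shifts as the total number of experiments grows.
print(ale_score([0.2, 0.3]))       # ≈ 0.44
print(ale_score([0.2, 0.3, 0.1]))  # ≈ 0.496
```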
FIGURE 3
(a–c) Sensitivity of ALE when applying different multiple‐comparison correction methods for statistical inference. The number of experiments activating the target location is represented on the x‐axis, while each total number of experiments has its own curve in the graph, following a spectral color sequence (15—purple; 45—red). The curves show the average sensitivity over the 500 iterations of each parameter combination. For all three methods, sensitivity increased in an approximately sigmoid fashion as a function of the number of experiments activating the target location. Additionally, larger datasets required more experiments activating the target location to achieve the same sensitivity. (d) Zooming in on the difference in sensitivity between cFWE correction and TFCE: the differences for individual dataset sizes are displayed in gray and the average over all dataset sizes in red. cFWE correction performed better on average, especially between 4 and 8 experiments activating the target location. For a few dataset sizes, TFCE had a slight sensitivity advantage at 3–4 experiments activating the target location.
FIGURE 4
(a–c) Cluster size of statistically significant areas of convergence that include at least one voxel within a 4‐mm radius around the true location, under the different levels of the two simulation parameters (number of experiments and number of experiments activating the target location) and their 341 combinations. The number of experiments activating the target location was strongly positively correlated with cluster size, while the total number of experiments showed a negative correlation. cFWE correction featured the largest clusters, closely followed by TFCE. The clusters declared significant by vFWE correction were exceedingly small in comparison. (d) Zooming in on the difference in cluster sizes between cFWE and TFCE corrections: the difference became more pronounced with fewer experiments activating the target location, because cFWE correction can only ever declare relatively large clusters significant, while TFCE can potentially yield single significant voxels.
FIGURE 5
The likelihood of additional significant clusters as a function of the number of experiments activating the target location, averaged across the total number of experiments (blue line). As can be seen, all three multiple‐comparison correction methods largely succeeded in controlling the alpha error at the nominal level of .05.
FIGURE 6
(a) Sensitivity of ALE in a large‐scale meta‐analysis setting when applying different multiple‐comparison correction methods for statistical inference. The general trend observed in the main simulations held for large‐scale datasets as well: sensitivity increased as a sigmoid function of the number of experiments activating the target location, and the lower the proportion of experiments activating the target location relative to the total number of experiments, the lower the sensitivity. Lower right: Zooming in on the difference between cFWE and TFCE corrections: even though TFCE performed slightly better at 7 and 8 experiments activating the target location for datasets including 150 studies, cFWE showed higher sensitivity on average. (b) Sensitivity of ALE corrected with TFCE using different levels of the E and H exponents, for a dataset of n = 30 experiments. The standard setting (indicated in red), described in the literature as fixed, is H = 2 and E = 0.5. We used combinations of H = [1.8, 2.0, 2.2] with E = [0.3, 0.5, 0.7] (indicated in gray) to see whether other values would improve performance. Overall, the standard parameter setting performed best or at least on par with the other settings.
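The E and H parameters tuned in panel (b) come from the standard TFCE transform, in which each voxel accumulates e(h)^E * h^H * dh over every threshold h at which it is suprathreshold, where e(h) is the extent of its connected suprathreshold cluster at h. A minimal 1-D sketch (function name, step size, and input are illustrative; real implementations work on 3-D images with 3-D connectivity):

```python
import numpy as np

def tfce_1d(stat, dh=0.1, E=0.5, H=2.0):
    """Threshold-free cluster enhancement for a 1-D statistic map (sketch).

    For each threshold h, every suprathreshold point is credited with
    extent**E * h**H * dh, where extent is the length of the connected
    suprathreshold run containing it. H=2, E=0.5 is the standard setting.
    """
    stat = np.asarray(stat, dtype=float)
    out = np.zeros_like(stat)
    for h in np.arange(dh, stat.max() + dh, dh):
        above = stat >= h
        start = None
        # Scan for connected runs of suprathreshold points; the appended
        # False closes a run that reaches the end of the array.
        for i, flag in enumerate(np.append(above, False)):
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                extent = i - start
                out[start:i] += extent**E * h**H * dh
                start = None
    return out
```

Because high peaks dominate via h**H while broad low-level support still contributes via extent**E, no voxel-level cluster-forming threshold is needed.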
FIGURE 7
Computation time required for a single null permutation of each multiple‐comparison correction method. Times were measured for 50 datasets per dataset size (15–45), totaling 1,550 time points. vFWE and cFWE corrections ran almost equally fast, while TFCE took up to nine times as long as the other two methods.
