Hum Brain Mapp. 2018 Jan;39(1):300-318.
doi: 10.1002/hbm.23843. Epub 2017 Oct 11.

Reproducibility of R-fMRI metrics on the impact of different strategies for multiple comparison correction and sample sizes


Xiao Chen et al. Hum Brain Mapp. 2018 Jan.

Abstract

Concerns have been raised about the reproducibility of resting-state functional magnetic resonance imaging (R-fMRI) findings. Little is known about how to operationally define R-fMRI reproducibility, or to what extent it is affected by the multiple comparison correction strategy and the sample size. We comprehensively assessed two aspects of reproducibility, test-retest reliability and replicability, for widely used R-fMRI metrics in both between-subject contrasts of sex differences and within-subject comparisons of eyes-open and eyes-closed (EOEC) conditions. We found that a permutation test with Threshold-Free Cluster Enhancement (TFCE), a strict multiple comparison correction strategy, achieved the best balance between family-wise error rate (under 5%) and test-retest reliability/replicability (e.g., for the amplitude of low-frequency fluctuations (ALFF), 0.68 for test-retest reliability and 0.25 for replicability of between-subject sex differences, and 0.49 for replicability of within-subject EOEC differences). Although R-fMRI indices attained moderate reliabilities, they replicated poorly across distinct datasets (replicability < 0.3 for between-subject sex differences and < 0.5 for within-subject EOEC differences). By randomly drawing samples of different sizes from a single site, we found that reliability, sensitivity and positive predictive value (PPV) rose as sample size increased. For sex differences, small samples (e.g., < 80 [40 per group]) not only yielded very low power (sensitivity < 2%) but also decreased the likelihood that significant results reflect "true" effects (PPV < 0.26). Our findings have implications for how to select multiple comparison correction strategies and highlight the importance of sufficiently large samples for enhancing the reproducibility of R-fMRI studies. Hum Brain Mapp 39:300-318, 2018. © 2017 Wiley Periodicals, Inc.
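The sample-size analysis repeatedly draws random subsamples of a given size from one site. The Python sketch below illustrates only that resampling step, assuming a balanced design with equal group sizes (mirroring the "40 per group" example) and sampling without replacement. The subject IDs, group sizes and function names are hypothetical, and the downstream steps (computing metrics, statistics and correction on each subsample) are not shown.

```python
import random


def draw_balanced_subsample(group_a, group_b, total_n, rng):
    """Draw total_n subjects without replacement, half from each group."""
    if total_n % 2:
        raise ValueError("total_n must be even for a balanced two-group draw")
    per_group = total_n // 2
    if per_group > min(len(group_a), len(group_b)):
        raise ValueError("sample size exceeds the subjects available in a group")
    return rng.sample(group_a, per_group), rng.sample(group_b, per_group)


def subsample_series(group_a, group_b, sizes, n_repeats, seed=0):
    """Yield (size, (draw_a, draw_b)) for n_repeats independent draws per size."""
    rng = random.Random(seed)  # seeded so that the draws are reproducible
    for size in sizes:
        for _ in range(n_repeats):
            yield size, draw_balanced_subsample(group_a, group_b, size, rng)


# Hypothetical single site with 200 subjects per group.
males = [f"M{i:03d}" for i in range(200)]
females = [f"F{i:03d}" for i in range(200)]
for size, (a, b) in subsample_series(males, females, [40, 80, 160], n_repeats=2):
    # In a full analysis, each (a, b) draw would feed the group comparison.
    print(size, len(a), len(b))
```

Seeding a dedicated `random.Random` instance, rather than the module-level generator, keeps the sequence of subsamples reproducible without affecting other code that uses `random`.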

Keywords: multiple comparison correction strategies; positive predictive value; replicability; reproducibility; resting-state fMRI; sample size; sensitivity; test-retest reliability.


Conflict of interest statement

The authors indicate no conflict of interest.

Figures

Figure 1
FWERs of ALFF (without global signal regression, GSR) under 31 different multiple comparison correction strategies. AFNI 3dClustSim and DPABI AlphaSim are two implementations of Monte Carlo simulation-based correction, in AFNI and DPABI, respectively. GRF, PT, and FDR are Gaussian random field correction, permutation test, and false discovery rate correction implemented in DPABI, respectively. TFCE stands for threshold-free cluster enhancement and VOX for voxel-wise correction; both are used in conjunction with PT. The red solid line shows the nominal 5% false positive rate, and the gray dashed lines show its approximate theoretical 95% CI, 3.65%–6.35%.
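An FWER like those in Figure 1 can be estimated as the fraction of null comparisons in which at least one voxel survives correction. The dashed band is consistent with a normal-approximation binomial interval around 5%, p ± 1.96·√(p(1 − p)/n), with n = 1000 null comparisons. That n is inferred here from the reported 3.65%–6.35% and is not stated in the caption. In the minimal sketch below, the input format (a count of surviving voxels per null experiment) is likewise an assumption.

```python
import math


def empirical_fwer(surviving_voxel_counts):
    """Fraction of null experiments with at least one voxel surviving correction.

    surviving_voxel_counts: one count per null experiment (hypothetical format).
    """
    counts = list(surviving_voxel_counts)
    return sum(1 for c in counts if c > 0) / len(counts)


def binomial_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a proportion p estimated from n trials."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = binomial_ci(0.05, 1000)
print(f"{low:.2%} - {high:.2%}")  # 3.65% - 6.35%
print(empirical_fwer([0, 0, 3, 0]))  # 0.25
```

An estimated FWER that falls above the upper bound of this band indicates more false positives than sampling error alone would explain at the nominal 5% level.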
Figure 2
Results of the Friedman test of test–retest reliability and replicability for between-subject sex differences and within-subject EOEC differences on five metrics under two preprocessing strategies (with and without GSR), comparing three cluster-based corrections at the strictest threshold, six PT-based corrections, and FDR correction. (A) Test–retest reliability for between-subject sex differences. (B) Replicability for between-subject sex differences. (C) Replicability for within-subject EOEC differences. Larger median ranks indicate better reproducibility relative to the other thresholding approaches. PT with TFCE is outlined in red, and approaches whose reproducibility differs significantly from that of PT with TFCE are outlined in yellow (multiple comparisons corrected with Tukey's honestly significant difference criterion). GRF, PT, and FDR stand for Gaussian random field correction, permutation test, and false discovery rate correction, respectively. All cluster-based corrections use one-tailed P values, whereas all PT variants use two-tailed P values.
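The Friedman test compares approaches by ranking them within each block and asks whether their rank sums differ more than chance would allow. Here a block could plausibly be one combination of metric and preprocessing strategy (5 × 2 = 10 blocks). The pure-Python sketch below computes the statistic (without a tie correction) and the median rank per approach, giving higher reproducibility a higher rank as in the figure. The block definition and the scores are assumptions for illustration, and the Tukey HSD post hoc step is not shown.

```python
import statistics


def rank_with_ties(values):
    """Return 1-based ascending ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman(blocks):
    """Friedman chi-square (no tie correction) and median rank per approach.

    blocks: rows of scores, one score per approach in a fixed column order.
    """
    n, k = len(blocks), len(blocks[0])
    per_approach = [[] for _ in range(k)]
    for row in blocks:
        for j, r in enumerate(rank_with_ties(row)):
            per_approach[j].append(r)
    rank_sums = [sum(rs) for rs in per_approach]
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)
    return chi2, [statistics.median(rs) for rs in per_approach]


# Invented scores: 4 blocks x 3 approaches, the third approach always best.
scores = [[0.2, 0.4, 0.7], [0.1, 0.3, 0.6], [0.3, 0.2, 0.5], [0.25, 0.35, 0.68]]
chi2, median_ranks = friedman(scores)
print(chi2, median_ranks)  # 6.5 [1.0, 2.0, 3.0]
```

The statistic is referred to a chi-square distribution with k − 1 degrees of freedom (k approaches). SciPy's `scipy.stats.friedmanchisquare` computes a tie-corrected version of the same statistic.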
Figure 3
Sex differences that are significant in both sessions of the CORR dataset and also significant in the FCP dataset (“gold standard”), corrected with PT with TFCE.
Figure 4
EOEC differences that are significant in either or both of the two EOEC datasets, corrected with PT with TFCE. Colors indicate whether a voxel's EOEC difference is significant in only one dataset (dark) or in both datasets (bright).
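The maps in Figures 3 and 4 are voxel-wise combinations of thresholded significance maps. For Figure 3, that combination is a conjunction across the two CORR sessions and FCP. For Figure 4, it is a count of how many EOEC datasets flag each voxel. The minimal sketch below assumes each map is a flattened list of 0/1 voxels after correction. The map values and function names are hypothetical.

```python
def _check_same_length(maps):
    """Require at least one map and identical voxel counts across maps."""
    if len({len(m) for m in maps}) != 1:
        raise ValueError("all maps must have the same number of voxels")


def conjunction(*maps):
    """True where a voxel is significant in every map (e.g. a 'gold standard')."""
    _check_same_length(maps)
    return [all(vals) for vals in zip(*maps)]


def overlap_count(*maps):
    """Number of maps in which each voxel is significant (0 .. len(maps))."""
    _check_same_length(maps)
    return [sum(bool(v) for v in vals) for vals in zip(*maps)]


# Invented toy maps: 1 = significant after correction.
corr_s1 = [1, 1, 0, 1, 0]
corr_s2 = [1, 0, 0, 1, 1]
fcp = [1, 1, 0, 0, 1]
print(conjunction(corr_s1, corr_s2, fcp))  # [True, False, False, False, False]

eoec1 = [1, 0, 1, 0]
eoec2 = [1, 1, 0, 0]
print(overlap_count(eoec1, eoec2))  # [2, 1, 1, 0]  (1 = dark, 2 = bright)
```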
Figure 5
Test–retest reliability (Dice index), sensitivity and PPV for ALFF (without GSR) as functions of sample size.
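The three curves in Figure 5 can be computed from binary (thresholded) maps. The minimal sketch below assumes maps are flattened lists of 0/1 voxels. Test–retest reliability is taken as the Dice overlap of the two sessions' maps. Sensitivity and PPV are computed against a reference map such as the “gold standard” of Figure 3. Whether the study used exactly that reference for the sample-size analysis is an assumption here, as are the function names.

```python
import math


def _counts(result, truth):
    """Return (TP, FP, FN) for two equal-length binary voxel maps."""
    if len(result) != len(truth):
        raise ValueError("maps must have the same number of voxels")
    tp = sum(1 for r, t in zip(result, truth) if r and t)
    fp = sum(1 for r, t in zip(result, truth) if r and not t)
    fn = sum(1 for r, t in zip(result, truth) if t and not r)
    return tp, fp, fn


def dice(map_a, map_b):
    """Dice index 2|A∩B| / (|A| + |B|); NaN when both maps are empty."""
    tp, fp, fn = _counts(map_a, map_b)
    denom = 2 * tp + fp + fn  # equals |A| + |B|
    return 2 * tp / denom if denom else math.nan


def sensitivity_ppv(result, truth):
    """Sensitivity TP/(TP+FN) and PPV TP/(TP+FP) against a reference map."""
    tp, fp, fn = _counts(result, truth)
    sens = tp / (tp + fn) if tp + fn else math.nan
    ppv = tp / (tp + fp) if tp + fp else math.nan
    return sens, ppv


session1 = [1, 1, 1, 0, 0, 0]
session2 = [1, 1, 0, 1, 0, 0]
print(dice(session1, session2))  # 2*2 / (3+3) = 0.666...

gold = [1, 1, 1, 1, 0, 0]
small_sample = [1, 0, 0, 0, 1, 0]
print(sensitivity_ppv(small_sample, gold))  # (0.25, 0.5)
```

Returning NaN for empty denominators keeps undefined values visible. This matters for small samples, where a map may contain no significant voxels at all.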
Figure 6
Effect sizes (Cohen's f²) of between-subject sex differences (A; calculated from the first session of the CORR dataset, n = 420) and within-subject EOEC differences (B; calculated from the Beijing EOEC1 dataset, n = 48). Cohen's f² maps were thresholded at f² > 0.02 (small effect size).
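Cohen's f² is conventionally defined as R²/(1 − R²), with 0.02 as the usual "small effect" benchmark. For a test of a single regression coefficient, the partial R² equals t²/(t² + df), so f² = t²/df. The caption does not say which route the authors took, so the sketch below is an assumption. It shows the conversion and the f² > 0.02 threshold applied to toy voxel values, and the function names and degrees of freedom are hypothetical.

```python
def cohens_f2_from_r2(r2):
    """Cohen's f^2 = R^2 / (1 - R^2)."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return r2 / (1.0 - r2)


def cohens_f2_from_t(t, df):
    """f^2 for one coefficient: partial R^2 = t^2/(t^2 + df), hence f^2 = t^2/df."""
    return t * t / df


def threshold_f2(f2_map, cutoff=0.02):
    """Keep voxels whose f^2 exceeds the 'small effect' cut-off; zero the rest."""
    return [f2 if f2 > cutoff else 0.0 for f2 in f2_map]


# Hypothetical voxel-wise t values from a model with 100 residual degrees of freedom.
t_values = [0.5, 1.0, 2.0, 3.0]
f2_map = [cohens_f2_from_t(t, df=100) for t in t_values]
print(threshold_f2(f2_map))  # [0.0, 0.0, 0.04, 0.09]
```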
