Reproducibility of R-fMRI metrics on the impact of different strategies for multiple comparison correction and sample sizes

Xiao Chen^{1,2}, Bin Lu^{1,2}, Chao-Gan Yan^{1,2,3,4}

Affiliations

¹ CAS Key Laboratory of Behavioral Science, Institute of Psychology, Beijing, China.
² Department of Psychology, University of Chinese Academy of Sciences, Beijing, China.
³ Magnetic Resonance Imaging Research Center, Institute of Psychology, Chinese Academy of Sciences, Beijing, China.
⁴ Department of Child and Adolescent Psychiatry, NYU Langone Medical Center, School of Medicine, New York, NY, USA.

PMID: 29024299
PMCID: PMC6866539
DOI: 10.1002/hbm.23843

Xiao Chen et al. Hum Brain Mapp. 2018 Jan;39(1):300-318. doi: 10.1002/hbm.23843. Epub 2017 Oct 11.

Abstract

Concerns regarding reproducibility of resting-state functional magnetic resonance imaging (R-fMRI) findings have been raised. Little is known about how to operationally define R-fMRI reproducibility and to what extent it is affected by multiple comparison correction strategies and sample size. We comprehensively assessed two aspects of reproducibility, test-retest reliability and replicability, on widely used R-fMRI metrics in both between-subject contrasts of sex differences and within-subject comparisons of eyes-open and eyes-closed (EOEC) conditions. We noted permutation test with Threshold-Free Cluster Enhancement (TFCE), a strict multiple comparison correction strategy, reached the best balance between family-wise error rate (under 5%) and test-retest reliability/replicability (e.g., 0.68 for test-retest reliability and 0.25 for replicability of amplitude of low-frequency fluctuations (ALFF) for between-subject sex differences, 0.49 for replicability of ALFF for within-subject EOEC differences). Although R-fMRI indices attained moderate reliabilities, they replicated poorly in distinct datasets (replicability < 0.3 for between-subject sex differences, < 0.5 for within-subject EOEC differences). By randomly drawing different sample sizes from a single site, we found reliability, sensitivity and positive predictive value (PPV) rose as sample size increased. Small sample sizes (e.g., < 80 [40 per group]) not only minimized power (sensitivity < 2%), but also decreased the likelihood that significant results reflect "true" effects (PPV < 0.26) in sex differences. Our findings have implications for how to select multiple comparison correction strategies and highlight the importance of sufficiently large sample sizes in R-fMRI studies to enhance reproducibility. Hum Brain Mapp 39:300-318, 2018. © 2017 Wiley Periodicals, Inc.

Keywords: multiple comparison correction strategies; positive predictive value; replicability; reproducibility; resting-state fMRI; sample size; sensitivity; test-retest reliability.
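
The reproducibility measures named in the abstract can be illustrated on binary significance maps. The sketch below is a minimal illustration, not the authors' pipeline: it assumes each thresholded statistical map is a boolean array (True = significant voxel), uses the Dice index for overlap between two sessions (the measure named in the Figure 5 caption), and uses the standard definitions of sensitivity, TP / (TP + FN), and PPV, TP / (TP + FP), against a reference map. The function names and the toy arrays are hypothetical, and how the reference map is constructed is not specified in this record.

```python
import numpy as np


def dice_index(map_a, map_b):
    """Dice overlap between two binary significance maps: 2|A∩B| / (|A| + |B|)."""
    a = np.asarray(map_a, dtype=bool)
    b = np.asarray(map_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return float("nan")  # undefined when neither map has significant voxels
    return 2.0 * np.logical_and(a, b).sum() / total


def sensitivity_and_ppv(detected, reference):
    """Sensitivity and PPV of a detected map against a reference ("true") map."""
    d = np.asarray(detected, dtype=bool)
    r = np.asarray(reference, dtype=bool)
    true_pos = np.logical_and(d, r).sum()
    # Sensitivity: share of reference voxels that were detected.
    sensitivity = true_pos / r.sum() if r.sum() else float("nan")
    # PPV: share of detected voxels that are in the reference.
    ppv = true_pos / d.sum() if d.sum() else float("nan")
    return float(sensitivity), float(ppv)


# Toy 8-voxel maps (hypothetical data, for illustration only).
session1 = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)
session2 = np.array([1, 1, 0, 0, 0, 1, 1, 0], dtype=bool)
small_sample = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)

reliability = dice_index(session1, session2)
sens, ppv = sensitivity_and_ppv(small_sample, session1)
print(reliability, sens, ppv)
```

In this toy case the small-sample map detects only half of the reference voxels while every detected voxel is in the reference, which shows that sensitivity and PPV can diverge and need to be reported separately.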

Conflict of interest statement

The authors indicate no conflict of interest.

Figures

**Figure 1**
FWERs of ALFF (without GSR) under 31 different multiple comparison correction strategies. AFNI 3dClustSim and DPABI AlphaSim are two versions of Monte Carlo simulation-based correction, implemented in AFNI and DPABI, respectively. GRF, PT, and FDR are Gaussian random field correction, permutation test, and false discovery rate correction, all implemented in DPABI. TFCE stands for threshold‐free cluster enhancement and VOX stands for voxel‐wise correction; both are correction approaches used with PT. The red solid line shows the nominal 5% false positive rate, and the gray dashed line shows its approximate theoretical 95% CI, 3.65%–6.35%. [Color figure can be viewed at http://wileyonlinelibrary.com]

**Figure 2**
Results of the Friedman test of test–retest reliabilities and replicabilities for between‐subject sex differences and within‐subject EOEC differences on five metrics under two preprocessing strategies (with and without GSR), among three kinds of cluster‐based correction with the strictest threshold, six kinds of PT‐based correction, and FDR correction: (A) test–retest reliability for between‐subject sex differences; (B) replicability for between‐subject sex differences; (C) replicability for within‐subject EOEC differences. Larger median ranks represent better reproducibility relative to the other statistical thresholding approaches. PT with TFCE is outlined in red, and approaches that differ significantly from PT with TFCE in reproducibility are outlined in yellow (corrected for multiple comparisons by Tukey's honest significant difference criterion). GRF, PT, and FDR stand for Gaussian random field correction, permutation test, and false discovery rate correction, respectively. All cluster‐based corrections use one‐tailed P values, while all PT versions use two‐tailed P values. [Color figure can be viewed at http://wileyonlinelibrary.com]

**Figure 3**
Sex differences that are significant in both sessions of the CORR dataset as well as in the FCP dataset (“gold standard”), under correction by PT with TFCE. [Color figure can be viewed at http://wileyonlinelibrary.com]

**Figure 4**
EOEC differences that are significant in the two EOEC datasets, under correction by PT with TFCE. Colors indicate whether a voxel's EOEC difference is significant in only one dataset (dark color) or in both datasets (bright color). [Color figure can be viewed at http://wileyonlinelibrary.com]

**Figure 5**
Test–retest reliability (Dice index), sensitivity and PPV on ALFF (without GSR) as functions of sample size.

**Figure 6**
Effect sizes (Cohen's f²) of between‐subject sex differences (A, calculated with the first session of the CORR dataset, n = 420) and within‐subject EOEC differences (B, calculated with the Beijing EOEC1 dataset, n = 48). Cohen's f² maps were thresholded at f² > 0.02 (small effect size). [Color figure can be viewed at http://wileyonlinelibrary.com]
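
The 95% CI of 3.65%–6.35% quoted in the Figure 1 caption can be reproduced with a normal-approximation binomial interval around the nominal 5% rate. The sketch below is an illustration under one assumption that this record does not state: that the empirical FWER was estimated from 1000 random tests. That count is inferred only because it yields the quoted interval; the function name is hypothetical.

```python
import math


def binomial_ci_95(p_nominal, n_tests):
    """Normal-approximation 95% interval for a proportion estimated from n_tests."""
    standard_error = math.sqrt(p_nominal * (1.0 - p_nominal) / n_tests)
    half_width = 1.96 * standard_error
    return p_nominal - half_width, p_nominal + half_width


# n_tests = 1000 is an assumption inferred from the quoted interval.
low, high = binomial_ci_95(0.05, 1000)
print(f"{low:.2%} to {high:.2%}")  # prints "3.65% to 6.35%"
```

Under this reading, a correction strategy whose empirical FWER falls above 6.35% would be treated as not controlling the nominal 5% rate, rather than as deviating by chance.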
