bioRxiv [Preprint]. 2023 Oct 30:2023.10.25.563971. doi: 10.1101/2023.10.25.563971.

Power and reproducibility in the external validation of brain-phenotype predictions


Matthew Rosenblatt et al. bioRxiv.


Abstract

Identifying reproducible and generalizable brain-phenotype associations is a central goal of neuroimaging. Consistent with this goal, prediction frameworks evaluate brain-phenotype models in unseen data. Most prediction studies train and evaluate a model within a single dataset. However, external validation, or the evaluation of a model in an external dataset, provides a better assessment of robustness and generalizability. Despite the promise of external validation and calls for its usage, the statistical power of such studies has yet to be investigated. In this work, we ran over 60 million simulations across several datasets, phenotypes, and sample sizes to better understand how the sizes of the training and external datasets affect statistical power. We found that prior external validation studies used sample sizes prone to low power, which may lead to false negatives and effect size inflation. Furthermore, increasing the external sample size increased simulated power in direct accordance with theoretical power curves, whereas changing the training dataset size shifted the simulated power curves away from the theoretical ones. Finally, we compared a model's within-dataset performance to its external performance. The within-dataset performance was typically within r=0.2 of the cross-dataset performance, which could help guide power calculations for future external validation studies. Overall, our results illustrate the importance of considering the sample sizes of both the training and external datasets when performing external validation.


Figures

Figure 1.
Within-dataset held-out prediction performance in HBN for age, attention problems, and matrix reasoning. The performance was evaluated in a randomly selected held-out sample of size n=200. The error bars show the 2.5th and 97.5th percentiles among 100 repeats of resampling at each training sample size. The dotted line reflects the correlation value required for a significance level of p<0.05. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figures S2–3. AP: attention problems, MR: matrix reasoning.
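For context, the dotted significance threshold in Figure 1 can be recovered from the standard t-test for a Pearson correlation. A minimal Python sketch, assuming a two-sided test at alpha=0.05 (illustrative, not the authors' code):

import numpy as np
from scipy import stats

def critical_r(n, alpha=0.05):
    # Smallest |r| reaching significance at level alpha (two-sided),
    # using t = r * sqrt(n - 2) / sqrt(1 - r^2) with df = n - 2.
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(n - 2 + t_crit**2)

print(critical_r(200))  # ~0.139 for the n=200 held-out sample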
Figure 2.
Power and false positive rates for cross-dataset predictions, training in HBN and testing in ABCD (top row), HCPD (middle row), or PNC (bottom row) for prediction of age (left column), attention problems (middle column), or matrix reasoning (right column). The blue lines represent theoretical power assuming a known ground truth performance. The panel with N/A means that data were not included in this study. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figure S4. AP: attention problems, MR: matrix reasoning.
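The theoretical power shown by the blue lines can be approximated with the Fisher z-transform of a correlation, a standard formulation (the paper's exact derivation may differ). A hedged Python sketch, with r=0.2 as an assumed ground-truth cross-dataset performance:

import numpy as np
from scipy import stats

def correlation_power(rho, n, alpha=0.05):
    # Approximate power of a two-sided test of H0: r = 0, using the
    # Fisher z-approximation: atanh(r) is ~normal with SD 1/sqrt(n - 3).
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    shift = np.arctanh(rho) * np.sqrt(n - 3)  # mean of Fisher z under H1
    return (1 - stats.norm.cdf(z_alpha - shift)
            + stats.norm.cdf(-z_alpha - shift))

for n in (50, 100, 200, 500):  # candidate external sample sizes
    print(n, round(correlation_power(0.2, n), 3))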
Figure 3.
Median effect size inflation for cross-dataset predictions, training in HBN and testing in ABCD (top row), HCPD (middle row), or PNC (bottom row) for prediction of age (left column), attention (middle column), or matrix reasoning (right column). Panels with N/A mean that data were not available. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figure S5. AP: attention problems, MR: matrix reasoning.
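The inflation plotted here reflects a general property of underpowered studies: conditioning on significance biases observed effect sizes upward. A small illustrative simulation (the true effect, sample size, and data-generating model below are assumptions, not taken from the paper):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n, alpha = 0.15, 100, 0.05  # assumed true effect and external sample size

sig_rs = []
for _ in range(5_000):
    # Draw (prediction, phenotype) pairs with true correlation rho.
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    r, p = stats.pearsonr(x, y)
    if p < alpha:
        sig_rs.append(r)

print(np.median(sig_rs) - rho)  # median inflation among significant runs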
Figure 4.
Boxplots of the difference between internal and external performance for each subsample of the training data. For each training sample size, 100 random subsamples were taken. Internal performance was evaluated in a held-out sample of size n=200; for external performance, the model trained on each subsample was applied to the full external dataset. Panels with N/A mean that data were not available. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figure S6. AP: attention problems, MR: matrix reasoning.

