bioRxiv [Preprint]. 2023 Oct 30:2023.10.25.563971. doi: 10.1101/2023.10.25.563971.

Power and reproducibility in the external validation of brain-phenotype predictions


Matthew Rosenblatt et al. bioRxiv.


Abstract

Identifying reproducible and generalizable brain-phenotype associations is a central goal of neuroimaging. Consistent with this goal, prediction frameworks evaluate brain-phenotype models in unseen data. Most prediction studies train and evaluate a model within a single dataset. However, external validation, or the evaluation of a model in an external dataset, provides a better assessment of robustness and generalizability. Despite the promise of external validation and calls for its usage, the statistical power of such studies has yet to be investigated. In this work, we ran over 60 million simulations across several datasets, phenotypes, and sample sizes to better understand how the sizes of the training and external datasets affect statistical power. We found that prior external validation studies used sample sizes prone to low power, which may lead to false negatives and effect size inflation. Furthermore, increasing the external sample size increased simulated power in direct accordance with theoretical power curves, whereas changing the training dataset size shifted the simulated power curves away from the theoretical ones. Finally, we compared a model's within-dataset performance to its external performance. The within-dataset performance was typically within r=0.2 of the cross-dataset performance, which could help guide power calculations for future external validation studies. Overall, our results illustrate the importance of considering the sample sizes of both the training and external datasets when performing external validation.


Figures

Figure 1.
Within-dataset held-out prediction performance in HBN for age, attention problems, and matrix reasoning. The performance was evaluated in a randomly selected held-out sample of size n=200. The error bars show the 2.5th and 97.5th percentiles among 100 repeats of resampling at each training sample size. The dotted line reflects the correlation value required for a significance level of p<0.05. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figures S2–3. AP: attention problems, MR: matrix reasoning.
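For context, the dotted significance threshold in Figure 1 can be recovered from the standard t-test for a Pearson correlation. A minimal Python sketch, assuming a two-sided test at alpha=0.05 (illustrative, not the authors' code):

import numpy as np
from scipy import stats

def critical_r(n, alpha=0.05):
    # Smallest |r| reaching significance at level alpha (two-sided),
    # using t = r * sqrt(n - 2) / sqrt(1 - r^2) with df = n - 2.
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(n - 2 + t_crit**2)

print(critical_r(200))  # ~0.139 for the n=200 held-out sample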
Figure 2.
Power and false positive rates for cross-dataset predictions, training in HBN and testing in ABCD (top row), HCPD (middle row), or PNC (bottom row) for prediction of age (left column), attention problems (middle column), or matrix reasoning (right column). The blue lines represent theoretical power assuming a known ground truth performance. The panel with N/A means that data were not included in this study. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figure S4. AP: attention problems, MR: matrix reasoning.
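The theoretical power shown by the blue lines can be approximated with the Fisher z-transform of a correlation, a standard formulation (the paper's exact derivation may differ). A hedged Python sketch, with r=0.2 as an assumed ground-truth cross-dataset performance:

import numpy as np
from scipy import stats

def correlation_power(rho, n, alpha=0.05):
    # Approximate power of a two-sided test of H0: r = 0, using the
    # Fisher z-approximation: atanh(r) is ~normal with SD 1/sqrt(n - 3).
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    shift = np.arctanh(rho) * np.sqrt(n - 3)  # mean of Fisher z under H1
    return (1 - stats.norm.cdf(z_alpha - shift)
            + stats.norm.cdf(-z_alpha - shift))

for n in (50, 100, 200, 500):  # candidate external sample sizes
    print(n, round(correlation_power(0.2, n), 3))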
Figure 3.
Median effect size inflation for cross-dataset predictions, training in HBN and testing in ABCD (top row), HCPD (middle row), or PNC (bottom row) for prediction of age (left column), attention (middle column), or matrix reasoning (right column). Panels with N/A mean that data were not available. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figure S5. AP: attention problems, MR: matrix reasoning.
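The inflation plotted here reflects a general property of underpowered studies: conditioning on significance biases observed effect sizes upward. A small illustrative simulation (the true effect, sample size, and data-generating model below are assumptions, not taken from the paper):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, n, alpha = 0.15, 100, 0.05  # assumed true effect and external sample size

sig_rs = []
for _ in range(5_000):
    # Draw (prediction, phenotype) pairs with true correlation rho.
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    r, p = stats.pearsonr(x, y)
    if p < alpha:
        sig_rs.append(r)

print(np.median(sig_rs) - rho)  # median inflation among significant runs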
Figure 4.
Boxplots of the difference between internal and external performance for each subsample of the training data. For each training sample size, 100 random subsamples were taken. Internal performance was evaluated in a held-out sample of size n=200; for external performance, the model trained on each subsample was applied to the full external dataset. Panels with N/A mean that data were not available. Similar results were observed for the ABCD, HCPD, and PNC datasets; see Figure S6. AP: attention problems, MR: matrix reasoning.

