Hum Brain Mapp. 2025 Jun 15;46(9):e70256.

doi: 10.1002/hbm.70256.

Big Data, Small Bias: Harmonizing Diffusion MRI-Based Structural Connectomes to Mitigate Site-Related Bias in Data Integration

Rui Sherry Shen^{1,2}, Drew Parker², Andrew An Chen³, Benjamin E Yerys^{4,5,6}, Birkan Tunç^{4,5}, Timothy P L Roberts⁷, Russell T Shinohara⁸, Ragini Verma²

Affiliations

¹ Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
² Diffusion & Connectomics in Precision Healthcare Research, Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
³ Department of Public Health Sciences, Medical University of South Carolina, Charleston, South Carolina, USA.
⁴ Center for Autism Research, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
⁵ Department of Child and Adolescent Psychiatry and Behavioral Science, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
⁶ Advancing Transition and Learning for Adult Success Center, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
⁷ Program in Advanced Imaging Research, Department of Radiology, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
⁸ Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

PMID: 40563239
PMCID: PMC12198055
DOI: 10.1002/hbm.70256

Rui Sherry Shen et al. Hum Brain Mapp. 2025.

Abstract

Diffusion MRI-based structural connectomes are increasingly used to investigate brain connectivity changes associated with various disorders. However, small sample sizes in individual studies, along with highly heterogeneous disorder-related manifestations, underscore the need to pool datasets across multiple studies to identify coherent and generalizable connectivity patterns linked to these disorders. Yet, combining datasets introduces site-related differences due to variations in scanner hardware or acquisition protocols. These differences highlight the necessity for statistical data harmonization to mitigate site-related effects on structural connectomes while preserving the biological information associated with participant demographics and the disorders. While several paradigms exist for harmonizing normally distributed neuroimaging measures, this paper represents the first effort to establish a harmonization framework specifically tailored for the structural connectome. We conduct a thorough investigation of various statistical harmonization methods, adapting them to accommodate the unique distributional characteristics and graph-based properties of structural connectomes. Through rigorous evaluation, we show that our MATCH algorithm, based on the gamma-distributed model, consistently outperforms existing approaches in modeling structural connectomes, enabling the effective removal of site-related biases in both edge-based and downstream graph analyses while preserving biological variability. Two real-world applications further highlight the utility of our harmonization framework in addressing challenges in multi-site structural connectome analysis. Specifically, harmonization with MATCH enhances the generalizability of connectome-based machine learning predictors to new datasets and increases statistical power for detecting group-level differences. Our work provides essential guidelines for harmonizing multi-site structural connectomes, paving the way for more robust discoveries through collaborative research in the era of team science and big data.

Keywords: ComBat; CovBat; big data; diffusion MRI; gamma generalized linear model; harmonization; multi‐site analysis; structural connectome.
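
The abstract and keywords refer to a gamma generalized linear model. The sketch below is not the MATCH algorithm, whose details are not given on this page; it only illustrates, on simulated data, what fitting a gamma GLM with a log link to a single edge might look like, and how an estimated multiplicative site shift could be divided out while the age effect is left in place. All names (`solve`, `fit_gamma_glm`, `TRUE_BETA`, `SHAPE`), the two-site design, and the simulated values are assumptions made for illustration.

```python
import math
import random

def solve(a, b):
    """Solve a small dense linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_gamma_glm(X, y, n_iter=25):
    """Fit a gamma GLM with log link by iteratively reweighted least squares.

    For this family/link pair the IRLS weights are constant, so each step is an
    ordinary least-squares fit to the working response z = eta + (y - mu) / mu.
    """
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)  # assumes column 0 is the intercept
    for _ in range(n_iter):
        eta = [sum(b * v for b, v in zip(beta, r)) for r in X]
        z = [e + y_i / math.exp(e) - 1.0 for e, y_i in zip(eta, y)]
        xtz = [sum(r[i] * z_i for r, z_i in zip(X, z)) for i in range(p)]
        beta = solve(xtx, xtz)
    return beta

# Simulated data (hypothetical): one edge, two sites, age as the biological covariate.
random.seed(0)
TRUE_BETA = [math.log(50.0), 0.4, 0.03]  # intercept, site shift, age slope (log scale)
SHAPE = 5.0
X, y = [], []
for _ in range(2000):
    site = random.randint(0, 1)
    age = random.uniform(8.0, 22.0) - 15.0  # centered age in years
    row = [1.0, float(site), age]
    mu = math.exp(sum(b * v for b, v in zip(TRUE_BETA, row)))
    X.append(row)
    y.append(random.gammavariate(SHAPE, mu / SHAPE))  # mean = shape * scale = mu

beta = fit_gamma_glm(X, y)

# Remove only the estimated multiplicative site shift; the age effect is left in place.
y_adj = [y_i / math.exp(beta[1] * row[1]) for y_i, row in zip(y, X)]
beta_adj = fit_gamma_glm(X, y_adj)
print("site coefficient before:", round(beta[1], 3), "after:", round(beta_adj[1], 6))
```

This adjusts the site-specific mean only. A harmonization method in practice would also need to address site differences in variance, which this sketch deliberately leaves out.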

Conflict of interest statement

The authors declare the following potential conflicts of interest: Timothy P.L. Roberts holds stock in Prism Clinical Imaging, has a partnership interest in Proteus Neurodynamics, and has received consulting fees from Fieldline Inc. and WestCan Proton Therapy. Russell T. Shinohara has received consulting fees from Octave Bioscience and the American Medical Association. All other authors (Rui S. Shen, Drew Parker, Andrew A. Chen, Birkan Tunc, Benjamin E. Yerys, and Ragini Verma) report no conflicts of interest.

Figures

**FIGURE 1**
Age distributions of participants across six study sites. For each site, the top half of the violin plot represents typically developing controls (TDC), and the bottom half represents individuals with autism spectrum disorder (ASD). The width indicates the kernel density estimate of the age distribution, and the inner sticks denote individual participant ages.

**FIGURE 2**
Overview of the harmonization and evaluation framework for structural connectomes. After pooling MRI data from six sites and extracting structural connectomes, we created four distinct data configurations by combining the cohorts in various ways (Step 1). Different harmonization methods were applied to each data configuration for the method comparison (Step 2). We evaluated these harmonization methods from four perspectives: validation of distributional assumptions, removal of edgewise site effects, mitigation of site effects on graph properties, and preservation of biological integrity (Step 3). Finally, we demonstrated the practical utility of structural connectome harmonization methods through two use cases, highlighting their contributions to multi‐site studies (Step 4). The data configurations used in each evaluation step are indicated by their corresponding numbers.

**FIGURE 3**
Validation of distributional assumptions underlying different harmonization methods for the PNC site. (A) Edgewise KS distances between the observed and hypothesized distributions in PNC. Asterisks denote significant differences in KS distances between tested harmonization methods (p < 0.0001, paired t‐test). The distributional assumption underlying the MATCH framework provided a significantly better fit than those of the other harmonization models. (B) Heatmaps of edgewise KS distances, showing only the edges with significant discrepancy from the required distributional assumptions (p < 0.05, FDR‐adjusted).

**FIGURE 4**
Site‐related effects on mean and variance of edgewise strength in the paired PNC‐CAR cohort (Data Configuration 1). (A) MA‐plots for visualization of site differences between PNC and CAR with paired subjects. The x‐axis represents the averaged log‐transformed means across sites and the y‐axis represents between‐site differences in log‐transformed means. The horizontal line at zero indicates no site‐related effects. (B) Edgewise site effects on mean (first row) and variance (second row) connectivity strength, showing the Kruskal‐Wallis H statistics for mean site effects and F* statistics from the Brown‐Forsythe test for variance site effects. The number of edges with significant site effects was noted by e* in the top left corner of each plot.
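
The MA-plot coordinates described in panel (A) can be sketched as follows. This is a minimal illustration, assuming that "log-transformed means" refers to the log of each site's per-edge mean; the function name `ma_values`, the small offset `eps` used to guard against zero-weight edges, and the toy inputs are hypothetical and not taken from the paper.

```python
import math

def ma_values(site_a, site_b, eps=1e-6):
    """Return one (A, M) pair per edge.

    site_a, site_b: lists of subjects, each subject a list of edge weights.
    A is the average of the two log site means (x-axis),
    M is their difference (y-axis); M = 0 means no mean site effect.
    """
    n_edges = len(site_a[0])
    points = []
    for e in range(n_edges):
        log_a = math.log(sum(s[e] for s in site_a) / len(site_a) + eps)
        log_b = math.log(sum(s[e] for s in site_b) / len(site_b) + eps)
        points.append(((log_a + log_b) / 2.0, log_a - log_b))
    return points

# Toy example: edge 0 is twice as strong at site A, edge 1 has equal site means.
site_a = [[20.0, 4.0], [22.0, 6.0], [18.0, 5.0]]
site_b = [[10.0, 5.0], [11.0, 4.0], [9.0, 6.0]]
points = ma_values(site_a, site_b)
for a_val, m_val in points:
    print(round(a_val, 3), round(m_val, 3))
```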

**FIGURE 5**
Effect size of site differences on global graph topological measures in the paired PNC‐CAR cohort (Data Configuration 1). For each harmonization method, the Cohen's d effect sizes of sites were evaluated on six global graph topological measures (global strength, intra‐ and inter‐hemisphere strength, characteristic path length, global efficiency, modularity). Significant site effects were indicated by asterisks (p < 0.05, two‐sample t‐test).
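
For reference, Cohen's d for a two-site comparison of one graph measure can be computed as below. This is a generic pooled-standard-deviation sketch; the function name `cohens_d` and the toy values are hypothetical, and the exact variant used in the paper is not stated on this page.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation of the two groups."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Toy values standing in for a global graph measure at two sites.
site_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
site_2 = [2.0, 3.0, 4.0, 5.0, 6.0]
d = cohens_d(site_1, site_2)
print(round(d, 4))
```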

**FIGURE 6**
Effect size of site differences on nodal graph topological measures in the paired PNC‐CAR cohort (Data Configuration 1). Four nodewise topological measures (node strength, betweenness centrality, local efficiency and clustering coefficient) were evaluated. Significant site effects (p < 0.05, two‐sample t‐test, FDR‐adjusted) were indicated by red cross markers. The number of nodes with significant site effects was noted by n* on the right side of each plot. The dashed line marks the significance threshold. The shaded area indicates significant site effects.

**FIGURE 7**
Visualization of edgewise age associations before and after applying different harmonization methods. Two scenarios were tested: the paired PNC‐CAR cohort (Data Configuration 1, top two rows) and the confounded NYU‐TCD cohort (Data Configuration 2, bottom two rows). For each scenario, (A) edgewise correlations between connectivity strength and age were shown for each site before harmonization, displaying only significant age‐associated connections (p < 0.05, Spearman R). (B) The changes in edgewise correlations with age (ΔR) after each harmonization approach were shown in histograms. (C) The CAT curves visualized the concordance of edgewise age associations before and after harmonization for each method. A CAT curve closer to one indicated better preservation of age associations.
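
A concordance-at-the-top (CAT) curve such as the one in panel (C) can be sketched as follows: for each k, take the k top-ranked edges before and after harmonization and report the fraction they share. Ranking edges by the absolute value of the age correlation is an assumption here, and the function name `cat_curve` and the toy correlations are hypothetical.

```python
def cat_curve(scores_before, scores_after):
    """Concordance at the top: overlap of the top-k edges in two rankings, for k = 1..n.

    Edges are ranked by absolute score (largest first). A value of 1.0 at rank k
    means both rankings agree on the set of top-k edges.
    """
    n = len(scores_before)
    order_before = sorted(range(n), key=lambda i: -abs(scores_before[i]))
    order_after = sorted(range(n), key=lambda i: -abs(scores_after[i]))
    top_before, top_after = set(), set()
    curve = []
    for k in range(1, n + 1):
        top_before.add(order_before[k - 1])
        top_after.add(order_after[k - 1])
        # Set intersection per step keeps the sketch simple (quadratic in n).
        curve.append(len(top_before & top_after) / k)
    return curve

# Toy edgewise age correlations before and after a hypothetical harmonization.
r_before = [0.9, -0.5, 0.1, 0.3]
r_after = [0.1, -0.5, 0.9, 0.3]
unchanged = cat_curve(r_before, r_before)
shuffled = cat_curve(r_before, r_after)
print(unchanged)
print([round(v, 3) for v in shuffled])
```

The curve always ends at 1.0, because at k = n both top-k sets contain every edge; the informative part is how quickly it rises for small k.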

Grants and funding

R01 MH117807/MH/NIMH NIH HHS/United States
