Hum Brain Mapp. 2025 Jun 15;46(9):e70256.
doi: 10.1002/hbm.70256.

Big Data, Small Bias: Harmonizing Diffusion MRI-Based Structural Connectomes to Mitigate Site-Related Bias in Data Integration


Rui Sherry Shen et al. Hum Brain Mapp. 2025.

Abstract

Diffusion MRI-based structural connectomes are increasingly used to investigate brain connectivity changes associated with various disorders. However, small sample sizes in individual studies, along with highly heterogeneous disorder-related manifestations, underscore the need to pool datasets across multiple studies to identify coherent and generalizable connectivity patterns linked to these disorders. Yet, combining datasets introduces site-related differences due to variations in scanner hardware or acquisition protocols. These differences highlight the necessity for statistical data harmonization to mitigate site-related effects on structural connectomes while preserving the biological information associated with participant demographics and the disorders. While several paradigms exist for harmonizing normally distributed neuroimaging measures, this paper represents the first effort to establish a harmonization framework specifically tailored for the structural connectome. We conduct a thorough investigation of various statistical harmonization methods, adapting them to accommodate the unique distributional characteristics and graph-based properties of structural connectomes. Through rigorous evaluation, we show that our MATCH algorithm, based on the gamma-distributed model, consistently outperforms existing approaches in modeling structural connectomes, enabling the effective removal of site-related biases in both edge-based and downstream graph analyses while preserving biological variability. Two real-world applications further highlight the utility of our harmonization framework in addressing challenges in multi-site structural connectome analysis. Specifically, harmonization with MATCH enhances the generalizability of connectome-based machine learning predictors to new datasets and increases statistical power for detecting group-level differences. Our work provides essential guidelines for harmonizing multi-site structural connectomes, paving the way for more robust discoveries through collaborative research in the era of team science and big data.

Keywords: ComBat; CovBat; big data; diffusion MRI; gamma generalized linear model; harmonization; multi‐site analysis; structural connectome.


Conflict of interest statement

The authors declare the following potential conflicts of interest: Timothy P.L. Roberts holds stock in Prism Clinical Imaging, has a partnership interest in Proteus Neurodynamics, and has received consulting fees from Fieldline Inc. and WestCan Proton Therapy. Russell T. Shinohara has received consulting fees from Octave Bioscience and the American Medical Association. All other authors (Rui S. Shen, Drew Parker, Andrew A. Chen, Birkan Tunc, Benjamin E. Yerys, and Ragini Verma) report no conflicts of interest.

Figures

FIGURE 1
Age distributions of participants across six study sites. For each site, the top half of the violin plot represents typically developing controls (TDC), and the bottom half represents individuals with autism spectrum disorder (ASD). The width indicates the kernel density estimate of the age distribution, and the inner sticks denote individual participant ages.
FIGURE 2
Overview of the harmonization and evaluation framework for structural connectomes. After pooling MRI data from six sites and extracting structural connectomes, we created four distinct data configurations by combining the cohorts in various ways (Step 1). Different harmonization methods were applied to each data configuration for the method comparison (Step 2). We evaluated these harmonization methods from four perspectives: validation of distributional assumptions, removal of edgewise site effects, mitigation of site effects on graph properties, and preservation of biological integrity (Step 3). Finally, we demonstrated the practical utility of structural connectome harmonization methods through two use cases, highlighting their contributions to multi‐site studies (Step 4). The data configurations used in each evaluation step are indicated by their corresponding numbers.
FIGURE 3
Validation of distributional assumptions underlying different harmonization methods for the PNC site. (A) Edgewise KS distances between the observed and hypothesized distributions in PNC. Asterisks denote significant differences in KS distances between tested harmonization methods (p < 0.0001, paired t‐test). The distributional assumption underlying the MATCH framework provided a significantly better fit than those of the other harmonization models. (B) Heatmaps of edgewise KS distances, showing only the edges with significant discrepancies from the required distributional assumptions (p < 0.05, FDR‐adjusted).
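As a rough illustration of the edgewise KS distance described in this caption, the sketch below fits a gamma distribution to one edge's weights and measures the largest gap between the empirical and fitted CDFs. It makes several assumptions that are not taken from the paper: the gamma parameters are fitted by the method of moments, the edge weights are simulated, and the helper names (`gamma_cdf`, `fit_gamma_moments`, `ks_distance`) are hypothetical.

```python
import math
import random


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0
    total = 1.0
    n = 0
    # Series: sum over n of z^n / ((shape+1) * ... * (shape+n))
    while abs(term) > 1e-15 * abs(total) and n < 10_000:
        n += 1
        term *= z / (shape + n)
        total += term
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape + 1.0)
    return min(1.0, total * math.exp(log_prefactor))


def fit_gamma_moments(values):
    """Method-of-moments gamma fit (an assumption, not the paper's estimator)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean * mean / var, var / mean  # (shape, scale)


def ks_distance(values, cdf):
    """Largest absolute gap between the empirical CDF and a hypothesized CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Simulated weights for a single edge across subjects (hypothetical data).
rng = random.Random(0)
edge_weights = [rng.gammavariate(2.0, 3.0) for _ in range(2000)]
shape, scale = fit_gamma_moments(edge_weights)
distance = ks_distance(edge_weights, lambda x: gamma_cdf(x, shape, scale))
print(f"shape={shape:.2f}, scale={scale:.2f}, KS distance={distance:.3f}")
```

Repeating this per edge and per candidate distribution would give one KS distance per edge and method, which is the kind of quantity the caption compares; a smaller distance means the hypothesized distribution fits that edge better.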
FIGURE 4
Site‐related effects on mean and variance of edgewise strength in the paired PNC‐CAR cohort (Data Configuration 1). (A) MA‐plots for visualization of site differences between PNC and CAR with paired subjects. The x‐axis represents the averaged log‐transformed means across sites and the y‐axis represents between‐site differences in log‐transformed means. The horizontal line at zero indicates no site‐related effects. (B) Edgewise site effects on mean (first row) and variance (second row) connectivity strength, showing the Kruskal‐Wallis H statistics for mean site effects and F* statistics from the Brown‐Forsythe test for variance site effects. The number of edges with significant site effects was noted by e* in the top left corner of each plot.
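The MA-plot coordinates in panel (A) can be sketched as follows. This is a minimal illustration that assumes "log-transformed means" refers to the log of each site's mean edge strength and that all means are strictly positive; the function name `ma_coordinates` and the toy data are hypothetical.

```python
import math


def ma_coordinates(site_a, site_b):
    """Return one (A, M) point per edge.

    site_a, site_b: lists of edges, each edge a list of per-subject strengths.
    A is the average of the two log means (x-axis); M is their difference (y-axis).
    """
    points = []
    for edge_a, edge_b in zip(site_a, site_b):
        log_a = math.log(sum(edge_a) / len(edge_a))
        log_b = math.log(sum(edge_b) / len(edge_b))
        points.append(((log_a + log_b) / 2.0, log_a - log_b))
    return points


# Two edges measured at two sites (hypothetical values).
site_a = [[10.0, 12.0, 14.0], [5.0, 5.0, 5.0]]
site_b = [[20.0, 24.0, 28.0], [5.0, 5.0, 5.0]]
points = ma_coordinates(site_a, site_b)
for a_value, m_value in points:
    print(f"A={a_value:.3f}, M={m_value:.3f}")
```

Edges with no site difference in mean strength fall on the horizontal line M = 0, as the second toy edge does here.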
FIGURE 5
Effect size of site differences on global graph topological measures in the paired PNC‐CAR cohort (Data Configuration 1). For each harmonization method, the Cohen's d effect sizes of sites were evaluated on six global graph topological measures (global strength, intra‐ and inter‐hemisphere strength, characteristic path length, global efficiency, modularity). Significant site effects were indicated by asterisks (p < 0.05, two‐sample t‐test).
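For readers unfamiliar with the effect size used here, the following sketch computes Cohen's d between two sites for a single graph measure. It assumes the common pooled-standard-deviation form of Cohen's d; the caption does not state which variant the authors used, and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Global strength for subjects at two sites (hypothetical values).
site_a_strength = [2.0, 4.0, 6.0]
site_b_strength = [1.0, 3.0, 5.0]
d = cohens_d(site_a_strength, site_b_strength)
print(f"Cohen's d = {d:.2f}")
```

A value of d near zero after harmonization would indicate that the site difference on that measure is small relative to the between-subject spread.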
FIGURE 6
Effect size of site differences on nodal graph topological measures in the paired PNC‐CAR cohort (Data Configuration 1). Four nodewise topological measures (node strength, betweenness centrality, local efficiency and clustering coefficient) were evaluated. Significant site effects (p < 0.05, two‐sample t‐test, FDR‐adjusted) were indicated by red cross markers. The number of nodes with significant site effects was noted by n* on the right side of each plot. The dashed line marks the significance threshold. The shaded area indicates significant site effects.
FIGURE 7
Visualization of edgewise age associations before and after applying different harmonization methods. Two scenarios were tested: the paired PNC‐CAR cohort (Data Configuration 1, top two rows) and the confounded NYU‐TCD cohort (Data Configuration 2, bottom two rows). For each scenario, (A) edgewise correlations between connectivity strength and age were shown for each site before harmonization, displaying only significant age‐associated connections (p < 0.05, Spearman R). (B) The changes in edgewise correlations with age (ΔR) after each harmonization approach were shown in histograms. (C) The CAT curves visualized the concordance of edgewise age associations before and after harmonization for each method. A CAT curve closer to one indicated better preservation of age associations.
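A concordance-at-the-top (CAT) curve of the kind shown in panel (C) can be sketched as below. The sketch assumes edges are ranked by the absolute value of their age correlation and that CAT(k) is the fraction of edges shared by the two top-k lists; the paper's exact ranking criterion is not given in this caption, and the function name `cat_curve` and the correlation values are hypothetical.

```python
def cat_curve(scores_before, scores_after):
    """CAT(k) for k = 1..n: overlap of the two top-k edge sets, divided by k."""
    n = len(scores_before)
    # Rank edge indices by decreasing absolute association strength.
    order_before = sorted(range(n), key=lambda i: -abs(scores_before[i]))
    order_after = sorted(range(n), key=lambda i: -abs(scores_after[i]))
    top_before, top_after = set(), set()
    curve = []
    for k in range(1, n + 1):
        top_before.add(order_before[k - 1])
        top_after.add(order_after[k - 1])
        curve.append(len(top_before & top_after) / k)
    return curve


# Edgewise age correlations before and after harmonization (hypothetical values).
r_before = [0.9, -0.5, 0.1, 0.3]
r_after = [0.8, -0.2, 0.15, 0.6]
curve = cat_curve(r_before, r_after)
print(curve)
```

A curve that stays at one for every k means the two rankings agree at every depth, which matches the caption's reading that curves closer to one indicate better preservation of age associations.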
