PLoS Comput Biol. 2021 Sep 16;17(9):e1009279. doi: 10.1371/journal.pcbi.1009279. eCollection 2021 Sep.

Eliminating accidental deviations to minimize generalization error and maximize replicability: Applications in connectomics and genomics



Eric W Bridgeford et al. PLoS Comput Biol.

Abstract

Replicability, the ability to replicate scientific findings, is a prerequisite for scientific discovery and clinical utility. Troublingly, we are in the midst of a replicability crisis. A key to replicability is that multiple measurements of the same item (e.g., experimental sample or clinical participant) under fixed experimental constraints are relatively similar to one another. Thus, statistics that quantify the relative contributions of accidental deviations (such as measurement error) as compared to systematic deviations (such as individual differences) are critical. We demonstrate that existing replicability statistics, such as intra-class correlation coefficient and fingerprinting, fail to adequately differentiate between accidental and systematic deviations in very simple settings. We therefore propose a novel statistic, discriminability, which quantifies the degree to which an individual's samples are relatively similar to one another, without restricting the data to be univariate, Gaussian, or even Euclidean. Using this statistic, we introduce the possibility of optimizing experimental design via increasing discriminability and prove that optimizing discriminability improves performance bounds in subsequent inference tasks. In extensive simulated and real datasets (focusing on brain imaging and demonstrating on genomics), only optimizing data discriminability improves performance on all subsequent inference tasks for each dataset. We therefore suggest that designing experiments and analyses to optimize discriminability may be a crucial step in solving the replicability crisis, and more generally, mitigating accidental measurement error.
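Concretely, Discr can be estimated from pairwise distances: for every pair of measurements of the same item, count the fraction of measurements from other items that are farther away, then average those fractions. Below is a minimal sketch of that estimator; the function name and the Euclidean metric are choices made here for illustration (the statistic itself does not require Euclidean data), and ties are broken by strict inequality for simplicity. A maintained implementation is available in the authors' hyppo Python package.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def discriminability(X, labels):
    """Estimate Discr from measurements X (N x d) and item labels (N,).

    For each measurement i and each other measurement j of the same item,
    compute the fraction of measurements from *other* items that are
    farther from i than j is; Discr is the average of these fractions.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = squareform(pdist(X))  # Euclidean distance matrix (illustrative choice)
    fractions = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False                      # exclude the self-distance
        other = labels != labels[i]
        for j in np.where(same)[0]:
            # fraction of cross-item distances exceeding the within-item one
            fractions.append(np.mean(D[i, other] > D[i, j]))
    return float(np.mean(fractions))
```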


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Discr provides a valid discriminability statistic.
Three simulations with characteristic notions of discriminability are constructed, each with n = 10 items and s = 2 measurements per item. (A) The 20 samples, where color indicates the item associated with each measurement. (B) The distance matrices between pairs of measurements, organized by item. For each row (measurement), green boxes indicate measurements of the same item, and an orange box indicates a measurement from a different item that is more similar to that measurement than the corresponding measurement from the same item. (C) Comparison of four replicability statistics in each simulation. Row (i): Each item is most similar to a repeated measurement from the same item. All four statistics are high. Row (ii): Measurements from the same item are more similar on average than measurements from different items, but each item has a measurement from a different item in between. ICC is essentially unchanged from (i), even though observations from the same item are less similar than in (i), and both Fingerprint and Kernel drop by about an order of magnitude relative to simulation (i). Row (iii): Two of the ten items have an "outlier" measurement; the simulation is otherwise identical to (i). ICC is negative, and Kernel yields a small value. Discr is the only statistic that is robust and valid across all of these simulated examples.
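As a toy analogue of simulation (i), the snippet below draws 10 items with 2 univariate measurements each, where between-item spread dwarfs within-item noise, and compares one-way ICC against the discriminability sketch above. The parameters are illustrative only; the paper's actual simulation settings are given in S4 Text.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 items, 2 measurements each: large systematic (between-item) deviation,
# small accidental (within-item) deviation, as in row (i).
n_items, n_reps = 10, 2
item_means = rng.normal(0.0, 4.0, size=n_items)
groups = item_means[:, None] + rng.normal(0.0, 0.2, size=(n_items, n_reps))
X = groups.reshape(-1, 1)
labels = np.repeat(np.arange(n_items), n_reps)

# One-way ICC(1): between-item vs. within-item mean squares.
ms_between = n_reps * np.var(groups.mean(axis=1), ddof=1)
ms_within = np.mean(np.var(groups, axis=1, ddof=1))
icc = (ms_between - ms_within) / (ms_between + (n_reps - 1) * ms_within)

print(f"ICC = {icc:.3f}, Discr = {discriminability(X, labels):.3f}")
# Both should be high here; rows (ii) and (iii) are built by adding
# interleaving or outlier measurements, which depress ICC but not Discr.
```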
Fig 2. Multivariate simulations demonstrate the value of optimizing replicability for experimental design.
All simulations are two-dimensional, with 128 samples and 500 iterations per setting (see S4 Text for details). (A) For each setting, class label is indicated by shape, and color indicates item identity. (B) Euclidean distance matrix between samples within each simulation setting, organized by item. Settings in which items are discriminable tend to show a block structure in which samples from the same item are relatively similar to one another. (C) Replicability statistic versus variance. Here we can compute the Bayes accuracy (the best achievable accuracy for predicting the class label) as a function of variance. Discr and Kernel are mostly monotonic in the within-item variance across all settings, suggesting that improved Discr predicts improved performance. (D) Test of whether data are discriminable. Discr typically achieves high power relative to the alternative statistics in all cases. (E) Comparison test of which approach is more discriminable. Discr is the only statistic that achieves high power in every setting in which any statistic achieves high power.
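Panel (D)'s one-sample test can be implemented by permutation: shuffling item labels destroys the within-item structure, so the observed Discr is compared against a label-permuted null. A minimal sketch, reusing the discriminability function above; the published test (DiscrimOneSample in hyppo) is the reference version.

```python
import numpy as np

def discr_one_sample_test(X, labels, n_perms=1000, seed=0):
    """Permutation p-value for H0: the data are not discriminable.

    Item labels are shuffled across measurements; under H0 the observed
    Discr is typical of the resulting null distribution.
    """
    rng = np.random.default_rng(seed)
    observed = discriminability(X, labels)
    null = np.array([discriminability(X, rng.permutation(labels))
                     for _ in range(n_perms)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perms)
    return observed, p_value
```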
Fig 3. Different analysis strategies yield widely disparate stabilities.
(A) Illustration of analysis options for the 192 fMRI pipelines under consideration (described in S6 Text). The sequence of options corresponding to the best performing pipeline overall is shown in green. (B) Discr of fMRI connectomes analyzed using 64 different pipelines. Functional correlation matrices are estimated from 28 multi-session studies in the CoRR dataset using each pipeline. The analysis strategy codes are assigned sequentially according to the abbreviations listed for each step in (A). The mean Discr per pipeline is a weighted sum of its stabilities across datasets. Each pipeline is compared to the optimal pipeline with the highest mean Discr, FNNNCP, using the comparison hypothesis test described above. The remaining strategies are arranged by p-value, indicated in the top row.
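For the comparison test referenced here, both pipelines process the same underlying scans, so a natural permutation null swaps each measurement between the two pipelines. The sketch below makes that swapping assumption explicit as a simplification; the published two-sample test (DiscrimTwoSample in hyppo) is the authoritative procedure.

```python
import numpy as np

def discr_two_sample_test(X1, X2, labels, n_perms=1000, seed=0):
    """Permutation p-value for H0: pipeline 1 is no more discriminable
    than pipeline 2, where X1 and X2 are the same measurements processed
    two different ways (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    observed = discriminability(X1, labels) - discriminability(X2, labels)
    null = []
    for _ in range(n_perms):
        swap = rng.random(len(labels)) < 0.5   # per-measurement pipeline swap
        A = np.where(swap[:, None], X2, X1)
        B = np.where(swap[:, None], X1, X2)
        null.append(discriminability(A, labels) - discriminability(B, labels))
    p_value = (1 + np.sum(np.asarray(null) >= observed)) / (1 + n_perms)
    return observed, p_value
```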
Fig 4. Parsing the relative impact on Discr of various acquisition and analytic choices.
(A) Pipelines are aggregated over a given analysis step, and pairwise comparisons are made with the remaining analysis options held fixed. The beeswarm plot shows the difference between the overall best performing option and the second best option for each step (mean in red), with all other options held equal; the x-axis label indicates the best performing strategy. The best strategies are FNIRT, no frequency filtering, no scrubbing, global signal regression, the CC200 parcellation, and the ranks edge transformation. A Wilcoxon signed-rank test is used to determine whether the mean for the best strategy exceeds that of the second best: a * indicates that the p-value is at most 0.001 after Bonferroni correction. Of the best options, only no scrubbing is not significantly better than the alternative strategies. Note that the marginally best-performing options are not significantly different from the best performing strategy overall, shown in Fig 3. (B) A comparison of the stabilities for the 4 datasets with both fMRI and dMRI connectomes. dMRI connectomes tend to be more discriminable, in 14 of 20 total comparisons. Color and point size correspond to the study and number of scans, respectively (see Fig 3B). (C.i) Comparing raw edge weights (Raw), ranked edge weights (Rank), and log-transformed edge weights (Log) for the diffusion connectomes, the Log- and Rank-transformed edge weights tend to show higher Discr than Raw. (C.ii) As the number of ROIs increases, Discr tends to increase.
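The per-step significance test in (A) is a one-sided Wilcoxon signed-rank test on paired per-pipeline Discr values, Bonferroni-corrected. A sketch with scipy; the pairing of pipelines and the number of corrected tests are assumed to be supplied by the caller.

```python
import numpy as np
from scipy.stats import wilcoxon

def best_vs_second_best(discr_best, discr_second, n_tests, alpha=0.001):
    """One-sided Wilcoxon signed-rank test that the best option for an
    analysis step exceeds the second best, with all other options held
    fixed (values paired by pipeline), Bonferroni-corrected over n_tests."""
    stat, p = wilcoxon(discr_best, discr_second, alternative="greater")
    p_corrected = min(1.0, p * n_tests)  # Bonferroni correction
    return p_corrected, p_corrected <= alpha
```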
Fig 5. Optimizing Discr improves downstream inference performance.
Using the connectomes from the 64 pipelines with raw edge weights, we examine the relationship between connectomes and sex and age. The columns evaluate different approaches for computing pipeline effectiveness: (i) Discr, (ii) PICC, (iii) the average fingerprint index (Fingerprint), (iv) I2C2, and (v) Kernel. Each panel shows the reference pipeline replicability estimate (x-axis) versus the effect size of the association between the data and the sex, age, or cancer status of the individual, as measured by DCorr (y-axis). Both the x and y axes are normalized by the minimum and maximum statistic. These data are summarized by a single line per study: the regression of the normalized effect size onto the normalized replicability estimate as quantified by the indicated reference statistic. (I) The results for the neuroimaging data, as described in Section 3.4. Color and line width correspond to the study and number of scans, respectively (see Fig 3B). The solid black line is the weighted mean over all studies. Discr is the only statistic for which nearly all slopes are positive. Moreover, the corrected p-value [51, 52] is significant across most datasets for both covariates (39/44 p-values < .001). This indicates that pipelines with higher Discr correspond to larger effect sizes for the covariate of interest, and that this relationship is stronger for Discr than for the other statistics. (II) A similar experiment is performed on two genomics datasets, measuring the effects due to sex and whether an individual has cancer. (III) indicates the fraction of datasets with positive slopes and with significantly positive slopes, ranging from 0 ("None", red) to 1 ("All", green), at both the task and aggregate level. Discr is the statistic for which the most datasets have positive slopes, and for which the most datasets have significantly positive slopes, across the neuroimaging and genomics datasets considered. S6 Text details the methodologies employed.
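Each per-study line in panels (I) and (II) reduces to a single slope: the normalized effect size regressed onto the normalized replicability statistic. The sketch below shows that summary under the stated min-max normalization; the function name is assumed, and DCorr effect sizes would come from an independence test such as hyppo's Dcorr.

```python
import numpy as np

def normalized_slope(replicability, effect_size):
    """Slope of min-max-normalized effect size (e.g., DCorr vs. sex or age)
    regressed on the min-max-normalized replicability statistic, computed
    across pipelines within one study."""
    def minmax(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min())
    x, y = minmax(replicability), minmax(effect_size)
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)  # positive: higher replicability, larger effects
```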

References

    1. Spearman C. The Proof and Measurement of Association between Two Things. Am J Psychol. 1904 Jan;15(1):72. doi: 10.2307/1412159
    2. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010 Oct;11(10):733–9. doi: 10.1038/nrg2825
    3. Leek JT, Peng RD. Statistics: P values are just the tip of the iceberg. Nature. 2015 Apr;520(7549):612. doi: 10.1038/520612a
    4. National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science; 2019.
    5. Goodman SN, Fanelli D, Ioannidis JPA. What does research reproducibility mean? Sci Transl Med. 2016 Jun;8(341):341ps12. doi: 10.1126/scitranslmed.aaf5027
