Behav Brain Sci. 2020 Dec 21:45:e1.
doi: 10.1017/S0140525X20001685.

The generalizability crisis

Tal Yarkoni. Behav Brain Sci.

Abstract

Most theories and hypotheses in psychology are verbal in nature, yet their evaluation overwhelmingly relies on inferential statistical procedures. The validity of the move from qualitative to quantitative analysis depends on the verbal and statistical expressions of a hypothesis being closely aligned - that is, that the two must refer to roughly the same set of hypothetical observations. Here, I argue that many applications of statistical inference in psychology fail to meet this basic condition. Focusing on the most widely used class of model in psychology - the linear mixed model - I explore the consequences of failing to statistically operationalize verbal hypotheses in a way that respects researchers' actual generalization intentions. I demonstrate that although the "random effect" formalism is used pervasively in psychology to model intersubject variability, few researchers accord the same treatment to other variables they clearly intend to generalize over (e.g., stimuli, tasks, or research sites). The under-specification of random effects imposes far stronger constraints on the generalizability of results than most researchers appreciate. Ignoring these constraints can dramatically inflate false-positive rates, and often leads researchers to draw sweeping verbal generalizations that lack a meaningful connection to the statistical quantities they are putatively based on. I argue that failure to take the alignment between verbal and statistical expressions seriously lies at the heart of many of psychology's ongoing problems (e.g., the replication crisis), and conclude with a discussion of several potential avenues for improvement.

Keywords: Generalization; inference; philosophy of science; psychology; random effects; statistics.

Conflict of interest statement

Conflict of interest. None.

Figures

Figure 1.
Consequences of mismatch between model specification and generalization intention. Each row represents a simulated Stroop experiment with n = 20 new subjects randomly drawn from the same global population (the ground truth for all parameters is constant over all experiments). Bars display the estimated Bayesian 95% highest posterior density (HPD) intervals for the (fixed) condition effect of interest in each experiment. Experiments are ordered by the magnitude of the point estimate for visual clarity. (A) The fixed-effects model specification in Eq. (1) does not account for random subject sampling, and consequently underestimates the uncertainty associated with the effect of interest. (B) The random-effects specification in Eq. (2) takes subject sampling into account, and produces appropriately calibrated uncertainty estimates.
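The contrast between panels (A) and (B) can be illustrated with a small simulation. The sketch below is not the author's code and does not fit Eq. (1) or Eq. (2) directly: it swaps the Bayesian HPD intervals for ordinary frequentist 95% intervals, and it stands in for the random-effects model with a per-subject summary of the condition effect. All parameter values (effect size, variances, trial counts) and all function names are assumptions chosen for illustration.

```python
import math
import random
import statistics

# All values below are illustrative assumptions, not taken from the paper.
N_SUBJECTS = 20       # matches the n = 20 in the Figure 1 caption
N_TRIALS = 50         # trials per condition per subject
TRUE_EFFECT = 30.0    # population-mean Stroop effect (ms)
SD_SLOPE = 50.0       # between-subject SD of the Stroop effect
SD_INTERCEPT = 50.0   # between-subject SD of baseline response time
SD_NOISE = 100.0      # trial-level noise SD
Z_CRIT = 1.96         # two-sided 95% normal critical value
T_CRIT = 2.093        # two-sided 95% t critical value for df = 19


def simulate_experiment(rng):
    """Draw 20 new subjects; return (congruent, incongruent) RTs for each."""
    subjects = []
    for _ in range(N_SUBJECTS):
        intercept = rng.gauss(600.0, SD_INTERCEPT)
        slope = rng.gauss(TRUE_EFFECT, SD_SLOPE)  # subject-specific effect
        congruent = [rng.gauss(intercept, SD_NOISE) for _ in range(N_TRIALS)]
        incongruent = [rng.gauss(intercept + slope, SD_NOISE)
                       for _ in range(N_TRIALS)]
        subjects.append((congruent, incongruent))
    return subjects


def fixed_effects_interval(subjects):
    """Pool all trials as if independent, ignoring subject sampling."""
    con = [rt for c, _ in subjects for rt in c]
    inc = [rt for _, i in subjects for rt in i]
    est = statistics.fmean(inc) - statistics.fmean(con)
    se = math.sqrt(statistics.variance(con) / len(con)
                   + statistics.variance(inc) / len(inc))
    return est - Z_CRIT * se, est + Z_CRIT * se


def subject_level_interval(subjects):
    """Treat subjects as the sampled units: one effect estimate per subject."""
    effects = [statistics.fmean(i) - statistics.fmean(c) for c, i in subjects]
    est = statistics.fmean(effects)
    se = statistics.stdev(effects) / math.sqrt(len(effects))
    return est - T_CRIT * se, est + T_CRIT * se


def coverage(n_experiments=300, seed=0):
    """Fraction of simulated experiments whose interval contains the truth."""
    rng = random.Random(seed)
    hits_fixed = hits_subject = 0
    for _ in range(n_experiments):
        subjects = simulate_experiment(rng)
        lo, hi = fixed_effects_interval(subjects)
        hits_fixed += lo <= TRUE_EFFECT <= hi
        lo, hi = subject_level_interval(subjects)
        hits_subject += lo <= TRUE_EFFECT <= hi
    return hits_fixed / n_experiments, hits_subject / n_experiments


if __name__ == "__main__":
    fixed_cov, subject_cov = coverage()
    print(f"fixed-effects coverage: {fixed_cov:.2f}")
    print(f"subject-level coverage: {subject_cov:.2f}")
```

Under these assumed settings, the pooled interval should contain the true effect noticeably less often than the nominal 95%, while the subject-level interval should stay close to it. This mirrors the caption's point that ignoring subject sampling understates uncertainty.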
Figure 2.
Effects of unmeasured variance components on the putative “verbal overshadowing” effect. Error bars display the estimated Bayesian 95% highest posterior density (HPD) intervals for the experimental effect reported in Alogna et al. (2014). Positive estimates indicate better performance in the control condition than in the experimental condition. Each row represents the estimate from the model specified in Eq. (4), with only the size of $\sigma^2_{\text{unmeasured}}$ (corresponding to $\sigma^2_{u_2}$ in Eq. (4)) varying as indicated. This parameter represents the assumed contribution of all variance components that are unmeasured in the experiment but fall conceptually within the universe of intended generalization. The top row ($\sigma^2_{u_2} = 0$) can be interpreted as a conventional model analogous to the one reported in Alogna et al. (2014); that is, it assumes that no unmeasured sources have any impact on the putative verbal overshadowing effect.

References

    1. Acosta A, Adams RB Jr., Albohn DN, Allard ES, Beek T, Benning SD, … Zwaan RA (2016). Registered replication report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928.
    2. Alogna VK, Attaya MK, Aucoin P, Bahník Š, Birch S, Birt AR, … Zwaan RA (2014). Registered replication report: Schooler and Engstler-Schooler (1990). Perspectives on Psychological Science, 9(5), 556–578.
    3. Baayen RH, Davidson DJ, & Bates DM (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.
    4. Balota DA, Yap MJ, Hutchison KA, & Cortese MJ (2012). Megastudies: What do millions (or so) of trials tell us about lexical processing? In Adelman JS (Ed.), Visual word recognition volume 1: Models and methods, orthography and phonology (pp. 90–115). Psychology Press.
    5. Baribault B, Donkin C, Little DR, Trueblood JS, Oravecz Z, van Ravenzwaaij D, … Vandekerckhove J (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences of the United States of America, 115(11), 2607–2612.