Epidemiology. 2017 Jan;28(1):47-53.
doi: 10.1097/EDE.0000000000000554.

Collinearity and Causal Diagrams: A Lesson on the Importance of Model Specification

Enrique F Schisterman et al. Epidemiology. 2017 Jan.

Abstract

Background: Correlated data are ubiquitous in epidemiologic research, particularly in nutritional and environmental epidemiology where mixtures of factors are often studied. Our objectives are to demonstrate how highly correlated data arise in epidemiologic research and provide guidance, using a directed acyclic graph approach, on how to proceed analytically when faced with highly correlated data.

Methods: We identified three fundamental structural scenarios in which high correlation between a given variable and the exposure can arise: intermediates, confounders, and colliders. For each of these scenarios, we evaluated the consequences of increasing correlation between the given variable and the exposure on the bias and variance for the total effect of the exposure on the outcome using unadjusted and adjusted models. We derived closed-form solutions for continuous outcomes using linear regression and empirically present our findings for binary outcomes using logistic regression.
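As a hedged illustration of the kind of closed-form result the Methods describe, consider only the confounder scenario with a continuous outcome. The conventions below are assumptions that the abstract does not spell out: E and C are standardized with correlation ρ, a is the direct effect of E on D, b is the direct effect of C on D, and σ² is the residual variance of D. These are the standard ordinary least squares results for an omitted versus included covariate, not necessarily the exact expressions the authors derive.

```latex
\begin{align*}
E &= \rho C + \varepsilon_E, & \operatorname{Var}(\varepsilon_E) &= 1 - \rho^2,\\
D &= aE + bC + \varepsilon_D, & \operatorname{Var}(\varepsilon_D) &= \sigma^2,\\[4pt]
\mathbb{E}[\hat\beta_1] &= \frac{\operatorname{Cov}(D,E)}{\operatorname{Var}(E)} = a + b\rho
  & &\text{(unadjusted model, } D \sim E\text{)},\\
\mathbb{E}[\hat\beta_2] &= a
  & &\text{(adjusted model, } D \sim E + C\text{)},\\
\operatorname{Var}(\hat\beta_2) &\approx \frac{\sigma^2}{n\,(1-\rho^2)}.
\end{align*}
```

Under these assumptions the adjusted estimate stays unbiased for any ρ below 1, while its large-sample variance grows by the factor 1/(1 − ρ²) as the correlation increases.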

Results: For models properly specified, total effect estimates remained unbiased even when there was almost perfect correlation between the exposure and a given intermediate, confounder, or collider. In general, as the correlation increased, the variance of the parameter estimate for the exposure in the adjusted models increased, while in the unadjusted models, the variance increased to a lesser extent or decreased.
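The pattern described in the Results can be checked informally with a small Monte Carlo sketch for the confounder scenario with a continuous outcome. Everything here is an assumption made for illustration and is not the authors' simulation code: the function name `simulate_confounding`, the effect sizes `a` and `b`, the sample size, the number of replicates, and the use of standard normal errors.

```python
import numpy as np


def simulate_confounding(rho, a=0.5, b=0.5, n=500, n_sims=1000, seed=0):
    """Monte Carlo sketch of the confounder structure C -> E, C -> D, E -> D.

    rho : correlation between confounder C and exposure E (both standardized)
    a   : assumed true effect of E on D
    b   : assumed true effect of C on D
    Returns the mean and standard deviation of the exposure coefficient
    across replicates for the unadjusted (D ~ E) and adjusted (D ~ E + C) models.
    """
    rng = np.random.default_rng(seed)
    unadjusted = np.empty(n_sims)
    adjusted = np.empty(n_sims)
    ones = np.ones(n)

    for s in range(n_sims):
        C = rng.standard_normal(n)
        # E has unit variance and correlation rho with C.
        E = rho * C + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
        D = a * E + b * C + rng.standard_normal(n)

        # Unadjusted model: intercept + E.
        X1 = np.column_stack([ones, E])
        unadjusted[s] = np.linalg.lstsq(X1, D, rcond=None)[0][1]

        # Adjusted model: intercept + E + C.
        X2 = np.column_stack([ones, E, C])
        adjusted[s] = np.linalg.lstsq(X2, D, rcond=None)[0][1]

    return {
        "unadjusted_mean": unadjusted.mean(),
        "unadjusted_sd": unadjusted.std(ddof=1),
        "adjusted_mean": adjusted.mean(),
        "adjusted_sd": adjusted.std(ddof=1),
    }


if __name__ == "__main__":
    for rho in (0.1, 0.9, 0.99):
        result = simulate_confounding(rho)
        print(rho, {k: round(v, 3) for k, v in result.items()})
```

In this sketch the adjusted coefficient should stay centered near `a` at every correlation while its spread widens as `rho` approaches 1. The unadjusted coefficient should be centered near `a + b * rho`, so it is biased by the omitted confounder but has a much smaller spread.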

Conclusion: Our findings highlight the importance of considering the causal framework under study when specifying regression models. Strategies that do not take into consideration the causal structure may lead to biased effect estimation for the original question of interest, even under high correlation.

Conflict of interest statement

The authors have no conflicts of interest to disclose.

Figures

Figure 1
Expected value, bias, and variance of parameter estimates from linear regression models, with β1 and β2 representing the unadjusted and adjusted effect estimates of exposure on outcome, respectively. a, b, and γ represent direct effects of standardized factors; ρ represents both the correlation and the direct effect between E and I/C; n = number of observations.
Figure 2
Relative bias and standard error for parameter estimates β1 (unadjusted model) and β2 (adjusted model) with increased correlation ρ between the given variable and exposure for continuous outcomes. SE, standard error. See Figure 1 for diagrams of causal structures. All other associations between variables were held fixed (see appendix for details). Relative bias is in reference to 0 (no bias). Relative SE is calculated in reference to the SE for each parameter estimate when the correlation between the given variable and exposure is 0.1.
Figure 3
Point estimates for b (effect of C on D) and β2 (effect of E on D) for a null effect under Structure 2 (confounding) at various levels of confounding (b = 0.01, 0.1, 0.5) and correlation (top row: 0.990; bottom row: 0.999); a significant estimate for β2 represents a Type I error. SE, standard error. See Figure 1 for diagrams of causal structures. The relationship between the exposure, confounder, and confounder of the collider-outcome relationship is represented by b for Structures 1–3, respectively, and was held at two values: 0.5 (weaker) and 1.0 (stronger). All other associations between variables were held fixed (see appendix for details). Relative bias is in reference to 0 (no bias). Relative SE is calculated in reference to the SE for each parameter estimate when the correlation between the given variable and exposure is 0.1.
Figure 4
Relative bias and standard error for parameter estimates β1 (unadjusted model) and β2 (adjusted model) with increased correlation ρ between the given variable and exposure for binary outcomes.
