Indian J Dermatol. 2016 Nov-Dec;61(6):593-601.
doi: 10.4103/0019-5154.193662.

Biostatistics Series Module 6: Correlation and Linear Regression

Avijit Hazra et al. Indian J Dermatol. 2016 Nov-Dec.

Abstract

Correlation and linear regression are the most commonly used techniques for quantifying the association between two numeric variables. Correlation quantifies the strength of the linear relationship between paired variables and expresses it as a correlation coefficient. If both variables x and y are normally distributed, we calculate Pearson's correlation coefficient (r). If the normality assumption is not met for one or both variables, a rank correlation coefficient, such as Spearman's rho (ρ), may be calculated instead. A hypothesis test of correlation assesses whether the linear relationship seen in the sample holds in the underlying population; by convention, P < 0.05 is taken as evidence that it does. A 95% confidence interval for the correlation coefficient can also be calculated to indicate the likely range of the correlation in the population. The value r², called the coefficient of determination, denotes the proportion of the variability of the dependent variable y that can be attributed to its linear relation with the independent variable x. Linear regression is a technique that attempts to link two correlated variables x and y in the form of a mathematical equation (y = a + bx), such that given the value of one variable the other may be predicted. In general, the method of least squares is applied to obtain the equation of the regression line. Correlation and linear regression analyses are based on certain assumptions about the data sets. If these assumptions are not met, misleading conclusions may be drawn. The first assumption is that of a linear relationship between the two variables. A scatter plot is essential before embarking on any correlation-regression analysis to show that this is indeed the case. Outliers or clustering within data sets can distort the value of the correlation coefficient. Finally, it is vital to remember that although strong correlation can be a pointer toward causation, the two are not synonymous.
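
The quantities named in the abstract can be illustrated with a short Python sketch. This is not code from the article: the function names and the five data points are hypothetical, only the standard library is used, and the confidence interval relies on Fisher's z transformation, which is one common approximation and is assumed here rather than taken from the source.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient r for paired samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def least_squares_line(x, y):
    """Intercept a and slope b of y = a + b*x by the method of least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via Fisher's z transformation (needs n > 3)."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical paired observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
a, b = least_squares_line(x, y)
low, high = fisher_ci(r, len(x))
print("r =", r, " r squared =", r ** 2)
print("rho =", spearman_rho(x, y))
print("regression line: y =", a, "+", b, "* x")
print("approximate 95% CI for r:", (low, high))
```

With only five points the interval is very wide, which is a reminder that the coefficient alone says little without its uncertainty, and none of these numbers replaces inspecting the scatter plot first.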

Keywords: Bland–Altman plot; Pearson's r; Spearman's rho; correlation; correlation coefficient; intraclass correlation coefficient; method of least squares; point biserial correlation coefficient; regression.

Conflict of interest statement

There are no conflicts of interest.

Figures

Figure 1. Scatter diagram depicting direct and inverse linear relationships.
Figure 2. Scatter diagram depicting a curvilinear relationship.
Figure 3. Scatter diagram depicting relationship patterns between two variables.
Figure 4. The principle of the method of least squares for linear regression. The sum of the squared "residuals" is the least for the line of best fit.
Figure 5. Examples of misleading correlations.
Figure 6. Example of a Bland–Altman plot used to compare two test methods. The bias line with the limits of agreement is provided.
Figure 7. Example of a Bland–Altman plot showing proportional bias. In this case, the difference between the methods first tends to narrow and then increases as the measured values increase.
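
The bias line and limits of agreement mentioned in the Bland–Altman captions can be computed as in the following Python sketch. It assumes the conventional definition (bias = mean of the paired differences, limits = bias ± 1.96 standard deviations of the differences); the function name and the readings are hypothetical and do not come from the article.

```python
import math

def bland_altman(method_a, method_b):
    """Return per-pair means and differences, the bias, and the
    lower and upper 95% limits of agreement."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    n = len(diffs)
    bias = sum(diffs) / n
    # Sample standard deviation of the differences (n - 1 denominator).
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    return means, diffs, bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired readings from two measurement methods.
method_a = [10, 12, 14, 16]
method_b = [9, 12, 13, 17]

means, diffs, bias, lower, upper = bland_altman(method_a, method_b)
print("bias:", bias)
print("limits of agreement:", (lower, upper))
```

Plotting `diffs` against `means` gives the Bland–Altman plot itself; a trend in that scatter, rather than a flat band around the bias line, would suggest proportional bias of the kind described for Figure 7.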
