When statistically analyzing quantitative data, we often face paired data, or ordered pairs: values of two different variables that generally come from the same individual and are therefore linked to each other. Such data cannot be considered separately; the two values must always be considered together, such as the height and weight of a particular individual, or the weight and maximum speed of a car.
When we have paired data, statistics gives us the means to establish whether there is a relationship between the variables. This is particularly common across the sciences, especially when the behavior of one variable seems to affect or determine the behavior of another. For establishing these relationships, statistics offers two different types of tools: correlation studies between two or more variables, and the fitting of paired data to different mathematical models through a regression process.
For data that behaves linearly, a linear correlation coefficient, r, can be calculated that measures how closely the data follow a straight line. In addition, the mathematical equation of the straight line that best fits the data can be obtained through linear regression. When we do this, we get the regression coefficients in the form of the intercept of the line and its slope.
If we look at many worked examples of the linear correlation coefficient alongside the slope of the line obtained by linear regression, we will quickly notice that there is a relationship between the two values. In particular, whenever the slope is negative, the correlation coefficient is also negative; when the slope is positive, the coefficient is also positive; and when the slope is zero, so is the correlation coefficient.
In the following sections we will explore why this happens and what is the real relationship between these two statistical values that almost always go hand in hand.
Correlation and regression in statistics and science
Correlation studies provide a series of statistics, such as the correlation and determination coefficients, that make it possible to establish how correlated two or more variables are with each other. In other words, they allow us to establish what proportion of the variability of a random variable (usually quantitative) can be explained by the variability of another random variable, rather than by its own random variation. That is, they establish how well the variation of one or more variables explains the variation of another.
It should be noted that correlation studies establish only that: the correlation between two or more variables. They do not provide direct evidence of cause and effect; that is, they do not allow us to establish which of the two variables causes the variation of the other.
On the other hand, when we know (through a correlation study) or suspect that two variables are correlated in some way, we generally seek a mathematical model that represents the general behavior of one variable as a function of the other, thus allowing us to predict the value of one variable from the value of the other. This is achieved through a regression process, which calculates the coefficients of a mathematical model that minimize the differences between the observed data (the ordered pairs, or paired data) and the values predicted by the model.
Linear Correlation and Pearson’s Correlation Coefficient
The simplest case of correlation is linear correlation. This occurs when there is a linear relationship between two quantitative variables such that, when one of them increases, the other either always increases, or always decreases, at a constant rate.
Linear correlation studies are based on calculating a correlation coefficient for the data series. There are several different correlation coefficients that can be calculated, the most common of which are:
- Pearson’s linear correlation coefficient
- Spearman’s rank correlation coefficient
- Kendall’s rank correlation coefficient (tau)
Of the three, the simplest and also the most widely used is the Pearson linear correlation coefficient. This can be used when the paired data meets the following conditions:
- The relationship between the variables is linear.
- Both variables are quantitative.
- Both variables follow a normal distribution (although some authors argue that Pearson’s correlation can still be used even if the variables do not fit a Gaussian bell curve perfectly).
- The variance of the variable that is taken as the dependent variable (the one we represent on the Y axis) is constant for the different values of the independent variable (the one on the X axis).
If these conditions are met, we can calculate the Pearson correlation coefficient to determine how good the linear correlation is between both variables.
If we know the variances of both variables (σx² and σy²) and their covariance (Cov(x, y), or σxy), we can calculate the Pearson coefficient for the population (ρxy) using the following formula:
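In standard notation, with the population covariance in the numerator and the population standard deviations in the denominator, this is:

```latex
\rho_{xy} = \frac{\operatorname{Cov}(x, y)}{\sigma_x \, \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y}
```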
More commonly, however, we do not know all the data of the population, but only have a sample. In this case, we can calculate the sample Pearson correlation coefficient, which is an estimator of the population coefficient. It is calculated by means of the following formula:
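The standard sample formula, written in terms of deviations from the sample means, is:

```latex
r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}
```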
Where r is the correlation coefficient, x̅ is the sample mean of the variable x, y̅ is the sample mean of the variable y, and xᵢ and yᵢ are the individual values of each of the two variables.
Least Squares Linear Regression Fit
Linear regression is the process of fitting a paired data series to a straight line. It involves obtaining the mathematical equation of the line that best fits the data series and, therefore, minimizes the average distance between all the points and the line when both are represented in a Cartesian coordinate system.
Linear regression is almost always carried out by the method of least squares, and the result is the two parameters that define a line, namely the Y-intercept and the slope.
Regardless of whether a data series behaves linearly or not, it is always possible to obtain the equation of the line that best fits it. If we consider a variable that we take as independent, X, and another that we take as a dependent variable, Y, the equation of the line is given by:
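With a denoting the intercept and b the slope, the equation of the fitted line is:

```latex
\hat{y} = a + b\,x
```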
In this equation, the coefficients a and b are the linear regression coefficients and represent, respectively, the Y-intercept and the slope of the line. It can be shown that the coefficients that minimize the sum of the squared prediction errors (the differences between the observed values and the values estimated by the model) are given by:
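In terms of the sample means, the sample covariance sxy, and the sample variance sx², the least-squares coefficients are:

```latex
b = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}
  = \frac{s_{xy}}{s_x^2},
\qquad
a = \bar{y} - b\,\bar{x}
```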
The relationship between the slope of the linear regression line, b, and the correlation coefficient, r
Now that we are clearer about what the linear regression coefficients a and b are, and what the Pearson linear correlation coefficient r is, we are ready to understand why and how the slope b is related to r.
In fact, combining the above equation for b with the definition of the Pearson coefficient yields the mathematical relationship between these two statistics for the case of a sample of data:
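Since r = sxy / (sx·sy) implies sxy = r·sx·sy, substituting into the least-squares slope gives:

```latex
b = \frac{s_{xy}}{s_x^2} = \frac{r\, s_x\, s_y}{s_x^2} = r\,\frac{s_y}{s_x}
```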
As can be seen, since the sample standard deviations sx and sy are by definition positive (they are the positive square roots of the respective variances), their quotient is necessarily positive. For this reason, the sign of the slope, b, is determined by the sign of the correlation coefficient, r, and vice versa.
In addition, since the slope is the product of r and the quotient of the two standard deviations, in cases where the two variables show no correlation at all (that is, when r = 0), the slope of the line fitted to the data by linear regression will also be zero, as we observed previously.
This makes a lot of sense: if all the other factors affecting the dependent variable are held constant and there is no correlation between it and the independent variable, then a change in the independent variable (that is, in x) should produce no observable change in the dependent one (that is, in y). Consequently, as we move from left to right along the graph, we will observe no systematic increase or decrease in the y-values, and any variation that we do observe is due solely to the random nature of that variable.
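As a quick numerical check, the sketch below (using made-up sample data and only the Python standard library) computes the Pearson coefficient r and the least-squares slope b directly from their definitions, and confirms that b equals r multiplied by the quotient of the sample standard deviations:

```python
# Numerical check that the least-squares slope b equals r * (s_y / s_x).
# The data points are made up for illustration.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx = sum(x) / n  # sample mean of x
my = sum(y) / n  # sample mean of y

# Sums of squared deviations and cross-deviations
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

# Pearson correlation coefficient
r = sxy / sqrt(sxx * syy)

# Least-squares regression coefficients
b = sxy / sxx       # slope
a = my - b * mx     # intercept

# Sample standard deviations (the n - 1 denominators cancel in the ratio)
s_x = sqrt(sxx / (n - 1))
s_y = sqrt(syy / (n - 1))

print(round(b, 6), round(r * s_y / s_x, 6))  # the two values coincide
```

Note that the agreement is exact by algebra, not an artifact of this particular data set: the (n − 1) factors in the standard deviations cancel, leaving b = r·(s_y / s_x).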
Relationship between Pearson’s coefficient and slope in the case of population data
What has just been said about sample data applies in the same way when we have all the data of a population. The only thing that changes is that, instead of statistics (a, b, and r), in the case of the population we are dealing with parameters.
As is common in statistics, parameters are usually represented by the same letters as the corresponding statistics, but using the Greek alphabet. For this reason, the intercept and slope of the line fitted to all the population data are represented by α and β (instead of a and b), the Pearson coefficient is represented by ρ (instead of r), and the population standard deviations are represented by σ (instead of s).
Thus, the relationship between the slope and the linear correlation coefficient for the population is given by:
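In the population notation just described, this is:

```latex
\beta = \rho\,\frac{\sigma_y}{\sigma_x}
```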