Last updated on June 4th, 2025
The concepts of correlation and regression are fundamental in statistical analysis for describing relationships between variables. Correlation measures the strength and direction of the relationship between two variables on a scale from -1 to +1, where -1 represents a perfect negative relationship and +1 represents a perfect positive relationship. Regression, on the other hand, allows us to predict the value of a dependent variable from an independent variable. We will now understand more about correlation and regression in the topic below.
Correlation can be defined as a measurement used to quantify the relationship between variables. If an increase or decrease in one variable is accompanied by an increase or decrease in the other variable, the two are said to be in direct correlation. If one variable rises as the other falls, they are inversely correlated. A perfect direct correlation is represented by +1 and a perfect inverse correlation by -1.
Regression is a method used to show how a change in one variable affects another variable, and to model the relationship between an independent and a dependent variable. Linear regression is the most common type of regression; it fits a straight line that best establishes the relationship between two or more variables.
There are many types of correlation coefficients. Let us now see the most commonly used types of correlation coefficients:
The Pearson correlation coefficient measures the linear relationship between two continuous variables. Its value ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear correlation. This type of correlation coefficient assumes that the variables are normally distributed.
Spearman’s rank correlation coefficient measures the strength and direction of a relationship between two variables. We use this type for ranked data or when the assumption of normality is violated, since it is based on ranks rather than the actual values.
Kendall’s tau measures the rank association between two variables. We use it when the data has ties or small sample sizes. It compares the number of concordant and discordant pairs of rankings.
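The three coefficients above can be sketched in plain Python so the rank and pair logic stays visible. This is a minimal illustration on made-up data; it assumes no tied values (real libraries apply tie corrections).

```python
def pearson(x, y):
    # numerator: sum of products of deviations; denominator: product of
    # the square roots of the sums of squared deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def ranks(v):
    # rank 1 for the smallest value (assumes no ties)
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman's rho is Pearson's r applied to the ranks
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    # concordant pairs minus discordant pairs, over the total number of pairs
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

x, y = [1, 2, 3, 4, 5], [1, 2, 4, 3, 5]
print(pearson(x, y), spearman(x, y), kendall(x, y))  # 0.9 0.9 0.8
```

Note how Pearson and Spearman agree here (the data are already ranks), while Kendall’s tau, which only counts pair orderings, gives a slightly different value.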
There are several types of regression. Let us see some main types of regression mentioned below:
We use simple linear regression to model the relationship between one independent and one dependent variable using a straight line. It examines two variables and estimates how one contributes to the other by placing them on a linear equation.
Multiple regression looks at the effect of two or more independent variables on one dependent variable. It extends linear regression to multiple independent variables.
Polynomial regression captures non-linear relationships by adding polynomial terms such as squared and cubed values of the independent variable. It models the connection between the dependent and independent variables with a polynomial function.
We use logistic regression for binary classification, such as yes/no or 0/1 outcomes. It uses a logistic function instead of a straight line. We apply this type of regression where the dependent variable is qualitative.
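The logistic function mentioned above is what keeps a logistic regression’s predictions between 0 and 1. A small sketch with made-up, hypothetical coefficients:

```python
import math

def sigmoid(z):
    # the logistic (sigmoid) function maps any number into (0, 1)
    return 1 / (1 + math.exp(-z))

a, b = -4.0, 1.0  # made-up intercept and slope, for illustration only
for x in [0, 4, 8]:
    p = sigmoid(a + b * x)  # predicted probability of the "yes" class
    print(x, round(p, 3))
```

At a + bX = 0 the predicted probability is exactly 0.5; large negative or positive values of a + bX push the prediction toward 0 or 1, which is what makes the curve suitable for yes/no outcomes.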
There are several differences between correlation and regression. Let us see the differences between correlation and regression in the table below:

| Correlation | Regression |
| --- | --- |
| It measures the strength and direction of the relationship between two variables. | It models the relationship between independent and dependent variables, which allows for predictions. |
| The values lie between -1 and +1. | Regression coefficients have no fixed range. |
| The purpose is to show how strongly two variables are related. | It establishes a cause-and-effect relationship and predicts one variable based on the other. |
| We use it in statistics, economics, finance, psychology, and research. | We use it for forecasting, data modelling, risk analysis, and machine learning. |
The formulas for correlation and regression, along with an explanation of each, are given below:
The formula most commonly used for correlation is Pearson’s correlation coefficient formula:
r = Σ(X − X̄)(Y − Ȳ) / √(Σ(X − X̄)² × Σ(Y − Ȳ)²)
Where,
r = Pearson’s correlation coefficient (ranges from -1 to +1)
X, Y = individual data points for variables X and Y
X̄, Ȳ = means (averages) of X and Y
Σ = summation symbol
The numerator measures the covariance between X and Y.
The denominator is the product of the standard deviations of X and Y.
The most commonly used regression formula is the linear regression formula, which is mentioned below:
Y = a + bX
Where,
a = Intercept
b = Slope
Y = Dependent variable
X = Independent variable
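The intercept a and slope b of the least-squares line are usually computed as b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄. A short sketch on illustrative data:

```python
X = [0, 1, 2, 3]
Y = [1, 3, 5, 7]
mean_x, mean_y = sum(X) / len(X), sum(Y) / len(Y)

# slope: covariance term divided by the sum of squared X deviations
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / \
    sum((x - mean_x) ** 2 for x in X)
# intercept: the line must pass through the point of means (X̄, Ȳ)
a = mean_y - b * mean_x

print(a, b)  # 1.0 2.0
```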
Correlation and regression have many applications. Let us now see how they are used in various fields:
We use correlation in economics and finance to analyze the relationship between economic indicators.
We use regression in economics and finance to predict future trends, for example using GDP growth to forecast employment rates.
We use correlation and regression in stock markets, where correlation is used to study the relationship between stock prices, and regression is used to forecast stock prices.
We use correlation and regression in healthcare, where correlation is used to find the risk factors for diseases, and regression is used to make medical predictions.
Let us now work through some solved examples on correlation and regression:
Compute the Pearson correlation coefficient for the paired data: X = [1, 2, 3, 4, 5] and Y = [2, 4, 5, 4, 5]
r ≈ 0.775
Compute the means:
X̄ = (1 + 2 + 3 + 4 + 5)/5 = 3
Ȳ = (2 + 4 + 5 + 4 + 5)/5 = 4
Compute the deviations and their products:
| X | Y | X − X̄ | Y − Ȳ | (X − X̄)(Y − Ȳ) | (X − X̄)² | (Y − Ȳ)² |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | -2 | -2 | 4 | 4 | 4 |
| 2 | 4 | -1 | 0 | 0 | 1 | 0 |
| 3 | 5 | 0 | 1 | 0 | 0 | 1 |
| 4 | 4 | 1 | 0 | 0 | 1 | 0 |
| 5 | 5 | 2 | 1 | 2 | 4 | 1 |
Sum the products and squares:
Σ(X − X̄)(Y − Ȳ) = 4 + 0 + 0 + 0 + 2 = 6
Σ(X − X̄)² = 4 + 1 + 0 + 1 + 4 = 10
Σ(Y − Ȳ)² = 4 + 0 + 1 + 0 + 1 = 6
Apply the Pearson formula:
r = 6 / √(10 × 6) = 6/√60 = 6/7.746 ≈ 0.775
For the ranked data X = [1, 2, 3, 4, 5] and Y = [2, 3, 5, 4, 1], compute Spearman’s rank correlation coefficient.
ρ = -0.1
Assign ranks:
Since the data are already ranks, Rank X = 1, 2, 3, 4, 5 and Rank Y = 2, 3, 5, 4, 1.
Calculate the differences between the ranks of X and Y:

| Observation | Rank X | Rank Y | d = Rank X − Rank Y | d² |
| --- | --- | --- | --- | --- |
| 1 | 1 | 2 | -1 | 1 |
| 2 | 2 | 3 | -1 | 1 |
| 3 | 3 | 5 | -2 | 4 |
| 4 | 4 | 4 | 0 | 0 |
| 5 | 5 | 1 | 4 | 16 |

Sum the squared differences:
Σd² = 1 + 1 + 4 + 0 + 16 = 22
Apply Spearman’s formula:
ρ = 1 − 6Σd²/(n(n² − 1)) = 1 − (6 × 22)/(5 × (25 − 1)) = 1 − 132/120 = 1 − 1.1 = -0.1
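The rank-difference formula is easy to verify in code with the ranks from this example:

```python
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 3, 5, 4, 1]
n = len(rank_x)

# Σd²: sum of squared rank differences
d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))  # 22

# Spearman's formula: ρ = 1 − 6Σd² / (n(n² − 1))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(round(rho, 1))  # -0.1
```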
For the dataset X = [2, 4, 6, 8, 10] and Y = [20, 16, 12, 8, 4], find the Pearson correlation coefficient.
r = -1
Means:
X̄ = (2 + 4 + 6 + 8 + 10)/5 = 6
Ȳ = (20 + 16 + 12 + 8 + 4)/5 = 12
Deviations and products:
| X | Y | X − X̄ | Y − Ȳ | Product |
| --- | --- | --- | --- | --- |
| 2 | 20 | -4 | 8 | -32 |
| 4 | 16 | -2 | 4 | -8 |
| 6 | 12 | 0 | 0 | 0 |
| 8 | 8 | 2 | -4 | -8 |
| 10 | 4 | 4 | -8 | -32 |
Σ(X − X̄)(Y − Ȳ) = -32 - 8 + 0 - 8 - 32 = -80
Sums of Squares:
Σ(X − X̄)² = (-4)² + (-2)² + 0² + 2² + 4² = 16 + 4 + 0 + 4 + 16 = 40
Σ(Y − Ȳ)² = 8² + 4² + 0² + (-4)² + (-8)² = 64 + 16 + 0 + 16 + 64 = 160
Pearson’s r:
r = -80 / √(40 × 160) = -80/√6400 = -80/80 = -1
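The reason r comes out as exactly −1 is that every point lies on the same decreasing straight line, Y = 24 − 2X, which is quick to confirm:

```python
X = [2, 4, 6, 8, 10]
Y = [20, 16, 12, 8, 4]

# a perfect negative correlation means every point fits one decreasing line
print(all(y == 24 - 2 * x for x, y in zip(X, Y)))  # True
```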
Using the dataset X = [1, 2, 3, 4, 5] and Y = [2, 4, 6, 8, 10], determine the regression line and predict Y when X = 7.
Y = 2X and predicted Y(7) = 14
Find the Means:
X̄ = 3 and Ȳ = 6
Compute the slope:
Since the points lie exactly on a straight line, the slope is b = change in Y/change in X = (10 − 2)/(5 − 1) = 8/4 = 2.
Intercept:
a = Ȳ − bX̄ = 6 − 2 × 3 = 0
Prediction:
Regression equation: Y = 2X
For X = 7: Y = 2 × 7 = 14.
Given the regression equation, Y = 5 + 3X, interpret the slope and intercept.
Intercept = 5 and Slope = 3
Intercept:
When X = 0, the predicted Y is 5. This is the baseline value.
Slope:
For each unit increase in X, Y increases by 3 units.
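This interpretation is easy to see by evaluating Y = 5 + 3X at a few points: the first prediction equals the intercept, and each step in X adds the slope.

```python
def predict(x):
    # the regression equation Y = 5 + 3X from the example above
    return 5 + 3 * x

print([predict(x) for x in range(4)])  # [5, 8, 11, 14]
```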