Introduction to correlation coefficient

Correlation is a statistical measure of how strongly two variables are related to each other, that is, how closely a change in one variable goes along with a change in the other.

Correlation helps us understand the relationships among variables and decide which variables to consider for analysis, and so helps us draw insights from the data.

Example

For example, let us take two variables: ‘stock price of a firm’ and ‘quarterly performance of the firm’. In general, we observe that if the firm performs well (in a particular quarter or year), its stock price goes up, and if the performance is poor, the stock price falls. This is a positive correlation. Similarly, if we consider ‘supply of goods in the market’ and ‘prices of goods’, we can observe that when there is an excess supply of goods, prices tend to fall. This is a negative correlation.

Corra6.jpg

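Positive correlation does not necessarily mean that every time one variable goes up, the other goes up as well; it means that the two tend to move in the same direction. Similarly, negative correlation does not mean that every time one variable goes down, the other goes up; it means that the two tend to move in opposite directions.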

Corra5.jpg

The degree of correlation tells us how strong or how weak the relationship is.

Mathematically!

Now that we have seen the concept of correlation, let us look at it mathematically. Correlation varies by degree, and this degree is expressed as the correlation coefficient.

Correlation3
Correlation coefficient

When we say x and y move in the same direction, we mean that, in general, y moves in the same direction as x. This may not be true for each and every (x, y) pair in the given dataset.

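The correlation coefficient lies between -1 and +1. A value of -1 indicates perfect negative correlation, while +1 indicates perfect positive correlation. A value of 0 indicates no linear correlation: variation in one variable is not linearly associated with variation in the other. This does not by itself prove that the two variables are completely independent, since they may still be related in a non-linear way. For example, if x and y are two variables and y remains constant for varied values of x, then y does not respond to x at all. This is usually pictured as zero correlation, although strictly the coefficient is undefined in that case because y has zero variance.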

Correlation2
Examples of +1, 0 and -1 correlation datasets
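The three cases can also be checked numerically. The following is a minimal sketch in Python; the helper name `pearson` and the small datasets are made up for illustration and are not the data behind the figure. The last dataset is a reminder that a coefficient of 0 only rules out a linear relationship.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Division by zero here means one variable is constant (coefficient undefined).
    return cov / math.sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]                         # illustrative values only
r_pos = pearson(x, [2 * v + 1 for v in x])    # y rises linearly with x
r_neg = pearson(x, [-3 * v + 4 for v in x])   # y falls linearly with x
r_zero = pearson(x, [v ** 2 for v in x])      # y = x^2 depends on x, yet r is 0

print(r_pos, r_neg, r_zero)
```

In the third case y is completely determined by x, but because the relationship is symmetric rather than linear, the coefficient comes out as 0.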

Impact of Outliers on correlation coefficient

If two variables x and y are positively correlated, and the variables y and z are positively correlated, can we then say that x and z are positively correlated? Put another way, does the transitive property hold for correlation?

The transitive property does NOT hold for correlation. One simple example that explains this is presented below; it also shows the impact of outliers on correlation.

Correlation5
Transitive property is not true for correlation
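A small numeric counterexample makes the same point. This is a sketch with hand-picked values (not the dataset in the figure): x and z are constructed to be uncorrelated, and y is simply their sum, so y is positively correlated with both.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data chosen so that x and z are exactly uncorrelated.
x = [1, 1, -1, -1]
z = [1, -1, 1, -1]
y = [a + b for a, b in zip(x, z)]   # y = x + z, i.e. [2, 0, 0, -2]

r_xy = pearson(x, y)   # about 0.71: x and y are positively correlated
r_yz = pearson(y, z)   # about 0.71: y and z are positively correlated
r_xz = pearson(x, z)   # 0: x and z are not correlated at all

print(round(r_xy, 2), round(r_yz, 2), round(r_xz, 2))
```

Both pairs that share y show a clear positive correlation, yet x and z show none, so positive correlation cannot be chained from one pair to the next.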

Types of correlation

Two commonly used types of correlation are Pearson correlation and Spearman rank correlation.

Pearson correlation measures the degree of linear correlation between two variables. Spearman rank correlation measures the correlation between the ranks of two variables and does not assume a linear relationship between them: each variable is ranked individually, and the ranks are then used to calculate the correlation coefficient. As a result, Pearson correlation applies only to variables measured on an interval or ratio scale, while Spearman correlation applies to ordinal, interval and ratio scales. (Refer to Types of variables and measurement scales to understand nominal, ordinal, interval and ratio scales.)

The impact of outliers is reduced in Spearman rank correlation, because an outlier contributes only its rank in the sequence and not its actual value.

Correlation6.jpg
Pearson correlation and Spearman rank correlation for a given dataset
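To see the difference on a small scale, here is a rough sketch that computes both coefficients for a made-up dataset with one extreme value (the data and the helper names `pearson`, `average_ranks` and `spearman` are assumptions for illustration, not the dataset in the figure). Spearman is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Illustrative data: y always increases with x, but the last value is an outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 100]

print(round(pearson(x, y), 2))    # pulled well below 1 by the outlier
print(round(spearman(x, y), 2))   # the ranks agree perfectly
```

Because the outlier is simply the largest rank, Spearman still reports a perfect monotonic relationship, while Pearson is dragged down by the actual magnitude of the last value.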

Interpretation and inference of correlation coefficient

Let us look at how to infer correlation from scatter-plot of two variables x and y.

Correlation4

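If the plot is a horizontal straight line, that is, y stays the same whatever the value of x, then x and y have zero correlation. A straight line that slopes upward or downward instead indicates perfect positive or perfect negative correlation.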

‘Correlation does not mean causation’. That is, if two variables are highly correlated, we CANNOT infer that one depends on the other or that one is the cause of the other. For example, if x and y have a high positive correlation, it does NOT mean that x is the cause of y, and it also does NOT mean that x is the sole influencer of y. There could be several other variables apart from x that influence the outcome (y). Hence, caution must be exercised while interpreting correlation coefficients.

Manual calculation of correlation coefficient

Correlation is readily available as a built-in function in most statistical software; however, it is good to understand how it is calculated manually.

Consider two variables x and y for which we want to calculate the correlation; the calculation is shown below.

Corr. Coeff = [ n*(∑XY) - (∑X)*(∑Y) ] / SQRT[ (n*∑X^2 - (∑X)^2) * (n*∑Y^2 - (∑Y)^2) ]

Here n is the number of (X, Y) pairs.

X     Y     XY    X^2   Y^2
8     3     24    64    9
9     5     45    81    25
7     6     42    49    36
5     1     5     25    1
3     2     6     9     4
2     9     18    4     81
1     7     7     1     49
5     11    55    25    121
6     5     30    36    25

∑X    ∑Y    ∑XY   ∑X^2  ∑Y^2
46    49    232   294   351

Corr. Coeff = [ 9*232 - 46*49 ] / SQRT[ (9*294 - (46)^2) * (9*351 - (49)^2) ]
            = (2088 - 2254) / SQRT[ 530 * 758 ]
            = -166 / 633.83
            = -0.26
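The manual calculation above can be reproduced with a few lines of Python. This is only a sketch that mirrors the column sums in the table; the variable names are chosen here for readability.

```python
import math

# The nine (X, Y) pairs from the table above.
xs = [8, 9, 7, 5, 3, 2, 1, 5, 6]
ys = [3, 5, 6, 1, 2, 9, 7, 11, 5]

n = len(xs)
sum_x = sum(xs)                                # ∑X
sum_y = sum(ys)                                # ∑Y
sum_xy = sum(x * y for x, y in zip(xs, ys))    # ∑XY
sum_x2 = sum(x * x for x in xs)                # ∑X^2
sum_y2 = sum(y * y for y in ys)                # ∑Y^2

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)    # 46 49 232 294 351
print(round(r, 2))                             # -0.26
```

A coefficient of about -0.26 indicates a weak negative correlation between X and Y in this dataset.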