Introduction
Welcome to part 3 of this series. We will focus on Chapter 4: Correlation. Correlation measures the degree to which two variables are related to one another. When two variables are positively correlated, a change in variable A is associated with a change in variable B in the same direction. When two variables are negatively correlated, a positive change in variable A is associated with a negative change in variable B, or vice versa.
We will again use the same hockey data set. To better understand what is covered here, I suggest you read part 1 and part 2 of this series first.
Standard units
This is how our data looked at the end of Part 2 - Standard deviation.
The means for height and weight are 182.04 and 83.74.
The standard deviations for height and weight are 5.87 and 8.55.
We will calculate the standard units for height and weight using the formulas below:
(height – mean) / standard deviation
(weight – mean) / standard deviation
Let us do the calculations and update the data set with the two new columns.
For clarity, let us calculate the standard units for Bobby Smith.
(height – mean) / standard deviation
(193 – 182.04)/5.87
10.96/5.87
1.87
(weight – mean) / standard deviation
(95 – 83.74)/8.55
11.26/8.55
1.32
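The same arithmetic can be sketched in a few lines of Python. This is just a minimal illustration: the function name standard_units and the constant names are my own choices, while the means, standard deviations and Bobby Smith's height and weight are the figures quoted above.

```python
def standard_units(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / sd

# Means and standard deviations quoted above (from Part 2).
HEIGHT_MEAN, HEIGHT_SD = 182.04, 5.87
WEIGHT_MEAN, WEIGHT_SD = 83.74, 8.55

# Bobby Smith: height 193, weight 95.
height_su = standard_units(193, HEIGHT_MEAN, HEIGHT_SD)
weight_su = standard_units(95, WEIGHT_MEAN, WEIGHT_SD)

print(round(height_su, 2), round(weight_su, 2))  # prints 1.87 1.32
```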
Height standard units x Weight standard units
Now we need to find the product of the columns we just inserted.
Correlation
Now that we have the product of height in standard units and weight in standard units, we can find the correlation. I’m gonna change this column name to “Product of SUs” because “height in standard unit x weight in standard unit” or “Height in SU x Weight in SU” is too long for my liking.
Much better. :)
Now all we need to do is sum the Product of SUs column and divide that by the number of players.
Sum of Product of SUs: 54,568.97255
Number of players: 82,424
Correlation: 66%
This correlation of 66% is not as high as the correlation of 83% found on an NBA basketball data set. You can check out that data set here.
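To see the whole recipe in one place, here is a rough Python sketch: convert both columns to standard units, multiply them row by row, then average the products. The heights and weights in it are made-up sample values, not the hockey data, and the function names are my own. It assumes the population standard deviation (dividing by n), which is what makes the average of the products equal the correlation. The last lines simply redo the division from the totals above.

```python
from statistics import fmean, pstdev

def to_standard_units(values):
    """Convert a list of numbers to standard units (population SD assumed)."""
    mean = fmean(values)
    sd = pstdev(values)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Average of the row-by-row products of the two lists in standard units."""
    products = [zx * zy
                for zx, zy in zip(to_standard_units(xs), to_standard_units(ys))]
    return sum(products) / len(products)

# Hypothetical sample values, NOT taken from the hockey data set.
heights = [170, 175, 180, 185, 190]
weights = [70, 80, 78, 90, 95]
r = correlation(heights, weights)
print(round(r, 2))

# The article's own totals: sum of Product of SUs / number of players.
article_r = 54568.97255 / 82424
print(round(article_r, 2))  # prints 0.66
```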
Function for correlation in Excel and Python
Again, after all of those calculations, there are functions in Excel and Python that calculate the correlation quickly and easily.
For Excel:
=CORREL(array1, array2)
For Python:
df.corr()
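As a rough illustration of the pandas call, the sketch below builds a small DataFrame and asks for the correlation. The column names and values are made up for the example, not the hockey data. DataFrame.corr() returns a matrix of pairwise Pearson correlations, while Series.corr() gives the single number for one pair of columns.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "height": [170, 175, 180, 185, 190],
    "weight": [70, 80, 78, 90, 95],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = df.corr()
print(corr_matrix)

# Correlation for one specific pair of columns.
r = df["height"].corr(df["weight"])
print(round(r, 2))
```

The diagonal of the matrix is always 1, since every column is perfectly correlated with itself. The off-diagonal entry is the height and weight correlation.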
That is all for now. See you next time.