Lindley Coetzee

Naked Statistics by Charles Wheelan: Part 3 - Correlation

Introduction

Welcome to part 3 of this series. We will focus on Chapter 4: Correlation. Correlation measures the degree to which two variables are related to one another. When two variables are positively correlated, a change in variable A is associated with a change in variable B in the same direction. When two variables are negatively correlated, a positive change in variable A is associated with a negative change in variable B, or vice versa.

We will again use the same hockey data set. To better understand what is covered here, I suggest you read part 1 and part 2 of this series first.

Standard units

This was how our data looked from Part 2 - Standard deviation.

The means for height and weight are 182.04 and 83.74.

The standard deviations for height and weight are 5.87 and 8.55.

We will calculate the standard units for height and weight using the formulas below:

(height – mean) / standard deviation

(weight – mean) / standard deviation

Let us do the calculations and update the data set with the two new columns.

For clarity, let us calculate the standard units for Bobby Smith.

(height – mean) / standard deviation

(193 – 182.04)/5.87

10.96/5.87

1.87

(weight – mean) / standard deviation

(95 – 83.74)/8.55

11.26/8.55

1.32
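If you would rather check the arithmetic in code, here is a minimal sketch in Python. It only uses the means and standard deviations quoted above; the function name standard_units is my own choice for illustration.

```python
# Summary statistics quoted above (carried over from Part 2).
HEIGHT_MEAN, HEIGHT_SD = 182.04, 5.87
WEIGHT_MEAN, WEIGHT_SD = 83.74, 8.55


def standard_units(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / sd


# Bobby Smith: height 193, weight 95.
height_su = standard_units(193, HEIGHT_MEAN, HEIGHT_SD)
weight_su = standard_units(95, WEIGHT_MEAN, WEIGHT_SD)

print(round(height_su, 2), round(weight_su, 2))  # 1.87 1.32
```

Both values are positive, which tells us Bobby Smith is above the mean on height and on weight.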

Height standard units x Weight standard units

Now we need to find the product of the columns we just inserted.
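Here is a rough sketch of that step in Python. The five players below are made up for illustration and are not the hockey data, and I am assuming the population standard deviation (pstdev), which may or may not be what was used in Part 2.

```python
from statistics import mean, pstdev

# Hypothetical mini data set (NOT the hockey data from this series).
heights = [193, 180, 175, 188, 184]
weights = [95, 82, 74, 90, 79]


def to_standard_units(values):
    """Convert each value to standard units: (value - mean) / standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    return [(v - m) / sd for v in values]


height_su = to_standard_units(heights)
weight_su = to_standard_units(weights)

# One product per player: height in standard units times weight in standard units.
product_of_sus = [h * w for h, w in zip(height_su, weight_su)]

for h, w, p in zip(height_su, weight_su, product_of_sus):
    print(f"{h:6.2f} x {w:6.2f} = {p:6.2f}")
```

A product is positive when a player is above the mean on both measures or below the mean on both, and negative when the two measures fall on opposite sides of their means.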

Correlation

Now that we have the product of height in standard units and weight in standard units, we can find the correlation. I’m gonna change this column name to “Product of SUs” because “height in standard units x weight in standard units” or “Height in SU x Weight in SU” is too long for my liking.

Much better. :)

Now all we need to do is sum the Product of SUs column and divide that by the number of players.

Sum of Product of SUs: 54,568.97255

Number of players: 82,424

Correlation: 66%
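The last step is a single division, so it is easy to double check. This small Python sketch uses only the two totals quoted above; the variable names are mine.

```python
# Totals quoted above.
sum_of_products = 54_568.97255
number_of_players = 82_424

# Correlation = sum of the products of standard units / number of players.
correlation = sum_of_products / number_of_players

print(f"{correlation:.4f}")  # as a coefficient: 0.6621
print(f"{correlation:.0%}")  # as a percentage: 66%
```

The coefficient always lands between -1 and 1, so 0.66 and 66% are two ways of writing the same result.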

This correlation of 66% is not as high as the correlation of 83% found in an NBA basketball data set. You can check out that data set here.

Functions for correlation in Excel and Python

Again, after all of those calculations, there are functions in Excel and Python that calculate the correlation quickly and easily.

For Excel:

=CORREL(array1, array2)

For Python:

df.corr()
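To make that a bit more concrete, here is a short sketch using pandas. The data frame and its column names are hypothetical (five made-up players, not the hockey data). Note that df.corr() returns a whole table of correlations, one for every pair of numeric columns, so for a single pair it can be handier to call corr on one column and pass it the other.

```python
import pandas as pd

# Hypothetical mini data set; the column names are illustrative.
df = pd.DataFrame({
    "height": [193, 180, 175, 188, 184],
    "weight": [95, 82, 74, 90, 79],
})

# Correlation matrix: every numeric column paired with every other.
print(df.corr())

# Correlation for one pair of columns.
r = df["height"].corr(df["weight"])
print(round(r, 2))

# The manual method from this post, assuming population standard
# deviations (ddof=0): average the product of the standard units.
height_su = (df["height"] - df["height"].mean()) / df["height"].std(ddof=0)
weight_su = (df["weight"] - df["weight"].mean()) / df["weight"].std(ddof=0)
manual = (height_su * weight_su).mean()
print(round(manual, 2))
```

On this toy data the manual calculation and the built-in function agree, provided the standard deviations are the population kind (dividing by n rather than n - 1).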

That is all for now. See you next time.