In this post we’ll examine how Pearson’s correlation coefficient (aka r, or rho when referring to the population value) can be used to assess the linear relationship between two variables. Just like the previous two case studies, we’ll be using Python for this (it would be too easy to do this in R or Julia, plus Python seems to be quite popular these days, and it’s a foxy data science tool overall). We’ll be using the Gapminder dataset, since it is large and diverse enough to be interesting without being too complex, and it holds meaningful information that can be insightful to people all over the globe.

After we engineer the data a bit, so that all the variables imported from the corresponding .csv file are in numeric format and all the missing values are represented accordingly (NaNs), we create two subsets, each containing a pair of variables that we wish to examine. Namely, we’ll be looking at how income per person is related to alcohol consumption, as well as how the overall employment rate and the female employment rate correlate. Once the missing values have been removed, we apply the Pearson correlation function from the stats module of the scipy package to each of the cleaned subsets, clean_data_1 and clean_data_2. The function returns the correlation coefficient followed by the p-value.

The results of the corresponding script are as follows:

association between alcconsumption and incomeperperson
(0.29539248727214601, 5.9610911215677288e-05)
proportion of variance of incomeperperson explained by alcconsumption variable (or vice versa)
0.0872567215368

and

association between employrate and femaleemployrate
(0.85750004944391978, 1.1114525346943375e-52)
proportion of variance of employrate explained by femaleemployrate variable (or vice versa)
0.735306334796

In the first case we have a correlation of r = 0.295, which is positive but quite weak. Nevertheless, the corresponding p-value of 5.961e-5 is very small, making this result statistically significant by any of the usual standards.
From this we can conclude that there is a linear relationship between income per person and alcohol consumption, but a weak one: it is unlikely to be due to chance, yet it has little explanatory power. This is reflected in the fact that only 8.7% of one variable’s variance can be explained by the other, a quite small proportion.

In the second case we have a correlation of r = 0.858, which is also positive but much stronger. The corresponding p-value is 1.111e-52, which is really small, making the result extremely significant by even the strictest alpha threshold. Naturally, the proportion of the variance of the employment rate explained by the female employment rate (or the other way around) is quite high: 73.5%. All this shows that there is a strong linear relationship between the two variables, something we would expect.

In another post we’ll examine other ways of measuring the correlation of quantitative variables, going beyond linear relationships. In the meantime you can check out the code created for this case study and the corresponding data, in the attachments below.
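For readers who want to see the arithmetic behind the numbers above, here is a minimal sketch of the same steps (convert to numeric, drop rows with missing values, compute r and r squared) using only the Python standard library. It is not the script attached to this post: the rows are made up for illustration, and the helper names to_float, clean_pairs and pearson_r are hypothetical. The actual case study relies on the pearsonr function in scipy.stats, which also returns the p-value.

```python
import math

# Made-up rows standing in for two columns read from a .csv file.
# Empty strings mark missing values. These are NOT Gapminder figures.
raw_rows = [
    ("1000", "2.0"),
    ("2000", ""),      # second value missing
    ("3000", "4.0"),
    ("", "5.0"),       # first value missing
    ("5000", "7.0"),
    ("8000", "6.0"),
]


def to_float(text):
    """Convert a text field to float; blanks and non-numeric text become NaN."""
    try:
        return float(text)
    except ValueError:
        return math.nan


def clean_pairs(rows):
    """Make both fields numeric and keep only rows without a missing value."""
    pairs = [(to_float(a), to_float(b)) for a, b in rows]
    return [(x, y) for x, y in pairs if not (math.isnan(x) or math.isnan(y))]


def pearson_r(pairs):
    """Pearson's r: co-deviation of x and y over the product of their spreads."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return co_dev / math.sqrt(ss_x * ss_y)


clean = clean_pairs(raw_rows)
r = pearson_r(clean)
r_squared = r ** 2  # proportion of variance one variable explains in the other

print("rows kept after cleaning:", len(clean))
print("association (r):", r)
print("proportion of variance explained:", r_squared)
```

Squaring r is all it takes to get the "proportion of variance explained" figure, which is why a seemingly decent r of about 0.3 translates to under 9% of the variance.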
2 Comments
10/24/2017 12:45:52 am
Pearson correlation is considered to be the most perfect and reliable method of finding the correlation between the two variables. It is used when the researcher wants to determine the linear relationship between two or more than two variables.
Pearson Correlation is a useful metric, as shown in this post (which is one of my earliest ones in this blog). However, it is definitely a good idea to explore alternative correlation metrics, as there are problems where it doesn't perform that well. In any case, it doesn't hurt to know it well and be able to use it for measuring correlation of numeric variables, like in this case study.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.