Exploring Correlations in the Gapminder Dataset. A Case Study Using Python

3/14/2016

In this post we’ll examine how Pearson’s correlation coefficient (commonly denoted r, or rho for the population value) can be used to assess the linear relationship between two variables. Just like the previous two case studies, we’ll be using Python for this (it would be too easy to do this in R or Julia, plus Python seems to be quite popular these days and is a foxy data science tool overall). We’ll be using the Gapminder dataset, since it is large enough and diverse enough to be interesting without being too complex, while it holds some meaningful information that can be insightful to people all over the globe.

After we engineer the data a bit, so that all the variables imported from the corresponding .csv file are in numeric format and all the missing values are represented accordingly (NaNs), we create two subsets, each containing a pair of variables that we wish to examine. Namely, we’ll be looking at how income per person is related to alcohol consumption, as well as how the overall employment rate and the female employment rate correlate.
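The preparation step described above could look roughly like the sketch below. It is not the attached script: the use of pd.to_numeric with errors="coerce" is an assumption about how the conversion might be done in pandas, the names sub_1 and sub_2 are hypothetical, and the tiny inline table is made-up data standing in for gapminder.csv so that the example runs on its own.

```python
import io

import pandas as pd

# Tiny made-up stand-in for gapminder.csv (illustrative values, not real data).
# Blank or whitespace-only cells mimic the missing entries in the real file.
raw_csv = io.StringIO(
    "country,incomeperperson,alcconsumption,employrate,femaleemployrate\n"
    "A,1000.5,2.1,55.0,40.2\n"
    "B, ,5.4,60.1,\n"
    "C,25000.0,,48.3,45.9\n"
    "D,7300.2,9.8, ,38.0\n"
)
data = pd.read_csv(raw_csv)

# Force the variables of interest to numeric; anything unparseable becomes NaN.
cols = ["incomeperperson", "alcconsumption", "employrate", "femaleemployrate"]
for col in cols:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# One subset per pair of variables we want to compare (hypothetical names).
sub_1 = data[["incomeperperson", "alcconsumption"]]
sub_2 = data[["employrate", "femaleemployrate"]]

print(data[cols].dtypes)
print(data[cols].isna().sum())
```

With the real file, the io.StringIO object would simply be replaced by the path to gapminder.csv.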

Once the data has been cleaned of the missing values, we apply the Pearson correlation function from the stats module of the scipy package to each of the cleaned subsets, clean_data_1 and clean_data_2. The results of the corresponding script are as follows:
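As a rough sketch of this step, assuming the function in question is scipy.stats.pearsonr (the post does not name it, but its (r, p-value) output matches) and that the missing values are dropped with DataFrame.dropna, the computation for one subset might look like this. The data frame below holds made-up values, not Gapminder figures, so the printed numbers will differ from the results reported in this post.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up stand-in for the first subset (illustrative values only).
sub_1 = pd.DataFrame({
    "incomeperperson": [1000.5, np.nan, 25000.0, 7300.2, 540.0, 12000.0],
    "alcconsumption": [2.1, 5.4, np.nan, 9.8, 1.2, 11.0],
})

# Keep only the rows where both variables are present.
clean_data_1 = sub_1.dropna()

# pearsonr returns the correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(clean_data_1["alcconsumption"],
                            clean_data_1["incomeperperson"])

print("association between alcconsumption and incomeperperson")
print((r, p_value))
print("proportion of variance explained (r squared)")
print(r ** 2)
```

Dropping rows per subset, rather than across the whole table, keeps every country for which that particular pair of variables is available.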

association between alcconsumption and incomeperperson
(0.29539248727214601, 5.9610911215677288e-05)
proportion of variance of incomeperperson explained by alcconsumption variable (or vice versa)
0.0872567215368

and

association between employrate and femaleemployrate
(0.85750004944391978, 1.1114525346943375e-52)
proportion of variance of employrate explained by femaleemployrate variable (or vice versa)
0.735306334796

In the first case we have a correlation r = 0.295, which is positive but quite weak. Nevertheless, the corresponding p-value of 5.961e-5 is very small, making this result statistically significant by any of the usual standards. From this we can conclude that there is a linear relationship between income per person and alcohol consumption, albeit a weak one. The weakness is reflected in the fact that only 8.7% of one variable’s variance can be explained by the other, quite a small proportion.

In the second case we have a correlation r = 0.858, which is also positive but much stronger. The corresponding p-value is 1.111e-52, which is really small, making the result extremely significant by even the strictest alpha threshold. Naturally, the proportion of the variance of the employment rate explained by the female employment rate (or the other way around) is quite high: 73.5%. All this shows that there is a strong linear relationship between the two variables, something we would expect, since female employment is a component of overall employment.
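The "proportion of variance explained" figures are simply the squares of the correlation coefficients. The short check below reproduces them from the r values reported in the script output; the variable names are hypothetical, chosen only for this illustration.

```python
# Correlation coefficients copied from the script output in this post.
r_alcohol_income = 0.29539248727214601
r_employment = 0.85750004944391978

# The coefficient of determination is the square of Pearson's r.
r2_alcohol_income = r_alcohol_income ** 2
r2_employment = r_employment ** 2

print(round(r2_alcohol_income, 3))
print(round(r2_employment, 3))
```

Because r is squared, the sign of the correlation is lost, which is why the output speaks of one variable explaining the other "or vice versa".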

In another post we’ll examine other ways of measuring the correlation of quantitative variables, going beyond linear relationships. In the meantime you can check out the code created for this case study and the corresponding data, in the attachments below.

assignment_3.py
File Size:	1 kb
File Type:	py

gapminder.csv
File Size:	30 kb
File Type:	csv

2 Comments

best essay writing australia

10/24/2017 12:45:52 am

Pearson correlation is considered to be the most perfect and reliable method of finding the correlation between the two variables. It is used when the researcher wants to determine the linear relationship between two or more than two variables.

Zack

10/24/2017 01:10:03 am

Pearson Correlation is a useful metric, as shown in this post (which is one of my earliest ones in this blog). However, it is definitely a good idea to explore alternative correlation metrics, as there are problems where it doesn't perform that well. In any case, it doesn't hurt to know it well and be able to use it for measuring correlation of numeric variables, like in this case study.

FOXY DATA SCIENCE
unconventional insights about data science, A.I., cybersecurity, data analytics, and more