Are the Countries of One Hemisphere Different to Those of the Other in Terms of Income Level? A Case Study Using Stats in Python
That’s a quite interesting question and coincidentally one that we can answer using the Gapminder data. First we’ll need to establish the hemisphere of each country, a fairly irksome but doable task. Using the data from www.mapsofworld.com and Wikipedia (for the countries where the site failed to deliver any information), we can classify each country to the North or the South hemisphere. For countries that are more or less equally divided between the two, we just insert a missing value (denoted as “NA” in our dataset). Note that a country that crosses the equator is still classified to one or the other hemisphere, if the majority of its area or if the majority of its main cities are on one side of it.
As for the income level of a country, although no such variable exists in the original dataset, we can employ the same transformation as we did in the previous case study and end up with a tertiary variable having the values high, medium, and low. Now, on to the hypotheses. Our null hypothesis would be “there is difference between the two hemispheres, regarding the proportions of countries of high, medium, and low income level”. In other words, the proportions should be more or less the same. The alternative hypothesis would be opposite of that, i.e. “there is a difference in the proportions of countries of high, medium, and low income level, between the two hemispheres.”
To test this we’ll employ the chi-square test since:
1. Both our variables are discreet / categorical
2. There are enough countries in all possible combinations of hemisphere and income level
For our tests we’ll employ the significance level of alpha = 0.05, which is quite common for this sort of analysis.
Once we remove the missing data from the variables (in data-frame sub1) by first replacing them with NaNs, we can create the contingency table ct1 and perform the chi square test on it. As there are 2 and 3 unique values in the variables respectively, the contingency table will be a 2 x 3 matrix:
income_level high low medium
North 57 44 45
South 6 15 16
It is clear that the North hemisphere has more countries, so the high numbers of each income level category may be misleading. This problem goes away if we take the proportions of these hemisphere-income_level combos instead:
income_level high low medium
North 0.904762 0.745763 0.737705
South 0.095238 0.254237 0.262295
Wow! 90% of all rich countries (countries with income level = high) are on the North part of the globe, with only barely 10% in the South. For the other types of countries the proportions are somewhat different and more similar to each other. So, if there is a discrepancy among the proportions, it is due to the rich countries being mainly in the Northern hemisphere. The chi-square test supports this intuition:
Chi-square value: 6.8244834439401423 (quite high)
P-value: 0.032967214150344128 (about 3%)
As p < alpha, we can safely reject the Null hypothesis and confidently say that the proportions are indeed skewed. Of course there is a 3% chance that we are wrong, but we can live with it!
Before we go deeper (ad hoc analysis) and examine if the pairwise comparisons are also significant, we need to apply the Bonferroni correction to the alpha value (not to be confused with the pepperoni correction, which applies mainly to pizzas!):
Adjusted alpha = alpha / (number of comparisons)
In our case the comparison we can make are high to low income countries, medium to low, and high to medium, three in total. Therefore, our new alpha is a = 0.017 (1.7%), making the chi-square tests a bit less eager to yield significant results.
By calculating the contingency tables for each one of these comparisons, we the following results respectively:
Comparison: High income level to Low income level
Chi-square value: 4.3468868966915135
P-value: 0.037076661090159897 (relatively low but not significant)
Comparison: Medium income level to Low income level
Chi-square value: 0.011613712922753157
P-value: 0.91418056977624251 (quite high and definitely not significant)
Comparison: High income level to Medium income level
Chi-square value: 4.8370929759550796
P-value: 0.027853811382628445 (quite low but not significant)
So, although we obtained some pretty good results originally (rejecting the null hypothesis), the ad hocs are not all that great, since none of the in-depth comparisons yield anything statistically significant (as they are all larger than the adjusted alpha). In other words, if we are given a couple of proportions for the income_level variable, we can not confidently predict the hemisphere. However, if we are given all three proportions, predicting the hemisphere is quite doable.
In one of the next posts we’ll examine a more foxy way of analyzing the same data, so feel free to revisit this blog. In the meantime, you can validate these results by examining the code and the data, made available in the attached files.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.