That’s a quite interesting question and coincidentally one that we can answer using the Gapminder data. First we’ll need to establish the hemisphere of each country, a fairly irksome but doable task. Using the data from www.mapsofworld.com and Wikipedia (for the countries where the site failed to deliver any information), we can classify each country to the North or the South hemisphere. For countries that are more or less equally divided between the two, we just insert a missing value (denoted as “NA” in our dataset). Note that a country that crosses the equator is still classified to one or the other hemisphere, if the majority of its area or if the majority of its main cities are on one side of it. As for the income level of a country, although no such variable exists in the original dataset, we can employ the same transformation as we did in the previous case study and end up with a tertiary variable having the values high, medium, and low. Now, on to the hypotheses. Our null hypothesis would be “there is difference between the two hemispheres, regarding the proportions of countries of high, medium, and low income level”. In other words, the proportions should be more or less the same. The alternative hypothesis would be opposite of that, i.e. “there is a difference in the proportions of countries of high, medium, and low income level, between the two hemispheres.” To test this we’ll employ the chisquare test since: 1. Both our variables are discreet / categorical 2. There are enough countries in all possible combinations of hemisphere and income level For our tests we’ll employ the significance level of alpha = 0.05, which is quite common for this sort of analysis. Once we remove the missing data from the variables (in dataframe sub1) by first replacing them with NaNs, we can create the contingency table ct1 and perform the chi square test on it. As there are 2 and 3 unique values in the variables respectively, the contingency table will be a 2 x 3 matrix: income_level high low medium hemisphere North 57 44 45 South 6 15 16 It is clear that the North hemisphere has more countries, so the high numbers of each income level category may be misleading. This problem goes away if we take the proportions of these hemisphereincome_level combos instead: income_level high low medium hemisphere North 0.904762 0.745763 0.737705 South 0.095238 0.254237 0.262295 Wow! 90% of all rich countries (countries with income level = high) are on the North part of the globe, with only barely 10% in the South. For the other types of countries the proportions are somewhat different and more similar to each other. So, if there is a discrepancy among the proportions, it is due to the rich countries being mainly in the Northern hemisphere. The chisquare test supports this intuition: Chisquare value: 6.8244834439401423 (quite high) Pvalue: 0.032967214150344128 (about 3%) As p < alpha, we can safely reject the Null hypothesis and confidently say that the proportions are indeed skewed. Of course there is a 3% chance that we are wrong, but we can live with it! Before we go deeper (ad hoc analysis) and examine if the pairwise comparisons are also significant, we need to apply the Bonferroni correction to the alpha value (not to be confused with the pepperoni correction, which applies mainly to pizzas!): Adjusted alpha = alpha / (number of comparisons) In our case the comparison we can make are high to low income countries, medium to low, and high to medium, three in total. Therefore, our new alpha is a = 0.017 (1.7%), making the chisquare tests a bit less eager to yield significant results. By calculating the contingency tables for each one of these comparisons, we the following results respectively: Comparison: High income level to Low income level Chisquare value: 4.3468868966915135 Pvalue: 0.037076661090159897 (relatively low but not significant) Comparison: Medium income level to Low income level Chisquare value: 0.011613712922753157 Pvalue: 0.91418056977624251 (quite high and definitely not significant) Comparison: High income level to Medium income level Chisquare value: 4.8370929759550796 Pvalue: 0.027853811382628445 (quite low but not significant) So, although we obtained some pretty good results originally (rejecting the null hypothesis), the ad hocs are not all that great, since none of the indepth comparisons yield anything statistically significant (as they are all larger than the adjusted alpha). In other words, if we are given a couple of proportions for the income_level variable, we can not confidently predict the hemisphere. However, if we are given all three proportions, predicting the hemisphere is quite doable. In one of the next posts we’ll examine a more foxy way of analyzing the same data, so feel free to revisit this blog. In the meantime, you can validate these results by examining the code and the data, made available in the attached files.
0 Comments
Leave a Reply. 
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests. Archives
August 2017
Categories
All
