Lately everyone likes to talk big picture when it comes to data science and artificial intelligence. I’m guilty of this too, since this kind of talk lends itself to blogging. However, it is easy to get carried away and forget that data science is a very detailed process that requires meticulous work. After all, no matter how much automation takes place, mistakes are always possible and oftentimes unavoidable. Even if programming bugs are easier to identify and, to some extent, even prevent, some problems may still arise, and it is the data scientist’s obligation to handle them effectively.
I’ll give an example from a recent project of mine, a PoC in the text analytics field. The idea was to develop a bunch of features from various texts and then use them to build an unsupervised learning model. Everything in the design and the core functions was smooth, even from the first draft of the code. Yet, when running one of the scripts, the computer kept running out of memory. That’s a big issue, considering that the text corpus was not huge, and the machine used to run the programs is a pretty robust system with 16GB of RAM, running Linux (so a solid 15GB of RAM is available to the programming language to utilize as needed). Still, the script would cause the system to slow down until it eventually froze (no swap partition was set up when I installed the OS, since I didn’t expect to ever run out of memory on this machine!). Of course, the problem could have been resolved by adding a swap option to the OS, but that would not be a satisfactory solution, at least not for someone who opts for writing efficient code. After all, when building a system, it is usually built to scale well, and this prototype of mine didn’t look very scalable. So, I examined the code carefully and came up with various hacks to manage resources better. I also got rid of an unnecessary array that was eating up a lot of memory, and rerouted the information flow so that other arrays could be used to provide the same result. After a couple of attempts, the system was running smoothly and without using too much RAM.
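To give a flavor of the kind of hack I mean, here is a minimal sketch of two common memory-savers in Python: streaming intermediate results with a generator instead of materializing them all at once, and explicitly freeing large objects that are no longer needed. The function and variable names below are purely hypothetical, not the actual project code.

```python
import gc

def extract_features(doc):
    # stand-in for whatever feature extraction the project actually performed
    return [len(doc), doc.count(" ")]

# Instead of materializing every document's features at once ...
# all_features = [extract_features(doc) for doc in corpus]   # keeps everything in RAM
# ... stream them one document at a time, so only one vector is alive at any moment:
def feature_stream(corpus):
    for doc in corpus:
        yield extract_features(doc)

corpus = ["a tiny example text", "another short document"]
for vec in feature_stream(corpus):
    pass  # aggregate or write each result incrementally instead of storing them all

# Large intermediate arrays that are no longer needed can be freed explicitly:
big_intermediate = list(range(1_000_000))
del big_intermediate
gc.collect()  # prompt Python to reclaim that memory right away
```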
It’s small details like these that make the difference between a data science system that is practical and one that is good only on the conceptual level (or one that requires a large cluster to run properly). Unfortunately, that’s something that is hard to learn through books, videos, or other educational material. Even conventional experience may not trigger this kind of lesson, though a good mentor can be very beneficial in such cases. The moral of the story, for me, is that we ought to continuously challenge ourselves in data science and never be content with our aptitude level. Just because something runs without errors identifiable by the language compiler doesn’t mean that it’s production-ready. Even in the case of a simple PoC, like this one, we cannot afford to lose focus. Just like data, which is constantly evolving into more and more refined information, data scientists follow a similar process, as we grow into more refined manifestations of the craft.
Are the Countries of One Hemisphere Different to Those of the Other in Terms of Income Level? A Case Study Using Stats in Python
That’s a quite interesting question and, coincidentally, one that we can answer using the Gapminder data. First we’ll need to establish the hemisphere of each country, a fairly irksome but doable task. Using the data from www.mapsofworld.com and Wikipedia (for the countries where the site failed to deliver any information), we can classify each country as belonging to the Northern or the Southern hemisphere. For countries that are more or less equally divided between the two, we just insert a missing value (denoted as “NA” in our dataset). Note that a country that crosses the equator is still classified to one hemisphere or the other if the majority of its area, or the majority of its main cities, lies on one side of it.
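As a rough sketch of how this classification might be attached to the data, here is one way to do it with a hand-built lookup; the data-frame, column names, and the handful of lookup entries below are only illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the Gapminder data-frame
data = pd.DataFrame({"country": ["Norway", "Australia", "Brazil", "Indonesia"]})

# Hand-built lookup compiled from mapsofworld.com and Wikipedia (entries here are
# only examples); countries split roughly evenly by the equator get a missing value
hemisphere_lookup = {
    "Norway": "North",
    "Australia": "South",
    "Brazil": "South",    # most of its area and main cities lie south of the equator
    "Indonesia": np.nan,  # roughly evenly divided, hence "NA"
}

data["hemisphere"] = data["country"].map(hemisphere_lookup)
print(data)
```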
As for the income level of a country, although no such variable exists in the original dataset, we can employ the same transformation as we did in the previous case study and end up with a three-level categorical variable having the values high, medium, and low. Now, on to the hypotheses. Our null hypothesis would be “there is no difference between the two hemispheres regarding the proportions of countries of high, medium, and low income level.” In other words, the proportions should be more or less the same. The alternative hypothesis would be the opposite of that, i.e. “there is a difference in the proportions of countries of high, medium, and low income level between the two hemispheres.”
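A sketch of one plausible way to recreate that three-level variable, assuming the Gapminder income variable is incomeperperson and splitting it into tertiles with pd.qcut (the exact cut points used in the previous case study are not reproduced here):

```python
import pandas as pd

# Illustrative income values standing in for Gapminder's incomeperperson variable
data = pd.DataFrame({"incomeperperson": [300.0, 1500.0, 4200.0,
                                         11000.0, 35000.0, 52000.0]})

# Split into three equal-sized groups (tertiles), labelled low / medium / high
data["income_level"] = pd.qcut(data["incomeperperson"], q=3,
                               labels=["low", "medium", "high"])
print(data)
```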
To test this we’ll employ the chi-square test since:
1. Both our variables are discrete / categorical
2. There are enough countries in all possible combinations of hemisphere and income level
For our tests we’ll use a significance level of alpha = 0.05, which is quite common for this sort of analysis.
Once we remove the missing data from the variables (in data-frame sub1), by first replacing them with NaNs, we can create the contingency table ct1 and perform the chi-square test on it. As there are 2 and 3 unique values in the two variables respectively, the contingency table will be a 2 x 3 matrix:
income_level    high    low    medium
North             57     44        45
South              6     15        16
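For reference, here is a minimal sketch of how the table and the test might be computed with pandas and scipy. The table is rebuilt by hand from the counts above so the snippet runs on its own; in the full script it would likely come from a crosstab of sub1.

```python
import pandas as pd
import scipy.stats as stats

# Contingency table rebuilt by hand from the counts above; in the full script it
# would likely come from: ct1 = pd.crosstab(sub1["hemisphere"], sub1["income_level"])
ct1 = pd.DataFrame({"high": [57, 6], "low": [44, 15], "medium": [45, 16]},
                   index=["North", "South"])

# Chi-square test of independence on the 2 x 3 table
chi2, p, dof, expected = stats.chi2_contingency(ct1)
print("Chi-square value:", chi2)
print("P-value:", p)
```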
It is clear that the Northern hemisphere has more countries, so the high counts in each income-level category may be misleading. This problem goes away if we look instead at the proportion of each income level’s countries that falls in each hemisphere:
income_level        high       low    medium
North           0.904762  0.745763  0.737705
South           0.095238  0.254237  0.262295
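Continuing from the ct1 table in the previous snippet, these column-wise proportions can be obtained by dividing each column by its total:

```python
# Column-wise proportions: the share of each income level's countries per hemisphere
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)
```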
Wow! About 90% of all rich countries (countries with income level = high) are in the Northern hemisphere, with barely 10% in the South. For the other income levels the proportions are somewhat different and more similar to each other. So, if there is a discrepancy among the proportions, it is due to the rich countries being located mainly in the Northern hemisphere. The chi-square test supports this intuition:
Chi-square value: 6.8244834439401423 (quite high)
P-value: 0.032967214150344128 (about 3%)
As p < alpha, we can safely reject the null hypothesis and confidently say that the proportions are indeed skewed. Of course, there is about a 3% chance of seeing a discrepancy this large by chance alone if the null hypothesis were true, but that’s a risk we can live with!
Before we go deeper (post hoc analysis) and examine whether the pairwise comparisons are also significant, we need to apply the Bonferroni correction to the alpha value (not to be confused with the pepperoni correction, which applies mainly to pizzas!):
Adjusted alpha = alpha / (number of comparisons)
In our case the comparisons we can make are high- to low-income countries, medium to low, and high to medium, three in total. Therefore, our new alpha is a = 0.05 / 3 ≈ 0.017 (1.7%), making the chi-square tests a bit less eager to yield significant results.
By calculating the contingency tables for each one of these comparisons, we get the following results respectively:
Comparison: High income level to Low income level
Chi-square value: 4.3468868966915135
P-value: 0.037076661090159897 (relatively low but not significant)
Comparison: Medium income level to Low income level
Chi-square value: 0.011613712922753157
P-value: 0.91418056977624251 (quite high and definitely not significant)
Comparison: High income level to Medium income level
Chi-square value: 4.8370929759550796
P-value: 0.027853811382628445 (quite low but not significant)
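A sketch of how these post hoc comparisons might be run, rebuilding the full contingency table from earlier and applying the Bonferroni-adjusted alpha; the pairing of columns is the only assumption here.

```python
import pandas as pd
import scipy.stats as stats

# Full contingency table again (see the earlier snippet)
ct1 = pd.DataFrame({"high": [57, 6], "low": [44, 15], "medium": [45, 16]},
                   index=["North", "South"])

alpha = 0.05
adjusted_alpha = alpha / 3  # Bonferroni correction for three comparisons (~0.017)

# Each post hoc test is a 2 x 2 chi-square on one pair of income-level columns;
# note that scipy applies Yates' continuity correction to 2 x 2 tables by default
pairs = [("high", "low"), ("medium", "low"), ("high", "medium")]
for a, b in pairs:
    chi2, p, dof, expected = stats.chi2_contingency(ct1[[a, b]])
    print(f"{a} vs {b}: chi-square = {chi2:.4f}, p = {p:.4f}, "
          f"significant at adjusted alpha: {p < adjusted_alpha}")
```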
So, although we obtained a pretty good result originally (rejecting the null hypothesis), the post hoc tests are not all that great, since none of the in-depth comparisons yields anything statistically significant (all their p-values are larger than the adjusted alpha). In other words, no single pair of income-level proportions lets us confidently predict the hemisphere; it is only when all three proportions are considered together that the hemispheres are clearly distinguishable.
In one of the next posts we’ll examine a more foxy way of analyzing the same data, so feel free to revisit this blog. In the meantime, you can validate these results by examining the code and the data, made available in the attached files.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.