Based on the Gapminder data, there is considerable variance in internet use rates around the globe, as well as in income per person. Could there be a significant signal there? Let's find out. Our hypothesis is: "There is a measurable difference in internet use rate in relation to a country's income level." The null hypothesis, then, is: "Internet use rate is the same across all income levels."

But first, let's define what income level is. The income-per-person variable is continuous and has several missing values. Once the latter are eliminated, we can calculate the trisection points, corresponding to the 33rd and 67th percentiles. These allow us to split the variable into three roughly equal parts, which we can call "low", "medium", and "high". The first part of our code does precisely that:

```python
# AUXILIARY STUFF
def categorize(x, m1, m2):
    """Map a numeric income value to an income-level label."""
    if np.isnan(x):
        return np.nan
    if x <= m1:
        return "low"
    if x > m2:
        return "high"
    return "medium"

# MAIN METHOD
# Initialization
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pd.read_csv('gapminder.csv', low_memory=False)

# Set variables to be numeric; empty strings become NaN
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"], errors="coerce")
data["internetuserate"] = pd.to_numeric(data["internetuserate"], errors="coerce")

# Select only the relevant columns
sub1 = data[["incomeperperson", "internetuserate"]].copy()

# Break income per person down into 3 groups
X = sub1["incomeperperson"].dropna().values
m1, m2 = np.percentile(X, [33, 67])  # trisection points of the incomeperperson variable
sub1["income_level"] = sub1["incomeperperson"].apply(categorize, args=(m1, m2))
```

Now it's
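As an aside, pandas offers a more concise way to do the same trisection: `pd.qcut` bins a series at the requested quantiles, attaches labels, and propagates missing values automatically. A minimal sketch on hypothetical income values (not the real Gapminder figures):

```python
import numpy as np
import pandas as pd

# Hypothetical income values standing in for the incomeperperson column
income = pd.Series([500.0, 1200.0, np.nan, 3000.0, 8000.0, 15000.0, 40000.0])

# Split at the 33rd and 67th percentiles; NaN values stay NaN
income_level = pd.qcut(income, q=[0, 0.33, 0.67, 1.0],
                       labels=["low", "medium", "high"])
print(income_level.tolist())
```

This produces the same three-way split as the `categorize` helper, in a single call.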
time to fit a statsmodels model to examine the relationship between our variables (internetuserate and income_level). Given the nature of the variables (a quantitative response and a categorical explanatory variable), we choose a basic one-way ANOVA, as seen in the following code:

```python
# Perform ANOVA on income_level and internetuserate to test the initial hypothesis
sub3 = sub1[['internetuserate', 'income_level']].dropna()
model0 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub3).fit()
print(model0.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     192.3
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           2.10e-45
Time:                        10:53:57   Log-Likelihood:                -764.25
No. Observations:                 183   AIC:                             1534.
Df Residuals:                     180   BIC:                             1544.
Df Model:                           2
Covariance Type:            nonrobust
=============================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                    66.1814      2.051     32.267      0.000      62.134    70.229
C(income_level)[T.low]      -55.8170      2.889    -19.322      0.000     -61.517   -50.117
C(income_level)[T.medium]   -36.4350      2.877    -12.664      0.000     -42.112   -30.758
==============================================================================
Omnibus:                        6.163   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.046   Jarque-Bera (JB):                7.413
Skew:                          -0.240   Prob(JB):                       0.0246
Kurtosis:                       3.861   Cond. No.                         3.76
==============================================================================
```

From this summary, it is clear that the F-statistic is very high (over 190), which is reflected in the p-value of the model (about 2e-45). Even at a very strict alpha threshold (e.g. 0.001), this result is significant. So there is a strong relationship between the income level of a country's citizens and their use of the internet.
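Incidentally, the `multi` import (statsmodels.stats.multicomp) is not used above, but it points at the more standard way to follow up a significant ANOVA: a Tukey HSD post hoc test, which compares all group pairs while controlling the family-wise error rate. A sketch on synthetic data (the group means roughly mimic the fitted coefficients above; the real test would take `sub3['internetuserate']` and `sub3['income_level']` instead):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Synthetic stand-in: three groups with clearly separated means
values = np.concatenate([
    rng.normal(10, 5, 40),   # "low"
    rng.normal(30, 5, 40),   # "medium"
    rng.normal(66, 5, 40),   # "high"
])
groups = np.repeat(["low", "medium", "high"], 40)

result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result.summary())  # one row per pair, with adjusted p-values
```

With group means this far apart, all three pairwise differences come out significant, which is the pattern the pairwise ANOVAs below confirm on the real data.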
But does this difference hold between every pair of income levels? The overall model cannot tell us: a significant F-test only says that at least one group mean differs from the others. To find out, we apply the ANOVA model again, this time for each pair of income_level values.

Low vs. medium (the "high" group excluded):

```python
sub = sub3[sub3['income_level'] != "high"]
model1 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub).fit()
print(model1.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.331
Model:                            OLS   Adj. R-squared:                  0.326
Method:                 Least Squares   F-statistic:                     59.99
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           3.25e-12
Time:                        11:01:08   Log-Likelihood:                -497.03
No. Observations:                 123   AIC:                             998.1
Df Residuals:                     121   BIC:                             1004.
Df Model:                           1
Covariance Type:            nonrobust
=============================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                    10.3644      1.777      5.834      0.000       6.847    13.882
C(income_level)[T.medium]    19.3820      2.502      7.746      0.000      14.428    24.336
==============================================================================
Omnibus:                        8.935   Durbin-Watson:                   1.969
Prob(Omnibus):                  0.011   Jarque-Bera (JB):                8.830
Skew:                           0.636   Prob(JB):                       0.0121
Kurtosis:                       3.324   Cond. No.                         2.63
==============================================================================
```

Low vs. high (the "medium" group excluded):

```python
sub = sub3[sub3['income_level'] != "medium"]
model2 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub).fit()
print(model2.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.765
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     388.0
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           2.94e-39
Time:                        11:01:42   Log-Likelihood:                -502.99
No. Observations:                 121   AIC:                             1010.
Df Residuals:                     119   BIC:                             1016.
Df Model:                           1
Covariance Type:            nonrobust
==========================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                    66.1814      2.012     32.893      0.000      62.197    70.165
C(income_level)[T.low]      -55.8170      2.834    -19.697      0.000     -61.428   -50.206
==============================================================================
Omnibus:                       14.498   Durbin-Watson:                   1.792
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               22.232
Skew:                          -0.580   Prob(JB):                     1.49e-05
Kurtosis:                       4.750   Cond. No.                         2.63
==============================================================================
```

Medium vs. high (the "low" group excluded):

```python
sub = sub3[sub3['income_level'] != "low"]
model3 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub).fit()
print(model3.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.511
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     125.6
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           2.18e-20
Time:                        11:02:21   Log-Likelihood:                -524.39
No. Observations:                 122   AIC:                             1053.
Df Residuals:                     120   BIC:                             1058.
Df Model:                           1
Covariance Type:            nonrobust
=============================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                    66.1814      2.317     28.558      0.000      61.593    70.770
C(income_level)[T.medium]   -36.4350      3.251    -11.208      0.000     -42.871   -29.999
==============================================================================
Omnibus:                        4.358   Durbin-Watson:                   2.021
Prob(Omnibus):                  0.113   Jarque-Bera (JB):                3.846
Skew:                          -0.419   Prob(JB):                        0.146
Kurtosis:                       3.235   Cond. No.                         2.64
==============================================================================
```

It appears that the conclusion holds between the individual income levels as well (e.g.
there is a strong difference between the internet use rate of medium-income countries and that of high-income countries). The corresponding p-values are all very low, so these results are statistically robust: it is extremely unlikely that they are due to chance. Of course, this does not mean that there is a causal relationship between the two variables.
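As an implementation note, the three pairwise fits above are written out by hand, but they can also be generated in a loop. One caveat worth making explicit: when running several pairwise tests, the alpha threshold is usually divided by the number of comparisons (a Bonferroni correction), although here the p-values are so small that the conclusion is unaffected either way. A sketch using a synthetic stand-in for the `sub3` frame built earlier:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# Hypothetical stand-in for sub3; group means loosely mimic the real fits
sub3 = pd.DataFrame({
    "income_level": np.repeat(["low", "medium", "high"], 40),
    "internetuserate": np.concatenate([rng.normal(10, 8, 40),
                                       rng.normal(30, 8, 40),
                                       rng.normal(66, 8, 40)]),
})

pairs = list(combinations(["low", "medium", "high"], 2))
alpha_adj = 0.05 / len(pairs)  # Bonferroni-adjusted threshold, about 0.0167

pvals = {}
for a, b in pairs:
    sub = sub3[sub3["income_level"].isin([a, b])]
    fit = smf.ols("internetuserate ~ C(income_level)", data=sub).fit()
    pvals[(a, b)] = fit.f_pvalue
    print(a, "vs", b, "significant at adjusted alpha:", fit.f_pvalue < alpha_adj)
```

Each iteration reproduces one of the hand-written models (model1 through model3), just with the pair selected via `isin` instead of `!=`.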
Zacharias Voulgaris, PhD: Passionate data scientist with a foxy approach to technology, particularly related to A.I.