Based on the Gapminder data, there is considerable variance in internet use rates around the globe, as well as in income per person. Could there be a significant signal there? Let's find out. Our hypothesis is: "There is a measurable difference in internet use rate in relation to a country's income level." The null hypothesis, then, is: "Internet use rate is the same across all income levels."

But first, let's define what income level is. The income-per-person variable is continuous and has several missing values. Once the latter are eliminated, we can calculate the trisection points, corresponding to the 33rd and 67th percentiles. These allow us to split the variable into three roughly equal parts, which we can call "low", "medium", and "high". The first part of our code does precisely that:

```python
# AUXILIARY STUFF
def categorize(x, m1, m2):
    """Map a numeric income value to an income-level label."""
    if np.isnan(x):
        return np.nan
    if x <= m1:
        return "low"
    if x > m2:
        return "high"
    return "medium"

# MAIN METHOD
# Initialization
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pd.read_csv('gapminder.csv', low_memory=False)

# Set variables to be numeric; empty strings become NaN
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"], errors="coerce")
data["internetuserate"] = pd.to_numeric(data["internetuserate"], errors="coerce")

# Select only the relevant columns
sub1 = data[["incomeperperson", "internetuserate"]].copy()

# Break income per person down into 3 groups
X = sub1["incomeperperson"].dropna().values
m1, m2 = np.percentile(X, [33, 67])  # trisection points of the incomeperperson variable
sub1["income_level"] = sub1["incomeperperson"].apply(categorize, args=(m1, m2))
```

Now it's
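As an aside, pandas offers a more concise way to do the same trisection: `pd.qcut` bins a series at the requested quantiles, attaches labels, and propagates missing values automatically. A minimal sketch on hypothetical income values (not the real Gapminder figures):

```python
import numpy as np
import pandas as pd

# Hypothetical income values standing in for the incomeperperson column
income = pd.Series([500.0, 1200.0, np.nan, 3000.0, 8000.0, 15000.0, 40000.0])

# Split at the 33rd and 67th percentiles; NaN values stay NaN
income_level = pd.qcut(income, q=[0, 0.33, 0.67, 1.0],
                       labels=["low", "medium", "high"])
print(income_level.tolist())
```

This produces the same three-way split as the `categorize` helper, in a single call.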
time to fit a statsmodels model to examine the relationship between our variables (internetuserate and income_level). Given the nature of the variables (a quantitative response and a categorical explanatory variable), we choose a basic one-way ANOVA, as seen in the following code:

```python
# Perform ANOVA on income_level and internetuserate to test the initial hypothesis
sub3 = sub1[['internetuserate', 'income_level']].dropna()
model0 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub3).fit()
print(model0.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     192.3
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           2.10e-45
Time:                        10:53:57   Log-Likelihood:                -764.25
No. Observations:                 183   AIC:                             1534.
Df Residuals:                     180   BIC:                             1544.
Df Model:                           2
Covariance Type:            nonrobust
=============================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                    66.1814      2.051     32.267      0.000      62.134    70.229
C(income_level)[T.low]      -55.8170      2.889    -19.322      0.000     -61.517   -50.117
C(income_level)[T.medium]   -36.4350      2.877    -12.664      0.000     -42.112   -30.758
==============================================================================
Omnibus:                        6.163   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.046   Jarque-Bera (JB):                7.413
Skew:                          -0.240   Prob(JB):                       0.0246
Kurtosis:                       3.861   Cond. No.                         3.76
==============================================================================
```

From this summary, it is clear that the F-statistic is very high (over 190), which is reflected in the p-value of the model (about 2e-45). Even at a very strict alpha threshold (e.g. 0.001), this result is significant. So there is a strong relationship between the income level of a country's citizens and their use of the internet.
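Incidentally, the `multi` import (statsmodels.stats.multicomp) is not used above, but it points at the more standard way to follow up a significant ANOVA: a Tukey HSD post hoc test, which compares all group pairs while controlling the family-wise error rate. A sketch on synthetic data (the group means roughly mimic the fitted coefficients above; the real test would take `sub3['internetuserate']` and `sub3['income_level']` instead):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Synthetic stand-in: three groups with clearly separated means
values = np.concatenate([
    rng.normal(10, 5, 40),   # "low"
    rng.normal(30, 5, 40),   # "medium"
    rng.normal(66, 5, 40),   # "high"
])
groups = np.repeat(["low", "medium", "high"], 40)

result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result.summary())  # one row per pair, with adjusted p-values
```

With group means this far apart, all three pairwise differences come out significant, which is the pattern the pairwise ANOVAs below confirm on the real data.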
But does this difference hold between every pair of income levels? The overall model cannot tell us: a significant F-test only says that at least one group mean differs from the others. To find out, we apply the ANOVA model again, this time for each pair of income_level values.

Low vs. medium (the "high" group excluded):

```python
sub = sub3[sub3['income_level'] != "high"]
model1 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub).fit()
print(model1.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.331
Model:                            OLS   Adj. R-squared:                  0.326
Method:                 Least Squares   F-statistic:                     59.99
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           3.25e-12
Time:                        11:01:08   Log-Likelihood:                -497.03
No. Observations:                 123   AIC:                             998.1
Df Residuals:                     121   BIC:                             1004.
Df Model:                           1
Covariance Type:            nonrobust
=============================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                    10.3644      1.777      5.834      0.000       6.847    13.882
C(income_level)[T.medium]    19.3820      2.502      7.746      0.000      14.428    24.336
==============================================================================
Omnibus:                        8.935   Durbin-Watson:                   1.969
Prob(Omnibus):                  0.011   Jarque-Bera (JB):                8.830
Skew:                           0.636   Prob(JB):                       0.0121
Kurtosis:                       3.324   Cond. No.                         2.63
==============================================================================
```

Low vs. high (the "medium" group excluded):

```python
sub = sub3[sub3['income_level'] != "medium"]
model2 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub).fit()
print(model2.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.765
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     388.0
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           2.94e-39
Time:                        11:01:42   Log-Likelihood:                -502.99
No. Observations:                 121   AIC:                             1010.
Df Residuals:                     119   BIC:                             1016.
Df Model:                           1
Covariance Type:            nonrobust
==========================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
Intercept                    66.1814      2.012     32.893      0.000      62.197    70.165
C(income_level)[T.low]      -55.8170      2.834    -19.697      0.000     -61.428   -50.206
==============================================================================
Omnibus:                       14.498   Durbin-Watson:                   1.792
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               22.232
Skew:                          -0.580   Prob(JB):                     1.49e-05
Kurtosis:                       4.750   Cond. No.                         2.63
==============================================================================
```

Medium vs. high (the "low" group excluded):

```python
sub = sub3[sub3['income_level'] != "low"]
model3 = smf.ols(formula='internetuserate ~ C(income_level)', data=sub).fit()
print(model3.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:        internetuserate   R-squared:                       0.511
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     125.6
Date:                Tue, 01 Mar 2016   Prob (F-statistic):           2.18e-20
Time:                        11:02:21   Log-Likelihood:                -524.39
No. Observations:                 122   AIC:                             1053.
Df Residuals:                     120   BIC:                             1058.
Df Model:                           1
Covariance Type:            nonrobust
=============================================================================================
                                coef    std err          t      P>|t|   [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                    66.1814      2.317     28.558      0.000      61.593    70.770
C(income_level)[T.medium]   -36.4350      3.251    -11.208      0.000     -42.871   -29.999
==============================================================================
Omnibus:                        4.358   Durbin-Watson:                   2.021
Prob(Omnibus):                  0.113   Jarque-Bera (JB):                3.846
Skew:                          -0.419   Prob(JB):                        0.146
Kurtosis:                       3.235   Cond. No.                         2.64
==============================================================================
```

It appears that the conclusion holds between the individual income levels as well (e.g.
there is a strong difference between the internet use rate of medium-income countries and that of high-income countries). The corresponding p-values are all very low, so these results are statistically robust: it is extremely unlikely that they are due to chance. Of course, this does not mean that there is a causal relationship between the two variables.
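As an implementation note, the three pairwise fits above are written out by hand, but they can also be generated in a loop. One caveat worth making explicit: when running several pairwise tests, the alpha threshold is usually divided by the number of comparisons (a Bonferroni correction), although here the p-values are so small that the conclusion is unaffected either way. A sketch using a synthetic stand-in for the `sub3` frame built earlier:

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# Hypothetical stand-in for sub3; group means loosely mimic the real fits
sub3 = pd.DataFrame({
    "income_level": np.repeat(["low", "medium", "high"], 40),
    "internetuserate": np.concatenate([rng.normal(10, 8, 40),
                                       rng.normal(30, 8, 40),
                                       rng.normal(66, 8, 40)]),
})

pairs = list(combinations(["low", "medium", "high"], 2))
alpha_adj = 0.05 / len(pairs)  # Bonferroni-adjusted threshold, about 0.0167

pvals = {}
for a, b in pairs:
    sub = sub3[sub3["income_level"].isin([a, b])]
    fit = smf.ols("internetuserate ~ C(income_level)", data=sub).fit()
    pvals[(a, b)] = fit.f_pvalue
    print(a, "vs", b, "significant at adjusted alpha:", fit.f_pvalue < alpha_adj)
```

Each iteration reproduces one of the hand-written models (model1 through model3), just with the pair selected via `isin` instead of `!=`.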
Zacharias Voulgaris, PhD: Passionate data scientist with a foxy approach to technology, particularly related to A.I.