Last night I was a bit tired and quickly concluded "the numbers show...". But do they really?
To find out I use Python to do some simple statistical analysis to add some weight to the claim that older Last.fm users have a larger vocabulary and tag items more frequently.
First, I compute 95% confidence intervals for the percentage of non-taggers in each age group. Seeing the large margins (in the table below) helps explain why the age group 25-30 has a higher percentage than the age group 19-22.
Computing the confidence interval is very easy. A user is either a tagger or not. The probability within an age group can thus be modeled with a Bernoulli distribution. The 95% confidence intervals for a Bernoulli distribution can be computed with:
z = 1.96 # norminv(0.975) for a 95% confidence interval
def binom_confidence(p,n):
if n*p*(1-p) >= 5:
return z*(p*(1-p)/n)**0.5
Btw, I couldn’t find the equivalent to the Matlab norminv function in Python. Any pointers would be appreciated!
To test the hypothesis that the vocabulary size of a tagger depends on her or his age I test the following: Given my observations, are all age groups likely to have the same vocabulary size, i.e, are the differences I observed just random noise? Since the distributions within each age group are far from Gaussian I can’t use a standard ANOVA. Instead I use the non-parametric version of a one-way ANOVA which is the Kruskal-Wallis test. In particular, I use the test to compute a p-value. The p-value is the probability that I would have made the same observation if the hypothesis that there is no difference between age groups would be true. (Thus smaller p-values are better. Usually one would expect at least a value below 0.05 before accepting an alternative hypothesis.) In this case the resulting p-value is nice and low indicating that it's extremely unlikely that older users don't have larger vocabularies.
Here are the results, and below is the Python code.
age || % non taggers || tagger's median
vocabulary size
14-19 || 41.3-48.1 || 6
19-22 || 37.5-43.6 || 7
22-25 || 40.6-47.3 || 9
25-30 || 34.8-41.4 || 8
30-60 || 28.4-36.0 || 13
Kruskal-Wallis p-value: 1.04e-008
from scipy.stats.stats import kruskal
from numpy import asarray
def print_stats(age_tags):
age_groups = (14,19,22,25,30,60)
ll = []
print "age || % non taggers || " + \
"tagger's median \n" + \
" vocabulary size"
for i in xrange(0,len(age_groups)-1):
nonzeros = [];
zero_count = 0
for j in xrange(age_groups[i],age_groups[i+1]):
for item in age_tags[j]:
if item!=0:
nonzeros.append(item)
else:
zero_count += 1
conf = binom_confidence(
zero_count/float(zero_count+len(nonzeros)),
zero_count+len(nonzeros))
ll.append(nonzeros);
print \
"%d-%d || %8.1f-%.1f || %2d" % \
(age_groups[i],age_groups[i+1],
(zero_count/float(max((1,len(nonzeros)+\
zero_count)))-conf)*100,
(zero_count/float(max((1,len(nonzeros)+\
zero_count)))+conf)*100,
median(nonzeros))
p = kruskal(*(asarray(ll[i]) for i in
xrange(len(age_groups)-1)))[1]
print "Kruskal-Wallis p-value: %.2e" % p