**Update (2008/08/25):** I fixed the same kind of bug that I had in the previous post on this topic. While fixing it, I decided to rerun the analysis with a sample of 5000 users instead of 2000. The code is fixed and the data is updated.

Last night I was a bit tired and quickly concluded "the numbers show...". But do they really?

To find out, I use Python to run a simple statistical analysis that adds some weight to the claim that older Last.fm users have a larger vocabulary and tag items more frequently.

First, I compute 95% confidence intervals for the percentage of non-taggers in each age group. Seeing the wide margins (in the table below) helps explain why the percentage does not drop steadily with age, for example why the age group 22-25 has a higher percentage than the age group 19-22: the two intervals overlap.

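Computing the confidence interval is very easy. A user is either a tagger or not, so each user can be modeled as a Bernoulli trial and the number of non-taggers in an age group follows a binomial distribution. For large enough samples, the 95% confidence interval of the observed proportion p among n users can be approximated with a normal distribution as p ± z·sqrt(p(1-p)/n). The half-width of that interval can be computed with: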

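    z = 1.96  # norminv(0.975) for a 95% confidence interval

    def binom_confidence(p, n):
        # Half-width of the normal-approximation interval for a proportion p
        # observed among n users; only trustworthy when n*p*(1-p) >= 5.
        if n * p * (1 - p) >= 5:
            return z * (p * (1 - p) / n) ** 0.5
        return float('nan')  # sample too small for the approximation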
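To get a feeling for the size of these margins, here is a small worked example. It is only a sketch: the function name `proportion_interval` and the counts (400 non-taggers among 1000 users) are made up for illustration and are not taken from my sample.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution

def proportion_interval(successes, n, z=Z_95):
    """Normal-approximation interval for a proportion, returned as (low, high)."""
    p = successes / float(n)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical age group: 400 non-taggers among 1000 sampled users.
low, high = proportion_interval(400, 1000)
print("%.1f%% - %.1f%%" % (low * 100, high * 100))  # 37.0% - 43.0%
```

Even with 1000 users in a group the interval spans about six percentage points, which is roughly the width of the intervals in the table below.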

Btw, I couldn’t find the equivalent to the Matlab norminv function in Python. Any pointers would be appreciated!
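One pointer, for what it's worth: the inverse of the normal CDF appears to be available in SciPy as `scipy.stats.norm.ppf`, and Python 3.8 and later also ship `statistics.NormalDist.inv_cdf` in the standard library. The sketch below uses the latter so that it runs without SciPy; the wrapper name `norminv` is just my own choice to mirror the Matlab function.

```python
from statistics import NormalDist

def norminv(q, mu=0.0, sigma=1.0):
    """Inverse normal CDF, named after the Matlab function it mimics."""
    return NormalDist(mu, sigma).inv_cdf(q)

# A two-sided 95% interval leaves 2.5% in each tail, hence the 0.975 quantile.
z_95 = norminv(0.975)
print(round(z_95, 2))  # 1.96
```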

To test the hypothesis that a tagger's vocabulary size depends on her or his age, I ask the following: given my observations, are all age groups likely to have the same vocabulary size, i.e., are the differences I observed just random noise? Since the distributions within each age group are far from Gaussian, I can’t use a standard ANOVA. Instead I use its non-parametric counterpart, the Kruskal-Wallis test, to compute a p-value. The p-value is the probability of observing differences at least as large as the ones I observed if the hypothesis that there is no difference between age groups were true. (Thus smaller p-values are stronger evidence against that hypothesis. Usually one would expect a value below 0.05 before accepting an alternative hypothesis.) In this case the resulting p-value is nice and low, indicating that differences like these would be extremely unlikely if vocabulary size did not depend on age. Note that the test only says that the groups differ; that the older users are the ones with the larger vocabularies is what the medians in the table suggest.
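To make the test less of a black box, here is a minimal sketch of the Kruskal-Wallis H statistic in plain Python. The function names and the sample numbers are made up for illustration, and the sketch skips the tie correction that `scipy.stats.kruskal` applies, so treat it as an approximation of what the library does, not as a replacement. The p-value comes from a chi-square distribution with (number of groups - 1) degrees of freedom; for three groups that tail probability is simply exp(-H/2).

```python
import math

def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_h(groups):
    """Kruskal-Wallis H statistic, without the correction for ties."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical vocabulary sizes for three age groups (not real Last.fm data).
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_h(groups)
p = math.exp(-h / 2.0)  # chi-square tail for 2 degrees of freedom (3 groups)
print("H = %.2f, p = %.3f" % (h, p))
```

The test only looks at ranks, which is why it does not care that the vocabulary sizes are far from Gaussian. With groups as tiny as in this toy example the chi-square approximation is crude; it is meant for samples like the real ones, with hundreds of users per group.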

Here are the results, and below is the Python code.

    age   || % non taggers || tagger's median vocabulary size
    14-19 || 41.3-48.1     || 6
    19-22 || 37.5-43.6     || 7
    22-25 || 40.6-47.3     || 9
    25-30 || 34.8-41.4     || 8
    30-60 || 28.4-36.0     || 13

Kruskal-Wallis p-value: 1.04e-008

    from numpy import asarray, median
    from scipy.stats import kruskal

    def print_stats(age_tags):
        # age_tags[age] holds one vocabulary size per sampled user of that
        # age; a 0 marks a user who has not tagged anything.
        age_groups = (14, 19, 22, 25, 30, 60)
        ll = []  # vocabulary sizes of the taggers, one list per age group
        print("age || % non taggers || " +
              "tagger's median \n" +
              " vocabulary size")
        for i in range(len(age_groups) - 1):
            nonzeros = []
            zero_count = 0
            for j in range(age_groups[i], age_groups[i + 1]):
                for item in age_tags[j]:
                    if item != 0:
                        nonzeros.append(item)
                    else:
                        zero_count += 1
            # Guard against an empty age group before dividing.
            n = max(1, zero_count + len(nonzeros))
            p_zero = zero_count / float(n)  # fraction of non-taggers
            conf = binom_confidence(p_zero, n)
            ll.append(nonzeros)
            print("%d-%d || %8.1f-%.1f || %2d" %
                  (age_groups[i], age_groups[i + 1],
                   (p_zero - conf) * 100,
                   (p_zero + conf) * 100,
                   median(nonzeros)))
        # kruskal returns (H statistic, p-value); only the p-value is used.
        p = kruskal(*(asarray(group) for group in ll))[1]
        print("Kruskal-Wallis p-value: %.2e" % p)