Sunday, 29 June 2008

Last.fm's API, Python, and tagging behaviour (Part 2)

Update (2008/08/25): I fixed the same kind of bug I had in the previous post on this topic. While fixing it I decided to rerun the analysis with a sample of 5000 instead of 2000 users. The code is fixed and the data is updated.

Last night I was a bit tired and quickly concluded "the numbers show...". But do they really?

To find out, I use Python for some simple statistical analysis that adds weight to the claim that older Last.fm users have a larger vocabulary and tag items more frequently.

First, I compute 95% confidence intervals for the percentage of non-taggers in each age group. The large margins (see the table below) help explain why the age group 25-30 has a higher percentage than the age group 19-22. However, they don’t explain why the age group 22-25 has a lower percentage. I’d blame that on the relatively small and skewed sample, and I’d argue that these age groups are still reasonably similar, both in average age and in the deviations of the percentages. (Update: with the larger sample of 5000 users this is no longer the case.)

Computing the confidence interval is very easy. A user is either a tagger or not, so each user can be modeled as a Bernoulli trial, and the number of non-taggers among the n users of an age group follows a binomial distribution. Using the normal approximation, the half-width of the 95% confidence interval around the observed fraction p can be computed with:
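z = 1.96 # norminv(0.975) for a 95% confidence interval
def binom_confidence(p,n):
    # half-width of the interval; the normal approximation is only
    # used if n*p*(1-p) >= 5, otherwise None is returned
    if n*p*(1-p) >= 5:
        return z*(p*(1-p)/n)**0.5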

Btw, I couldn’t find the equivalent to the Matlab norminv function in Python. Any pointers would be appreciated!
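One candidate, as far as I can tell, is the inverse of the standard normal CDF. SciPy exposes it as scipy.stats.norm.ppf, and Python 3.8 or newer has it in the standard library as statistics.NormalDist().inv_cdf. Below is a minimal sketch using the standard library version, together with a worked interval. The helper name binom_margin and the group of 1000 users with 450 non-taggers are made up for illustration and are not taken from my data.

```python
from statistics import NormalDist

# Inverse of the standard normal CDF, the counterpart of Matlab's norminv
z = NormalDist().inv_cdf(0.975)

def binom_margin(p, n, z=z):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * (p * (1 - p) / n) ** 0.5

# Hypothetical age group: 450 non-taggers among 1000 users
p_hat, n = 450 / 1000, 1000
margin = binom_margin(p_hat, n)

print("z = %.4f" % z)  # z = 1.9600
print("%.1f%% - %.1f%%" % ((p_hat - margin) * 100,
                           (p_hat + margin) * 100))  # 41.9% - 48.1%
```

With n*p*(1-p) = 247.5 this example is well above the threshold of 5 used in binom_confidence, so the normal approximation should be reasonable here.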

To test the hypothesis that the vocabulary size of a tagger depends on her or his age, I test the following: given my observations, are all age groups likely to have the same vocabulary size, i.e., are the differences I observed just random noise? Since the distributions within each age group are far from Gaussian, I can’t use a standard ANOVA. Instead I use the Kruskal-Wallis test, the non-parametric version of a one-way ANOVA, and compute a p-value with it. The p-value is the probability of observing differences at least as large as the ones I found if the hypothesis that there is no difference between age groups were true. (Thus smaller p-values are stronger evidence against that hypothesis. Usually one would want a value below 0.05 before accepting an alternative hypothesis.) In this case the resulting p-value is nice and low, indicating that differences like these would be extremely unlikely if all age groups had the same vocabulary size. Note that the test only says that the groups differ, not in which direction; the medians in the table below suggest that older users have the larger vocabularies.
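To make it a bit more concrete what the test computes, here is a minimal pure-Python sketch of the Kruskal-Wallis H statistic with the usual tie correction. The function name kruskal_wallis and the toy data are made up for illustration; for the real analysis I rely on SciPy's kruskal. The p-value uses the chi-square approximation, and to avoid extra dependencies this sketch only handles an even number of degrees of freedom (an odd number of groups), where the chi-square tail has a simple closed form.

```python
import math

def kruskal_wallis(*groups):
    """Return (H, p) for the Kruskal-Wallis test (chi-square approximation)."""
    pooled = sorted(x for g in groups for x in g)
    n_total = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i                               # number of tied values
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j

    # H compares the rank sums of the groups with what chance would give
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        sum(rank_of[x] for x in g) ** 2 / float(len(g)) for g in groups
    ) - 3 * (n_total + 1)
    # Tie correction (fails if all values are identical)
    h /= 1 - tie_term / float(n_total ** 3 - n_total)

    df = len(groups) - 1
    if df % 2:
        raise ValueError("this sketch needs an odd number of groups")
    # Chi-square survival function, closed form for even df
    half = h / 2.0
    p = math.exp(-half) * sum(half ** k / math.factorial(k)
                              for k in range(df // 2))
    return h, p

# Toy data: three clearly separated groups
h, p = kruskal_wallis([1, 2, 3], [4, 5, 6], [7, 8, 9])
print("H = %.2f, p = %.4f" % (h, p))  # H = 7.20, p = 0.0273
```

The idea is that all values are ranked together, and if the groups really came from the same distribution, each group's average rank should be about the same. The larger H is, the less plausible that becomes.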

Here are the results, followed by the Python code. The second column shows the 95% confidence interval for the percentage of non-taggers.

age || % non taggers || tagger's median
vocabulary size
14-19 || 41.3-48.1 || 6
19-22 || 37.5-43.6 || 7
22-25 || 40.6-47.3 || 9
25-30 || 34.8-41.4 || 8
30-60 || 28.4-36.0 || 13
Kruskal-Wallis p-value: 1.04e-008


from scipy.stats import kruskal
from numpy import asarray, median

def print_stats(age_tags):
    # age_tags[age] holds one vocabulary size per user of that age;
    # a value of 0 marks a non-tagger
    age_groups = (14,19,22,25,30,60)
    ll = []  # vocabulary sizes of the taggers, one list per age group
    print("age || % non taggers || " +
          "tagger's median \n" +
          " vocabulary size")
    for i in range(0,len(age_groups)-1):
        nonzeros = []
        zero_count = 0
        # each group covers the ages age_groups[i] .. age_groups[i+1]-1
        for j in range(age_groups[i],age_groups[i+1]):
            for item in age_tags[j]:
                if item!=0:
                    nonzeros.append(item)
                else:
                    zero_count += 1
        n = zero_count+len(nonzeros)
        share = zero_count/float(max(1,n))  # fraction of non-taggers
        # half-width of the 95% confidence interval around share
        conf = binom_confidence(share,n)
        ll.append(nonzeros)
        print("%d-%d || %8.1f-%.1f || %2d" %
              (age_groups[i],age_groups[i+1],
               (share-conf)*100,
               (share+conf)*100,
               median(nonzeros)))
    # Kruskal-Wallis test across the age groups; element [1] is the p-value
    p = kruskal(*(asarray(ll[i]) for i in
                  range(len(age_groups)-1)))[1]
    print("Kruskal-Wallis p-value: %.2e" % p)

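Btw, the part where I originally used eval to convert my lists into function arguments could hardly have been any uglier. I was sure there must be a better way of doing that, and there is: the code above now uses the star syntax suggested in the comments. (Thanks Klaas!)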

2 comments:

Klaas Bosteels said...

Yes, there is a better way :) As illustrated by this example, you can use the elements of an iterable as arguments of a function by putting a star (*) in front of it:

>>> def plus(x,y): return x+y
...
>>> plus(*[1,2])
3

So something like this is probably what you need:

kruskal(*(asarray(ll[i]) for i in xrange(len(age_groups)-1)))

Elias said...

Thanks Klaas! :-)