Sunday 29 June 2008

Last.fm's API, Python, and tagging behaviour

Update (2008/08/25): I fixed the bug pointed out by thisfred. I also noticed that what I had presented as the percentage of non-taggers was actually the ratio of non-taggers to taggers; that is corrected now.

My colleagues completely redesigned the Last.fm API. Inspired by their efforts, and by all the amazing things the Last.fm community has already built with the old API, I decided to try doing something with the API as well. The first thing that came to mind was to use the public API to show that younger people have a smaller tagging vocabulary than older people. I couldn't figure out how to get a user's age from the new API, so I used the old one. Anyway, here are the results, along with the Python script I used. (Btw, any feedback on my Python coding is very welcome; I'm still very much a Python newbie.)

I crawled about 2000 users, starting with RJ as the seed. The first column is the age group (the lower bound is inclusive, the upper bound exclusive). The second column is the fraction of users who haven't used any tags, followed in parentheses by the number of users without tags / the number of users with tags. The last number is the median number of unique tags among users who have applied at least one tag.

14-19: zeros: 0.44 (120/155), median tags: 6
19-22: zeros: 0.42 (153/215), median tags: 6
22-25: zeros: 0.38 (114/184), median tags: 9
25-30: zeros: 0.43 (141/188), median tags: 10
30-60: zeros: 0.31 (79/179), median tags: 11

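The numbers show that the median number of unique tags rises steadily with age. The share of non-taggers is lowest in the oldest group (0.31), although it does not fall steadily across the younger groups.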
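
To make the difference between the two statistics mentioned in the update concrete, here is a small sketch using the counts from the 14-19 row. It is written in Python 3 (unlike the script below), and the function name `non_tagger_stats` is made up for this illustration.

```python
def non_tagger_stats(zero_count, tagger_count):
    """Return (share, ratio) for a group of users.

    share: non-taggers as a fraction of all users in the group
    ratio: non-taggers per tagger (assumes tagger_count > 0)
    """
    total = zero_count + tagger_count
    share = zero_count / total
    ratio = zero_count / tagger_count
    return share, ratio


# Counts taken from the 14-19 row of the table: 120 non-taggers, 155 taggers.
share, ratio = non_tagger_stats(120, 155)
print("share: %.2f, ratio: %.2f" % (share, ratio))
```

The share matches the 0.44 reported in the table, while the ratio of non-taggers to taggers for the same row would be 0.77.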

from xml.dom import minidom
from urllib import quote, urlopen
from time import sleep
from numpy import median
from collections import defaultdict

seed = 'RJ' # start with Last.fm's CTO
MAX_RETRIES_URL_OPEN = 5

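def get_xml(url):
    for i in xrange(MAX_RETRIES_URL_OPEN):
        try:
            sleep(1) # be nice!
            return minidom.parse(urlopen(url))
        except IOError:
            print "(%d/%d) Failed trying to get: %s." % \
                (i + 1, MAX_RETRIES_URL_OPEN, url)
    # all retries failed: raise instead of silently returning None
    raise IOError("Giving up on: %s" % url)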

def get_friends(user, friends, ignore_friends):
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
        + quote(user) + u'/friends.xml'
    xmldoc = get_xml(url)
    xmlusers = xmldoc.getElementsByTagName("user")
    for xmluser in xmlusers:
        u = xmluser.getAttribute("username")
        if u not in ignore_friends:
            friends.add(u)
    print "%d/%d" % (len(friends), len(ignore_friends))
    return friends

def get_age(user):
    ''' returns zero if user has not set his or her age '''
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
        + quote(user) + u'/profile.xml'
    xmlage = get_xml(url).getElementsByTagName("age")
    if len(xmlage) == 0:
        return 0
    return int(xmlage[0].firstChild.nodeValue)

def get_tags(user):
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
        + quote(user) + u'/tags.xml'
    return len(get_xml(url).getElementsByTagName("tag"))

def print_stats(age_tags):
    age_groups = (14, 19, 22, 25, 30, 60)
    for i in xrange(len(age_groups) - 1):
        nonzeros = []
        zero_count = 0
        # each group covers the ages lower..upper-1
        for j in xrange(age_groups[i], age_groups[i+1]):
            for item in age_tags[j]:
                if item != 0:
                    nonzeros.append(item)
                else:
                    zero_count += 1
        total = len(nonzeros) + zero_count
        # guard against empty groups: no division by zero,
        # and no median of an empty list
        print \
            "%d-%d: zeros: %.2f (%d/%d), median tags: %d" % \
            (age_groups[i], age_groups[i+1],
             zero_count / float(max(1, total)),
             zero_count, len(nonzeros),
             median(nonzeros) if nonzeros else 0)


users_notvisited = set([seed])
users_visited = set()

while len(users_notvisited) > 0 and \
        len(users_notvisited) + len(users_visited) < 2000:
    user = users_notvisited.pop()
    if user not in users_visited:
        users_notvisited = \
            get_friends(user, users_notvisited,
                        users_visited)
        users_visited.add(user)

users = users_notvisited.union(users_visited)

age_tags = defaultdict(list)
i = 0
for user in users:
    i += 1
    print "%d/%d" % (i, len(users))
    age_tags[get_age(user)].append(get_tags(user))
    if i % 5 == 0:
        print_stats(age_tags)
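
The script above is Python 2 and depends on the old web service, so it is hard to try out as is. As a rough illustration of the crawl loop only, here is a Python 3 sketch that replaces the network call with a made-up in-memory friend graph. The names `crawl`, `fetch_friends` and `FAKE_GRAPH` are assumptions for this example and are not part of the original script or of any Last.fm API.

```python
def crawl(seed, fetch_friends, limit):
    """Collect usernames reachable from `seed` until about `limit` are known.

    `fetch_friends` is a hypothetical callable mapping a username to an
    iterable of that user's friends (the script above fetches friends.xml).
    """
    not_visited = {seed}
    visited = set()
    while not_visited and len(not_visited) + len(visited) < limit:
        user = not_visited.pop()
        visited.add(user)
        for friend in fetch_friends(user):
            # Skip users whose friend lists were already expanded.
            if friend not in visited:
                not_visited.add(friend)
    return not_visited | visited


# Made-up friend graph, for illustration only.
FAKE_GRAPH = {
    "RJ": ["alice", "bob"],
    "alice": ["RJ", "carol"],
    "bob": ["carol"],
    "carol": [],
}

found = crawl("RJ", lambda user: FAKE_GRAPH.get(user, []), limit=100)
print(sorted(found))
```

Because the users come out of a set, the order in which they are visited is arbitrary, which is also true of the original loop. The limit is only checked between users, so the final count can overshoot it by the size of one friend list.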

2 comments:

Unknown said...

*sigh* Thanks! Yet more confirmation that I'm getting long in the tooth: 3791 tags and counting ;)

The Python code looks pretty good. There are some minor stylistic improvements possible, which I'll send by email if I can figure out your address, because getting code to look ok in blog comments is beyond my skills.

Unknown said...

In the process of writing that email. There is one real bug, so for the benefit of everyone reading this post: don't do this:

> def get_friends(user, friends = set(),
>                 ignore_friends = set()):

It is a bad idea to use mutable objects (lists, dicts, sets, class instances, etc.) as default values for keyword arguments in your function's 'def'. The default values are evaluated once, when the function definition is executed, and not each time the function is called, so every call that relies on the default shares the same object. You get very surprising things like this:

>>> def foo(thing=[]):
...     thing.extend([1])
...     return thing
...
>>> foo()
[1]
>>> foo()
[1, 1]
>>> foo()
[1, 1, 1]
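
A common way around this is to use None as the default and create the mutable object inside the function. The sketch below is in Python 3, and the function name `append_one` is made up for the illustration.

```python
def append_one(thing=None):
    # Create a fresh list on every call instead of sharing one default object.
    if thing is None:
        thing = []
    thing.append(1)
    return thing


first = append_one()
second = append_one()
print(first, second)
```

Each call that omits the argument now gets its own list, so the results no longer accumulate between calls.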