Sunday 29 June 2008

Last.fm's API, Python, and tagging behaviour (Part 2)

Update (2008/08/25): I fixed the same kind of bug I had in the previous post on this topic. While fixing it I decided to rerun the analysis with a sample of 5000 users instead of 2000. The code and the data below are updated.

Last night I was a bit tired and quickly concluded "the numbers show...". But do they really?

To find out, I use Python for some simple statistical analysis that adds weight to the claim that older Last.fm users have a larger tagging vocabulary and tag items more frequently.

First, I compute 95% confidence intervals for the percentage of non-taggers in each age group. Seeing the large margins (in the table below) helps explain why the 25-30 age group has a higher percentage than the 19-22 age group. However, it doesn't explain why the 22-25 age group has a lower percentage. (I'd blame that on the relatively small and skewed sample, and I'd argue that the groups are still reasonably similar, both in average age and in the spread of the percentages.) (With the larger sample size this is no longer the case.)

Computing the confidence interval is very easy. Whether a user is a tagger or not can be modeled as a Bernoulli trial, so the proportion of non-taggers within an age group follows a binomial distribution. A 95% confidence interval for that proportion can be computed with the normal approximation:
z = 1.96 # norminv(0.975) for a 95% confidence interval

def binom_confidence(p, n):
    # half-width of the 95% confidence interval for a proportion;
    # the normal approximation is only reasonable if n*p*(1-p) >= 5
    if n*p*(1-p) >= 5:
        return z*(p*(1-p)/n)**0.5
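A quick usage example with made-up numbers (say a group of 800 users of whom 45% never applied a tag) gives a margin in the same ballpark as the ranges in the table below:

# hypothetical group: n = 800 users, 45% non-taggers
conf = binom_confidence(0.45, 800)
print "+/- %.1f percentage points" % (conf * 100)  # roughly 3.4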

Btw, I couldn’t find the equivalent to the Matlab norminv function in Python. Any pointers would be appreciated!
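A likely candidate is scipy.stats.norm.ppf, the percent point function, i.e. the inverse CDF of the normal distribution:

# scipy's percent point function is the inverse of the CDF,
# which is what Matlab's norminv computes
from scipy.stats import norm

z = norm.ppf(0.975)  # ~1.959964, matches norminv(0.975)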

To test the hypothesis that a tagger's vocabulary size depends on her or his age I test the following: given my observations, are all age groups likely to have the same vocabulary size, i.e., are the differences I observed just random noise? Since the distributions within each age group are far from Gaussian I can't use a standard ANOVA. Instead I use the non-parametric counterpart of a one-way ANOVA, the Kruskal-Wallis test. In particular, I use the test to compute a p-value: the probability of making observations at least as extreme as mine if there were no difference between age groups. (Thus smaller p-values are better. Usually one would expect at least a value below 0.05 before accepting an alternative hypothesis.) In this case the resulting p-value is nice and low, indicating that the differences between age groups are very unlikely to be just random noise.
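As a minimal, self-contained illustration with made-up numbers (not the Last.fm data), scipy's kruskal takes one sequence per group and returns the test statistic together with the p-value:

from scipy.stats import kruskal

# toy vocabulary sizes for three hypothetical age groups
young = [3, 5, 2, 6, 4, 5]
middle = [7, 9, 6, 8, 10, 7]
older = [12, 15, 9, 14, 11, 13]

h, p = kruskal(young, middle, older)
print "H = %.2f, p-value = %.4f" % (h, p)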

Here are the results, and below is the Python code.

age   || % non-taggers (95% CI) || taggers' median vocabulary size
14-19 || 41.3-48.1 || 6
19-22 || 37.5-43.6 || 7
22-25 || 40.6-47.3 || 9
25-30 || 34.8-41.4 || 8
30-60 || 28.4-36.0 || 13

Kruskal-Wallis p-value: 1.04e-08


from scipy.stats.stats import kruskal
from numpy import asarray, median

def print_stats(age_tags):
    age_groups = (14, 19, 22, 25, 30, 60)
    ll = []
    print "age || % non-taggers || taggers' median vocabulary size"
    for i in xrange(0, len(age_groups)-1):
        nonzeros = []
        zero_count = 0
        for j in xrange(age_groups[i], age_groups[i+1]):
            for item in age_tags[j]:
                if item != 0:
                    nonzeros.append(item)
                else:
                    zero_count += 1
        # 95% confidence interval for the share of non-taggers
        conf = binom_confidence(
            zero_count/float(zero_count+len(nonzeros)),
            zero_count+len(nonzeros))
        ll.append(nonzeros)
        print "%d-%d || %8.1f-%.1f || %2d" % \
            (age_groups[i], age_groups[i+1],
             (zero_count/float(max((1, len(nonzeros)+zero_count)))-conf)*100,
             (zero_count/float(max((1, len(nonzeros)+zero_count)))+conf)*100,
             median(nonzeros))
    # compare taggers' vocabulary sizes across all age groups
    p = kruskal(*(asarray(l) for l in ll))[1]
    print "Kruskal-Wallis p-value: %.2e" % p

Btw, the part where I originally used eval to convert my lists into function arguments could hardly have been any uglier. Unpacking the lists with * (as shown above) is a much better way of doing it. (Thanks Klaas!)

Last.fm's API, Python, and tagging behaviour

Update (2008/08/25): I fixed the bug pointed out by thisfred. I also noticed that what I thought was the percentage of non-taggers was actually the ratio of non-taggers to taggers... I changed that now.

My colleagues completely redesigned the Last.fm API. Inspired by their efforts and all the amazing things the Last.fm community has already built with the old API I decided that I wanted to try doing something with the API as well. The first thing that came to my mind was to use the public API to show that younger people have a smaller tagging vocabulary than older people. I couldn't figure out how to get a user's age from the new API so I used the old one. Anyway, here are the results and I also included the Python script I used. (Btw, any feedback on my Python coding is very welcome, I'm still very much a Python newbie.)

I crawled about 2000 users starting with RJ as seed. The first column is the age group, the second column is the fraction of users who haven't used any tags (with the counts of non-taggers/taggers in parentheses), and the last number is the median number of unique tags among those users who have applied tags.

14-19: zeros: 0.44 (120/155), median tags: 6
19-22: zeros: 0.42 (153/215), median tags: 6
22-25: zeros: 0.38 (114/184), median tags: 9
25-30: zeros: 0.43 (141/188), median tags: 10
30-60: zeros: 0.31 (79/179), median tags: 11

The numbers show that older users tag more and apply more unique tags.

from xml.dom import minidom
from urllib import quote, urlopen
from time import sleep
from numpy import median
from collections import defaultdict

seed = 'RJ' # start with Last.fm's CTO
MAX_RETRIES_URL_OPEN = 5

def get_xml(url):
    for i in xrange(MAX_RETRIES_URL_OPEN):
        try:
            sleep(1) # be nice!
            return minidom.parse(urlopen(url))
        except IOError:
            print "(%d/%d) Failed trying to get: %s." % \
                (i, MAX_RETRIES_URL_OPEN, url)

def get_friends(user, friends, ignore_friends):
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
        + quote(user) + u'/friends.xml'
    xmldoc = get_xml(url)
    xmlusers = xmldoc.getElementsByTagName("user")
    for user in xmlusers:
        u = user.getAttribute("username")
        if u not in ignore_friends:
            friends.add(u)
    print "%d/%d" % (len(friends), len(ignore_friends))
    return friends

def get_age(user):
    ''' returns zero if user has not set his or her age '''
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
        + quote(user) + u'/profile.xml'
    xmlage = get_xml(url).getElementsByTagName("age")
    if len(xmlage) == 0: return 0
    return int(xmlage[0].firstChild.nodeValue)

def get_tags(user):
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
        + quote(user) + u'/tags.xml'
    return len(get_xml(url).getElementsByTagName("tag"))

def print_stats(age_tags):
    age_groups = (14, 19, 22, 25, 30, 60)
    for i in xrange(0, len(age_groups)-1):
        nonzeros = []
        zero_count = 0
        for j in xrange(age_groups[i], age_groups[i+1]):
            for item in age_tags[j]:
                if item != 0:
                    nonzeros.append(item)
                else:
                    zero_count += 1
        print \
            "%d-%d: zeros: %.2f (%d/%d), median tags: %d" % \
            (age_groups[i], age_groups[i+1],
             zero_count/max((1, float(len(nonzeros)+zero_count))),
             zero_count, len(nonzeros), median(nonzeros))


users_notvisited = set([seed])
users_visited = set()

while len(users_notvisited) > 0 and \
      len(users_notvisited) + len(users_visited) < 2000:
    user = users_notvisited.pop()
    if user not in users_visited:
        users_notvisited = \
            get_friends(user, users_notvisited, users_visited)
        users_visited.add(user)

users = users_notvisited.union(users_visited)

age_tags = defaultdict(list)
i = 0
for user in users:
    i += 1
    print "%d/%d" % (i, len(users))
    age_tags[get_age(user)].append(get_tags(user))
    if i % 5 == 0:
        print_stats(age_tags)

Thursday 26 June 2008

Matlab, Python, and a Video

I've been using Matlab extensively for probably almost 10 years. I have written more lines of code in Matlab than in any other language. I always have at least one Matlab application window open. I've probably generated at least a few million Matlab figures (one of my favorite Matlab functions is close all). I've written three small toolboxes in Matlab (and all of them have actually been used by people other than me). I've told anyone who was willing to listen that I couldn't have gotten even a fraction of my work done without Matlab. In fact, 3 times in a row I convinced the places I've been working at that I needed a (non-academic) license for Matlab and several of its toolboxes. I even had a Matlab sticker on my old laptop for a long time. I frequently visited the Matlab news group and I'm subscribed to several Matlab-related blogs. If I had needed to take a single tool with me to a remote island it would have been Matlab. I guess it's fair to say I was in love with Matlab.

However, I always felt that it wasn't a perfect relationship. Matlab is expensive. Matlab is not pre-installed on the Linux machines I remotely connect to. In fact, installing Matlab on Linux is a pain (compared to how easy it is to install it on Windows). Furthermore, not everyone has access to Matlab, making it harder to share code. Finally, Matlab can be rather useless when it comes to things that are not best described in terms of matrices that fit into memory, and I can't easily run Matlab code on Hadoop.

I had been playing with Python out of curiosity (wondering why everyone liked it so much) but I guess I was too happy with Matlab to seriously consider alternatives. But then Klaas showed me how to use Python with Hadoop. Within a very short time I started to use Python more and more for things I would usually have done in Matlab. Now I write more Python code a day than Matlab code. I still use Matlab on a daily basis, but if I had to choose between Matlab and Python, it would be a very easy choice. SciPy and related modules are wonderful. If I were to redo my PhD thesis, it wouldn't include a single line of Matlab code, but lots of Python code instead :-)

Btw, James pointed me to the following visualization showing activities and shared code of Python developers over time. This is by far the best information visualization I have seen in a very long time. I really like the idea and implementation. I wonder if something similar could be done for a piece of music where the coders are replaced with instruments, and the files are replaced with sounds.


code_swarm - Python from Michael Ogawa on Vimeo.

Wednesday 25 June 2008

ISMIR'08 Student Travel Award

It's wonderful to see Sun Microsystems sponsoring student travel awards for this year's ISMIR. Submission deadline for applications is July 4th - very soon!

I highly recommend applying for an award even if it might seem like a bureaucratic burden. Sure, any student who gets an award will still need to find additional sources of funding. However, it's always easier to find smaller amounts of money, and as a researcher it is not unusual to spend a lot of time writing project proposals asking for grants. Student travel awards are a great way to start practicing! And writing one page yourself and asking your professor to write a recommendation is actually not a lot of effort. Btw, professors deal with recommendations very frequently, they shouldn't complain if you ask them to write you one :-)

I remember when I received a student travel award for the ACM KDD 2003: It was lots of fun because students who won the award also got a chance to participate in the organization. And helping in the organization of such a huge conference was a great experience.

Tuesday 24 June 2008

Late Night Thoughts

Last night I went to bed and fell asleep listening to a podcast that Paul recommended. I didn't stay awake for long but I remember that the interviewer seemed skeptical about some of the ideas of open research.

I guess it's natural to be skeptical when something is radically different to what we are used to. I wonder what Newton would have said if someone had told him to publish or perish and that there are plenty of good journals that publish articles within a year of submission (including the review process). It took Newton almost 22 years to publish his findings, and that was not so unusual back in 1687.

Using those two data points (22 years in 1687 and 1 year in 2008), a simple linear model would suggest that a submit-publish cycle might take less than 3 months in 2020. While I can't see how that linear model is a good fit, I could easily see how pushing a publication out within 3 months could be achieved with tools similar to what we know as blogs today.
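For the curious, the back-of-the-envelope calculation is nothing more than a straight line through the two points:

# tongue-in-cheek linear fit through (1687, 22 years) and (2008, 1 year)
x1, y1 = 1687.0, 22.0
x2, y2 = 2008.0, 1.0
slope = (y2 - y1) / (x2 - x1)          # about -0.065 years per year
months_2020 = (y2 + slope * (2020 - x2)) * 12
print "predicted submit-publish cycle in 2020: %.1f months" % months_2020
# -> roughly 2.6 months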

Furthermore, looking at historical data it's also not too hard to see that, unlike what some might expect, we will see fewer disputes about who discovered what first.

Btw, tonight I'll try to stay awake as long as possible with this podcast about earworms.

Thursday 19 June 2008

More MIR related PhDs

I just updated the list of MIR PhDs. The most notable update is the thesis by Adam Berenzweig. Unfortunately his thesis is not publicly available for download, but it seems like he has found an answer to why we've been observing hubs when using certain similarity measures for music. Quoting from his abstract: "A practical problem with this technique, known as the hub phenomenon, is explored, and we conclude that it is related to the curse of dimensionality."

I'm sure there are still plenty of dissertations missing from the list and I'll be happy to add any that are sent to me. (I'll also be happy to update any broken links or missing information...)

Btw, the list is slowly approaching 100 entries now. The Matlab script I wrote to generate the html files and statistics now takes almost 5 seconds to complete.

Friday 13 June 2008

Myths about Last.fm tags

Today I was pointed to the following: "Last.fm has thousands of tags, unfortunately they are all pretty bad." (A statement made in this video of a very interesting talk about autotagging and applications that can be built using tags, around minute 51.)

I think this needs some clarification: Last.fm has a lot more than just a few thousand tags. The 1,000,000th unique tag applied by a Last.fm user was earthbeat about half a year ago.

Related links: fun stats on Last.fm tags, Last.fm's multi-tag search.

Tuesday 10 June 2008

Machine Learning Rant

This rant is inspired by this wonderful blog post which I found through Greg Linden's blog.

Most people who've worked with me might know that I'm very skeptical about using machine learning algorithms. In most of my work I've avoided using them as a solution to a problem.

My problem with machine learning algorithms is the way they are used. I think a beautiful example to illustrate this failure is genre classification. Countless papers have been published claiming around 80% classification accuracy. There have even been a number of papers indicating that these 80% are close to the disagreement level between humans (i.e. the machine's 80% are as good as the genre classification performance of any human).

Anyone who has seriously looked at such trained genre classifiers in more detail will have wondered why the measured accuracy is almost perfect and yet the results on a new data set are often not so satisfactory. The simple explanation for this specific problem is that instead of genre classification accuracy most researchers have been measuring artist classification accuracy, because training and test sets often included pieces from the same artists, and most pieces by an artist belong to the same genre. (I've been arguing for the use of an artist filter since 2005, and yet I still see lots of papers published which ignore the issue completely...)
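To make the artist filter idea concrete, here is a minimal sketch (the data layout and names are made up for illustration); the only point is that all pieces by one artist end up on the same side of the split:

import random

def artist_filtered_split(tracks, test_fraction=0.3, seed=42):
    ''' tracks: list of (artist, features, genre) tuples;
        no artist appears in both the training and the test set '''
    artists = list(set(artist for artist, _, _ in tracks))
    random.Random(seed).shuffle(artists)
    test_artists = set(artists[:int(len(artists) * test_fraction)])
    train = [t for t in tracks if t[0] not in test_artists]
    test = [t for t in tracks if t[0] in test_artists]
    return train, test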

Anyway, the point is that people using machine learning algorithms often consider their problem to be solved if their model performs well on their test set. I often have the impression that no effort is made to understand what the machine has learned.

Most of the time when I explain that I'm skeptical about machine learning I'm confronted with raised eyebrows. How can someone seriously challenge best-practice methods? Surely anyone who is skeptical about machine learning has not understood what it is about?

Despite all the responses I've received so far, today, after having read the wonderful blog post mentioned above, I feel like there are lots of people out there (many of whom are surely a lot smarter than me) who are skeptical about the use of machine learning algorithms. The next time I'm trying to explain why I think that blindly using and trusting a machine-learned model is not a solution I'll point to Google's ranking of search results :-)

Having said that, I think there is a lot of potential for machine learning algorithms that generate human-readable explanations of the models they build. In particular, in the same way any human data analyst uses experience, common sense, and data to justify any decision when building a model, I'd like to see machine learning algorithms do the same. In addition, it would be nice if (like a human learner) the algorithm could point to possible limitations which cannot be foreseen based solely on the given data.

I guess I should also add that all of this is just a matter of definitions. By machine learning I mean black boxes which people use without bothering to understand what happens inside them. In contrast to my definition, many consider statistical data mining to be an important part of machine learning (which I sometimes don't, because it requires specific domain knowledge, as well as human learning, understanding, and judgment). Furthermore, I have no doubts that Google applies machine learning algorithms all the time, and that combining machine learning with human learning is a very natural and easy step, unlike what my rant above might indicate.

Monday 9 June 2008

Quaero

While trying to catch up with my emails I stumbled upon Geoffrey Peeters' mail in the Music-IR list. He still has 2 open positions in the Quaero project, which is a huge project I know almost nothing about. A quick Google search revealed that Quaero is French, it's worth 200M€, will last 5 years, has 24 partners, and is Latin for "I'm searching".

I guess only a tiny fraction of Quaero will be dealing with music and MIR research. However, a tiny fraction of 200M€ is still huge, and the topics Geoffrey mentions are very interesting: genre/mood classification, music similarity, chorus detection, ... Quaero also seems to have a strong emphasis on gathering and annotating data that can be used to evaluate and improve algorithms.

Btw, there is some more information about Quaero in this presentation.

I found it interesting that the administrative work on Quaero started in August 2005. Operational work (research etc.) will start in November 2008. I wonder how high their overhead costs are and how much flexibility they still have?

On slide #7 the concrete innovations are mentioned (and 5 examples are given). They mention detecting soccer goals in a recording of a soccer match, identifying the songs in a sound track, and using automatic translation to enable people to search in different languages. I wonder if they had to finalize the innovative outcomes of Quaero back in 2005?

Overall it seems like part of Quaero's strategy is to enter markets which have already been covered by many startups as well as large players such as Google. In particular, it almost seems like Quaero is intended to help Exalead survive a bit longer? (Exalead is a French search engine startup. I briefly tried their search and it seemed a bit slow. When searching for "music information retrieval" they fitted 1.25 proper search results on my 15.4" screen; the rest was ads and other information I don't care about... I'm not surprised they aren't doing well - not even in France.)

Anyway, IRCAM is leading Quaero's music technology, so I expect we'll see lots of great outcomes.

Sunday 1 June 2008

Open Research

In response to what Paul just blogged about...

I'm curious how in the long run blogging about ongoing research will transform communication within the research community. I've already been curiously following blogs by some PhD students such as Mark or Yves who are very open about their ongoing work and ideas.

I don't think blogs can replace publishing papers at research conferences or journals. However, I wouldn't be surprised to see more and more references to blog posts in conference papers in the future.

Blogs would be the perfect communication platform for researchers if:

  1. there were a guarantee that a blog post will be around forever (i.e., that researchers 20 years from now can go back and look at it)
  2. it were not possible to alter any information published on a blog (or it were at least possible to detect that something has been altered); this includes not being able to change the date when an idea or result was first published


It seems that one way to overcome both limitations would be to have authorities frequently crawl, store, and index blog posts related to research. Another option might be to have something like a "research mode" on popular blogging service providers such as blogger.com: if the blogger opts into this mode, then the researcher won't be able to ever change his blog posts (including deletion) and the blog posts will be indexed by search engines.

However, even as publishing preliminary research results on blogs becomes an accepted standard I wouldn't be surprised if some unfortunate researchers without ideas of their own consider an idea published on a blog to be not published at all and try to publish it at a conference with their own names on it without referencing the source. I'm sure though, that such attempts would ultimately fail as the blogging research community would point them out quickly.

I wouldn't be surprised if a few researchers starting to use blogs to communicate ongoing results will trigger a snowball effect. For example, if Paul starts blogging about ongoing research that someone else is currently working on, wouldn't that other researcher feel urged to publicly state that he or she is also working on the same topic? Otherwise, by the time this other person publishes results, everyone might think that those ideas were just copied from Paul's blog?

Looking at how research has developed over the past centuries, the direction we have consistently been heading in seems very obvious: more openness and getting results out faster. Research blogs seem like a very natural next step in the evolution of communication in the science community.