MIR Research: 2008

Thursday, 18 September 2008

ISMIR 2008 Demos

One of the best parts about ISMIR was the demo session.

Paul Lamere and Francois Maillet demonstrated Explaura which enables users to directly interact with tag clouds by resizing individual tags for tag-based recommendations. Increasing the size of a tag puts more emphasis on it, shrinking it reduces the impact to the point where it's ignored. Increasing the "negative size" of a tag filters results by the respective tag. They also allow combining artists with tags in the search query. Paul blogged about it. There's also a short ISMIR abstract. I like the idea of interacting with the individual tags in a tag cloud and found it very intuitive to use.
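
To make the idea of steering by tag size concrete, here is a minimal sketch in Python 3. It is not Explaura's actual algorithm; the function name `rank_artists`, the toy affinities, and the scoring rule are all assumptions for illustration. A positive weight emphasizes a tag, a weight of zero ignores it, and a negative weight filters out every artist carrying that tag.

```python
def rank_artists(artist_tags, tag_weights):
    """Rank artists by weighted tag affinity (illustrative only).

    artist_tags: {artist: {tag: affinity in [0, 1]}}  (hypothetical data)
    tag_weights: {tag: weight}; > 0 emphasizes, 0 ignores, < 0 filters.
    """
    banned = {t for t, w in tag_weights.items() if w < 0}
    scores = {}
    for artist, tags in artist_tags.items():
        # a negatively weighted tag removes the artist entirely
        if banned & set(tags):
            continue
        score = sum(w * tags.get(t, 0.0)
                    for t, w in tag_weights.items() if w > 0)
        if score > 0:
            scores[artist] = score
    return sorted(scores, key=scores.get, reverse=True)


# toy data, made up for this example
ARTIST_TAGS = {
    "Artist A": {"indie": 0.9, "electronic": 0.2},
    "Artist B": {"indie": 0.4, "electronic": 0.8},
    "Artist C": {"indie": 0.7, "emo": 0.6},
}

# "electronic" is drawn three times as large as "indie"; "emo" is filtered
print(rank_artists(ARTIST_TAGS,
                   {"indie": 1.0, "electronic": 3.0, "emo": -1.0}))
```

Growing the "electronic" tag moves Artist B ahead of Artist A, and the negative weight on "emo" drops Artist C from the list altogether.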

Another demo which was presented (and I already previously blogged about) was the work of Martin Gasser and his colleagues on how they integrated audio-similarity into FM4 Soundpark (a platform for independent Austrian artists). The numbers they presented show that using audio-similarity helped Soundpark users find older and more obscure items in the catalogue. It would be nice to see more real world applications using audio-similarity.

Òscar Celma and Marcelo Nunes presented GeoMuzik which allows drawing a route on a world map. Their system then generates a playlist according to this route. They implemented genre/tag filters. They can also visualize the artists in my Last.fm profile on top of a map. There's an ISMIR abstract. I liked the demo very much and would like to play a bit more with it. The picture below shows a screenshot of a playlist generated from the interface.

Luke Barrington and his colleagues demonstrated their new tagging game Herd it which they implemented in the Facebook application framework. If I understood them correctly they still want to do some testing before releasing it publicly, so I'll write more when they are ready. Anyway, I've heard from others how difficult it can be to get something to work seamlessly on Facebook. I'm impressed that Luke and his colleagues are doing this. It would be nice to see researchers use Facebook and similar platforms more frequently for their work.

One demo I unfortunately didn't have enough time to see (but at least I got the handout) was the work Anita Lillie presented. There's also a video. In the video Anita shows different ways to visualize the same music collection from different perspectives using PCA (principal component analysis, a linear projection of a high-dimensional space onto 2 dimensions). Seems like Anita just finished her MSc thesis on MusicBox: Navigating the space of your music. I've been told the demonstration was implemented in Processing. Very nice! One thing I didn't see was playcounts. It would be great if playcounts (or ratings per track) could also be visualized (size of the circles?). The picture below shows a screenshot of her MusicBox.

Another demo I really liked was Claudio Baccigalupo's work on Poolcasting. The idea is to have several people tune into the same radio station at the same time, and to have them rate the songs, and use that rating to optimize the overall listening experience. I could easily see Gwen and myself sharing the same radio stream frequently. What I found particularly interesting is how Claudio tries to maximize happiness for everyone in a situation where compromises are unavoidable and where he does not want the majority to completely ruin the experience for an individual.

Btw, if you like demos, you might also be interested in Last.fm's playground.

Wednesday, 17 September 2008

MIREX 2008

This year's MIREX evaluation task has been one of my personal ISMIR 2008 highlights. Stephen Downie and his team computed more numbers than I could possibly keep track of for lots of different algorithms in 12 different MIR tasks. That's a lot more than in any of the previous years, and that's a lot of interesting data to dig into.

I've been particularly interested in the auto-tagging task. It's the first time MIREX ran this form of task, and there have been only a few research papers in the MIR community on the subject. As far as I understood there is no agreement yet as to how exactly to evaluate the algorithms, which is also reflected on the result page. Kris West has added information on the statistical significance of the results, which shows that none of the submissions was consistently and significantly better than the others. Nevertheless, there's a lot to learn from the evaluation and I hope we'll see many more participants next year.

Paul has a good summary of the discussion of this year's MIREX panel.

1000 years of music to listen to

Youngmoo Kim had asked everyone on the ISMIR recommendation panel to briefly summarize what they think will happen in the next 5 years of music recommendation. However, it's really hard to do so in less than 4 minutes.

From my perspective the most interesting development in the next 5 years will be the increase in the amount of data we will be working with. We will have a lot more of the same and we will have additional sources. Combining different sources is an interesting challenge, but the main challenge will be to scale things up.

All of this additional information will lead to much better recommendations overall, and in particular in the long tail. We'll be able to detect new trends such as up-and-coming artists or the emergence of a new subgenre much sooner. We'll be able to localize recommendations a lot more.

At the same time there'll obviously be a lot more music to choose from. I'd roughly estimate about 200 million tracks in Last.fm's recommendation engine in the next 5 years. That's more than 1000 years of continuous listening. Subcultures and genres will emerge faster.
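
A quick back-of-the-envelope check of that estimate, assuming an average track length of 3.5 minutes (an assumed value for illustration, not a Last.fm figure):

```python
TRACKS = 200000000                   # estimate from the text
AVG_TRACK_MINUTES = 3.5              # assumption: typical track length
MINUTES_PER_YEAR = 60 * 24 * 365.25

years = TRACKS * AVG_TRACK_MINUTES / MINUTES_PER_YEAR
print("%.0f years of continuous listening" % years)
```

Under that assumption the catalogue comes to roughly 1,300 years, and even with 3-minute tracks it stays above 1000.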

In 5 years recommendation engines will have a much better understanding of listeners. While Last.fm, Pandora, and others already do a lot to understand what listeners are interested in, I'm sure there is room for a lot more improvements.

Another interesting development I'm looking forward to is data portability and openness. In particular, I'm looking forward to users being able to move freely with their personal data from one site to another. Similar to how Last.fm users can already today allow other sites to access their data.

I'm also expecting to see a lot more artists and labels embrace recommendation engines. Similar to SEO (search engine optimization), more artists and labels will try to do a lot more REO (recommendation engine optimization).

Obviously mobile applications will be very important, and so will mobile music recommendations. And I have no doubts that human-to-human recommendations (which are strongly supported by Last.fm) will continue to be very important, maybe even more than they are today.

Anthony Volodkin made a great point that we'll see a lot happen in terms of user interfaces, how recommendations are represented, and how recommendations are explained. I believe Paul Lamere would call that steerable and transparent recommendations. I like how Last.fm explains a recommendation in terms of a bunch of similar artists I'm familiar with. However, there's obviously room for a lot more. On the other hand, I wouldn't mind no explanation at all, as long as every recommendation is spot on. Anthony also made a great point by pointing to playful discovery systems.

I believe it was Brian Whitman who said that recommendations will be a commodity. Every music site will have recommendations. Just like almost every web 2.0 site out there supports tagging. I believe Etienne Handman made a similar point when he previously explained to me why he expects the word "personalization" to fade away. Everything will be personalized, it will be the default option.

Tuesday, 16 September 2008

ISMIR Thoughts

Inspired by Paul's constant flow of ISMIR blog posts I thought I should give it a try and post some thoughts as well.

First of all, the organizers did a wonderful job organizing ISMIR. I already wrote about what I think about the electronic proceedings. I was also very happy to see that they did not waste unnecessary resources and skipped the silly conference bag thing.

I very much enjoyed giving the social tags tutorial with Paul. It worked out really well, and although I had known Paul's slides for weeks, I found it fascinating to listen to Paul talk about them.

So far ISMIR has by far exceeded my expectations. I've had the pleasure to meet many in person that I previously hadn't had the opportunity to meet. I've also had the pleasure to see many very interesting posters. Unfortunately I've also managed to miss many that I wanted to see. I guess there's never enough time to see everything.

Last night's banquet was great too. In particular I enjoyed the conversations with Etienne from Pandora.

The recommendation panel today was fun, too.

Thursday, 11 September 2008

ISMIR 2008 Tutorial: Social Tags and Music Information Retrieval

When Paul originally asked me if I'd be interested in helping him put together a tutorial proposal for ISMIR I was a bit reluctant. I did an ISMIR tutorial a long time ago, and I didn't forget how much work it was (although it was only a one hour mini tutorial). In fact, only last year Paul and Oscar told me how much work it was to put together their recommendation tutorial (which I liked a lot).

Anyway, somehow I couldn't say no and looking back I don't regret it at all. I'm totally fascinated by social tags. There's no tutorial topic I'd rather talk about. Working together with Paul has been a great pleasure, and I've learned a lot. However, I'm very much looking forward to having a free weekend or even just a free evening again. (Free as in not-MIR-related.)

Paul and I have put together about 200 slides for our tutorial. I think we still need to figure out our time budget. Hopefully we'll have plenty of time for interesting discussions.

Btw, as part of the tutorial we started compiling a list of relevant papers. That list started to grow. Then we thought it would be a good idea to group papers into topics (e.g. autotagging). Then we realized that several papers were in several categories... which is when we moved everything to delicious. In particular, we started using the tag "SocialMusicResearch" to mark interesting things we found on the Internet. Here's a list of some of the items we have tagged. We hope that others will start using that tag as well.

Btw, if you are going to ISMIR, please say hi! If you don't know what I look like: try to spot the guy wearing a Last.fm t-shirt.

Wednesday, 27 August 2008

ISMIR Proceedings 2008! Wow!

I'm extremely impressed. The ISMIR proceedings are online. Whoever wants a printed copy can organize it themselves (it couldn't be much easier). Some might also want to only print the papers they are interested in. And some might be happy to have only an electronic version.

It's always been a pain to drag the heavy ISMIR proceedings home. And it always felt like a huge waste of paper.

I heard Juan and Youngmoo talk about this idea a year ago in Vienna (at last year's ISMIR). I'm very happy to see that they found a solution that should make everyone happy.

Juan writes in his email:
We hope that you will like this new approach to printing the proceedings which we intend to be more cost effective, more convenient, and, with luck, more environmentally friendly than mass printing of proceedings for all attendees who may not wish to carry a printed copy around.

Wonderful! :-)

Monday, 25 August 2008

Librarians and Tags

Yves pointed me to this really nice presentation by a librarian who seems to have a really good understanding of tagging. The only part missing in that presentation is music.

Libraries and the Hive Mind: Folksonomies and Tagging

Getting Last.fm Tags for MP3s with Python

Paul (who is already distributing a large chunk of Last.fm tags) and I are planning to include a few slides in our ISMIR tutorial on how to obtain tag data.

Below is some Python code that basically takes an MP3 file as input and outputs a list of Last.fm tags (for both artist and track). The MP3s don't need correct ID3 tags, but they need to be full length (clips won't work).

The Python code uses Norman's command line fingerprinting client to find the correct artist and track name. The path to the executable needs to be set in the code. Norman supports Win32, OSX Intel, Linux - 32.

The output is written to a file. For each MP3 file passed as argument there are up to two rows in the output file: one for the artist tags, and one for the track tags. Each row has the format: "<mp3filename> <encoded artist or artist/track name> <tag> <score> [<tag> <score> ...]". Tabs are used as delimiters.
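
To illustrate the row format, here is a small sketch (written for Python 3, unlike the script below) that parses one such row back into its parts. The function name `parse_row` and the sample row are made up for illustration, and the sketch assumes the scores are numeric.

```python
def parse_row(line):
    """Split one output row into (mp3, item, [(tag, score), ...])."""
    fields = line.rstrip("\n").split("\t")
    mp3_file_name, item = fields[0], fields[1]
    rest = fields[2:]
    # remaining fields alternate: tag, score, tag, score, ...
    pairs = [(rest[i], float(rest[i + 1]))
             for i in range(0, len(rest) - 1, 2)]
    return mp3_file_name, item, pairs


# hypothetical row: file name, encoded artist, then tag/score pairs
row = "song.mp3\tSome%20Artist\trock\t100\tindie\t42\n"
print(parse_row(row))
```

A row without any tags simply yields an empty list of pairs.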

The data from the Last.fm API is available under the Creative Commons Attribution-Non-Commercial-Share Alike License.

Btw, special thanks to Eric Casteleijn for various Python recommendations (lxml etc). (Which reminds me that I still need to fix the other Python code I posted.) As usual any feedback is much appreciated.

import subprocess, sys, re, time, urllib
from lxml import etree

FP_CLIENT_PATH = '"C:\\fpclient\\lastfmfpclient.exe"'
MAX_RETRIES_URL_OPEN = 5

def getArtistTrack(mp3FileName): # ret: (artist, track)
    # quote the file name so that paths with spaces work
    command = FP_CLIENT_PATH + ' "' + mp3FileName + '"'
    pipe = subprocess.Popen(command, \
                            stdout=subprocess.PIPE).stdout
    for line in pipe:
        mo = re.search('<url>.*/([^/]+)/_/(.+)<',line)
        if mo:
            return urllib.quote(mo.group(1)), \
                   urllib.quote(mo.group(2))
    print "ERROR: failed to get artist/track for: " + \
          mp3FileName

def crawlTags(url): # ret: [(tag, count), ...]
    for i in xrange(MAX_RETRIES_URL_OPEN):
        tagCounts = []
        time.sleep(1) # be nice!
        try:
            root = etree.parse(
                urllib.urlopen(url)).getroot()
        except IOError:
            print "(%d/%d) Failed trying to get: %s." % \
                (i+1, MAX_RETRIES_URL_OPEN, url)
        else:
            for tag in root.iter('tag'):
                tagCounts.append(
                    (tag.find('name').text, \
                     tag.find('count').text))
            return tagCounts
    return [] # all retries failed: no tags rather than None

def tags(prefix, items, outStream): # crawl and write
    for mp3FileName, item in items:
        url = prefix + item + '/toptags.xml'
        print url
        tagCounts = crawlTags(url)
        outStream.write('%s\t%s\t%s\n' %
                (mp3FileName, item, '\t'.join(
                    tag + '\t' + str(count)
                    for tag, count in tagCounts)))    

def main():
    if len(sys.argv)<3:
        print 'USAGE: python getTags.py ' + \
              '<outFile> <f1.mp3> [<f2.mp3> ...]'
        sys.exit(2)
    outFile = sys.argv[1]
    mp3FileNames = sys.argv[2:]
    artists = set()
    artistTracks = set()
    for mp3FileName in mp3FileNames:
        print 'Fingerprinting: ' + mp3FileName
        result = getArtistTrack(mp3FileName)
        if result is None: # fingerprinting failed, skip file
            continue
        artist, track = result
        artists.add((mp3FileName, artist))
        artistTracks.add((mp3FileName,
                          artist + '/' + track))

    print 'start crawling tags'
    o = open(outFile,'w');
    tags('http://ws.audioscrobbler.com/1.0/artist/', \
         artists, o)
    tags('http://ws.audioscrobbler.com/1.0/track/', \
         artistTracks, o)    
    o.close()    

if __name__ == "__main__":
    main()

Sunday, 24 August 2008

Tagging Critics

I was doing some research for the ISMIR tag tutorial when I stumbled upon (via this interesting paper Playing Tag: An Analysis of Vocabulary Patterns and Relationships Within a Popular Music Folksonomy by Abbey E. Thompson):

The following excerpt from this paper:

[...] "tags are often ambiguous, overly personalised and inexact" [...] "The result is an uncontrolled and chaotic set of tagging terms that do not support searching as effectively as more controlled vocabularies do." [...]

This was published in the D-Lib magazine in early 2006. I wouldn't be surprised if by now the authors realized they were wrong.

But why would anyone ever want to control the vocabulary people use when describing something so extremely multifaceted and something that evolves so fast like the content on the web (delicious), or snapshots of life (flickr), or music (Last.fm)? I guess I'd need to think more like an old-skool librarian to understand that.

Saturday, 23 August 2008

Tagging Games (Callabio on Facebook)

Did Microsoft researchers clone the scoring system MajorMiner uses for a tagging game on Facebook? Well, maybe it's not identical, but it seems kind of similar. And I guess it could be lots of fun.

Btw, if scoring points is what drives people to play that game, what would stop them from entering a whole dictionary?

UK Hadoop User Group Meeting

Last Tuesday the first UK Hadoop user group meeting took place in London. Johan did a great job in organizing it and speakers included Doug Cutting from Yahoo! (who leads the Hadoop project).

Here are some links:

Yahoo! developer network blog mentioning the event. (Includes a video of interviews with some of the presenters.)
Doug Cutting: Hadoop overview
Tom White: Hadoop Web Services on Amazon S3/EC2
Steve Loughran: Deploying Apache Hadoop with Smartfrog
Mark Butler: Distributed Lucene for Hadoop
Last.fm related talks:

Seems like the interesting talk by Miles Osborne on "Using Nutch and Hadoop for Natural Language Processing" is still missing on skills matter's website. I'll update this blog post when the talk is added.

Saturday, 2 August 2008

More MIR related PhDs

I've added 5 dissertations to the incomplete list of MIR related PhDs. I'm particularly happy that Markus, Tomoyasu, and Kazuyoshi (all of them are former colleagues of mine) finished their theses so successfully.

Tomoyasu Nakano recently finished his thesis on "A Study on Developing Applications Systems Based on Singing Understanding and Singing Expression" (in Japanese). He has now joined Masataka Goto's research group as a postdoc and is working on the CrestMuse project. He recently received a lot of attention for his work on optimizing Vocaloid parameters.

Kazuyoshi Yoshii recently finished his thesis on "Studies on Hybrid Music Recommendation Using Timbral and Rhythmic Features". Kazuyoshi was awarded a tenure position at the AIST which is pretty impressive. He joined Masataka's group and is working on CrestMuse.

Markus Schedl recently finished his thesis on "Automatically Extracting, Analyzing, and Visualizing Information on Music Artists from the World Wide Web". As part of his thesis he crawled a very impressive number of web pages to build a retrieval system for 600,000 artists. In the next months he'll be finishing his business studies.

Steven Travis Pope's thesis is already a bit older (2005) and serves as perfect example of how incomplete the list of PhDs is which I maintain. His work was on "Software Models and Frameworks for Sound Composition, Synthesis, and Analysis: The Siren, CSL, and MAK Music Languages".

The 5th PhD thesis I added to the list is Matt Wright's work on "Computer-Based Music Theory and Acoustics" which he completed at CCRMA, Stanford University.

Saturday, 26 July 2008

Creepy Recommendations

I just got some recommendations from an algorithm that were so good that it was creepy. (One of the recommendations was this video.)

It made me realize how such recommendations can be a nice shortcut for a machine into someone's heart. (Although it only takes a few wrong recommendations to be kicked out again.)

I wonder if in the near future I'll have gotten used to the idea that an algorithm attached to my attention profile data will know me better than any human being could (and I'm not just talking about music).

Friday, 25 July 2008

The New Last.fm

I guess this post is a bit off topic. But I’d argue that Last.fm is one of the main MIR related web sites out there, and I found the launch of the new site very exciting.

Here’s a link to the main announcement on our blog (with over 2000 immediate responses from users, most of them negative). The blog post includes links to the forums where the feedback continued after the comments needed to be closed.

Here’s a Last.fm group with over 11,000 users asking to bring back the old design. The forums of that group are also an interesting read. Some users have even worked on ways (e.g. using greasemonkey) to bring back the old look and feel.

Here is one of many youtube videos of people complaining.

And there are even some conspiracy theories.

The negative feedback was being voiced in many different places. Including, for example, in the comment section of an article by the Times Online which focused on how the changes related to advertising.

(Btw, there’s also been a lot of positive feedback, too. For example, there are some positive comments in this digg article.)

When I joined Last.fm over a year ago there were like a million things I thought that could be improved on the site. I was bugging those in charge of the web page design on a daily basis. However, they were already aware of almost everything I was pointing out, and explained that there will be a major redesign coming, and that things will be fixed then. As time went by I started to appreciate the complications of making changes to the old site. It was a site that had grown very quickly in many different directions that weren’t fitting together perfectly anymore.

So more than a year ago my colleagues had started making plans on how they would design the perfect Last.fm if they could start from scratch. I’d also like to mention that almost all of my colleagues are hardcore Last.fm users (and have lots of friends who are Last.fm users, and spend time talking to their moms (or other less technology savvy users) about what difficulties they might have using Last.fm). I think it’s fair to say that most of my colleagues have an excellent understanding of the various issues related to the user experience.

However, at the same time the old site was still in full development. A lot of new features were being launched, integrated, bugs fixed, etc. Maintaining the old site was a full time job for a small (but quickly expanding) team, and my colleagues were spending any free minute they had on completely redesigning Last.fm. I’d also like to add that when they talked about change, it wasn’t simply the design, navigation, and structure that were being considered. A lot of changes involved some serious backend changes and new features. It really was all about making the dream of a perfect Last.fm come true.

Btw, here's an interesting podcast where Hasso Plattner (founder of SAP) talks (among many other things) about the challenges of developing the next version of a product while maintaining the previous version. I'm pretty proud of how my colleagues have mastered this challenge.

When the site went live everyone knew that there was still plenty of room for improvements, but at the same time it was clear that the benefits would largely outweigh the remaining issues. And we also knew that even if the new site were absolutely perfect from the start (which it obviously wasn't), it wouldn't be easy for those who were able to use the old site blindfolded. This included myself: there were several moments of serious frustration where I knew what I wanted to do on the old site, but couldn’t instantly figure out how to do it on the new one. (Btw, continuing to maintain two sites was not really an option.)

Despite all the negative feedback we received (and despite all the effort my colleagues are currently putting into fixing issues the Last.fm community has pointed out), there are already several indicators that the new site might be an even bigger success than we would have hoped for, and most of all it has paved the way for a lot more to come. It’s never been more fun to work at Last.fm! Btw, check out the jobs at Last.fm :-)

Monday, 21 July 2008

Recommended Book: Probability and Statistics

The last two weeks I was camping north of London. Because I didn't want to drag a whole library with me I decided to take along only one book on statistics. From the books next to my bed I selected the lightest one which is Schaum's Outline: Probability and Statistics. And it was one of the best packing decisions I made.

It all started in one of my favorite book stores in London where I almost ignored the book in the first place. The book stuck out on the shelf of statistics books because of its height and because of its ugly front cover design. The pages felt like those of a telephone book. My expectations were as low as its price tag (which was £12) and I ignored it. However, the shop only offered a very limited selection of books on statistics. So eventually I turned back to it out of curiosity wondering how bad a book on statistics could be. And then I stumbled upon one of the many interesting problems in the book and tried to solve it, and then I found the next interesting problem, and decided to move to the attached cafe. By the time the store was about to close I didn't want to part with the book. (One example of the fun problems in the book is: given 6 randomly sampled observations from a continuous population what is the probability that the last 2 are higher than the first four? (One way to solve it is to use calculus, another way is to use combinatorics.))
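
For the curious, here is a sketch of the combinatorial route to that example problem, checked by brute force (Python 3.8 or later). Since the population is continuous, ties have probability zero and only the rank order of the 6 observations matters, with all 6! orderings equally likely. The last 2 are higher than the first four exactly when the two largest values sit in the last two positions, which is 1 of the C(6,2) = 15 equally likely choices of positions for them.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

# combinatorial argument: positions of the two largest values
analytic = Fraction(1, comb(6, 2))

# brute force over all rank orderings of 6 distinct observations
orderings = list(permutations(range(6)))
hits = sum(1 for p in orderings if min(p[4:]) > max(p[:4]))
brute_force = Fraction(hits, len(orderings))

print(analytic, brute_force)
```

Both routes give 1/15.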

The book does not only feature the ugliest front cover of any of the books I ever owned, it also contains many typos. And every time I tried to use the index it seemed to point me to random pages. For example, one typo can be found in the introduction to the multinomial distribution where they forgot an important exclamation mark. I don't understand how they managed to include so many errors in this second edition. In fact sometimes I wondered if errors were included to keep the reader alert. However, none of the errors I found were hard to identify as such. (In the case of the multinomial distribution formula there is an example just a few lines below the typo which uses the correct formula.)

Overall the book is amazing. I had a hard time choosing between packing up my tent in the rain or waiting for the rain to stop while reading a few more pages and drinking some hot tea (btw, I also highly recommend Trangia stoves).

The book is definitely suitable for people (like me) who work with probabilities and statistics on a daily basis but feel like they lack a solid foundation. It helps if you've had a basic course on statistics a long time ago and just want to refresh your knowledge. However, I think it is also largely and easily accessible to anyone who has not had any courses on statistics (although some understanding of calculus will help a lot).

The best part of the book is that it features lots of practical problems that help understand the theoretical concepts. The book is also structured in a way that makes it very easy to spend 30 minutes or less at a time with it. The topics covered include nonparametric tests, curve fitting, regression, and hypothesis testing.

Btw, I can also recommend: Old Man, Loch Lomond (and the West Highland Way), walking around the beaches of Holy Island at low tide (and reading books on the beach in front of Bamburgh), walking along some mountain ridge anywhere in the Highlands, listening to the choir in Durham cathedral, sleeping next to Hadrian's Wall, extreme hill walking in the Lake District... and Gwen recommends Bill Bryson: Notes from a small Island.

Tuesday, 1 July 2008

VocaListener

Yesterday and today I had the pleasure to spend some time with Masataka Goto talking about how MIR technologies are changing how we create and enjoy music, and in particular what Masataka calls active music listening. It was also great to get some updates on what's happening on the other end of the world.

One thing I found particularly interesting is the VocaListener project which Tomoyasu Nakano (who recently finished his PhD) and Masataka Goto are working on. VocaListener is based on Vocaloid 2 and synthesizes a singing voice that is very hard to discriminate from a real singer.

Here is a video featuring the synthesized voice using Vocaloid 2 (and special techniques to tune the parameters).

VocaListener received a lot of coverage in Japan and some of it has been translated to English, for example: here, here, and here.

Sunday, 29 June 2008

Last.fm's API, Python, and tagging behaviour (Part 2)

Update: (2008/08/25) I fixed the same kind of bug I had in the previous post on this topic. While fixing it I decided to rerun it with a sample of 5000 instead of 2000 users. The code is fixed and the data is updated.

Last night I was a bit tired and quickly concluded "the numbers show...". But do they really?

To find out I use Python to do some simple statistical analysis to add some weight to the claim that older Last.fm users have a larger vocabulary and tag items more frequently.

First, I compute 95% confidence intervals for the percentage of non-taggers in each age group. Seeing the large margins (in the table below) helps explain why the age group 25-30 has a higher percentage than the age group 19-22. However, it doesn’t help explain why the age group 22-25 has a lower percentage. (I’d blame that on the relatively small and skewed sample, and I’d argue that they are still reasonably similar, both in average age and deviations of the percentages). (With the larger sample size this is not the case any longer.)

Computing the confidence interval is very easy. A user is either a tagger or not. The probability within an age group can thus be modeled with a Bernoulli distribution. The 95% confidence intervals for a Bernoulli distribution can be computed with:

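z = 1.96 # norminv(0.975) for a 95% confidence interval
def binom_confidence(p, n):
    # half-width of the normal-approximation interval; the
    # approximation is only reasonable if n*p*(1-p) >= 5
    if n*p*(1-p) < 5:
        raise ValueError("sample too small for normal approximation")
    return z*(p*(1-p)/n)**0.5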

Btw, I couldn’t find the equivalent to the Matlab norminv function in Python. Any pointers would be appreciated!
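
One possible pointer: SciPy offers `scipy.stats.norm.ppf` for this, and in Python 3.8 and later the standard library's `statistics.NormalDist` has an inverse CDF as well. The sketch below uses the standard library version; the wrapper name `norminv` just mirrors Matlab, and the numbers in the last line are hypothetical.

```python
from statistics import NormalDist

def norminv(q, mu=0.0, sigma=1.0):
    """Inverse of the normal CDF, like Matlab's norminv."""
    return NormalDist(mu, sigma).inv_cdf(q)

z = norminv(0.975)  # quantile for a two-sided 95% interval
print(round(z, 2))

# half-width of the interval for a hypothetical p=0.4, n=500
print(z * (0.4 * 0.6 / 500) ** 0.5)
```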

To test the hypothesis that the vocabulary size of a tagger depends on her or his age I test the following: Given my observations, are all age groups likely to have the same vocabulary size, i.e., are the differences I observed just random noise? Since the distributions within each age group are far from Gaussian I can’t use a standard ANOVA. Instead I use the non-parametric version of a one-way ANOVA which is the Kruskal-Wallis test. In particular, I use the test to compute a p-value. The p-value is the probability that I would have observed differences at least this large if the hypothesis that there is no difference between age groups were true. (Thus smaller p-values are better. Usually one would expect at least a value below 0.05 before accepting an alternative hypothesis.) In this case the resulting p-value is nice and low, indicating that it's extremely unlikely that the differences between the age groups are just random noise.
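
To show what the test computes, here is a minimal pure-Python sketch (Python 3) of the Kruskal-Wallis H statistic. It makes simplifying assumptions that the real analysis does not: no tied values, and exactly three groups, in which case H is approximately chi-square distributed with 2 degrees of freedom and the p-value has the closed form exp(-H/2). The function name and the toy data are made up; for real data (ties, any number of groups) `scipy.stats.kruskal` is the better choice.

```python
from math import exp

def kruskal_wallis_3(a, b, c):
    """H statistic and approximate p-value for three groups without ties."""
    groups = [a, b, c]
    pooled = sorted(x for g in groups for x in g)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle ties")
    rank = {x: i + 1 for i, x in enumerate(pooled)}  # ranks start at 1
    n = len(pooled)
    # sum over groups of (rank sum)^2 / group size
    s = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * s - 3 * (n + 1)
    p = exp(-h / 2)  # chi-square survival function for 2 degrees of freedom
    return h, p


# toy vocabulary sizes for three made-up age groups
h, p = kruskal_wallis_3([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(h, p)
```

The idea is that the test only looks at ranks, which is why it does not need the Gaussian assumption of a standard ANOVA.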

Here are the results, and below is the Python code.


age   || % non taggers || tagger's median 
                          vocabulary size
14-19 ||     41.3-48.1 ||  6
19-22 ||     37.5-43.6 ||  7
22-25 ||     40.6-47.3 ||  9
25-30 ||     34.8-41.4 ||  8
30-60 ||     28.4-36.0 || 13
Kruskal-Wallis p-value: 1.04e-008

from scipy.stats import kruskal
from numpy import asarray, median # median is used in print_stats

def print_stats(age_tags):
    age_groups = (14,19,22,25,30,60)
    ll = []
    print "age   || % non taggers || " + \
           "tagger's median \n" + \
           "                          vocabulary size"
    for i in xrange(0,len(age_groups)-1):
        nonzeros = [];
        zero_count = 0
        for j in xrange(age_groups[i],age_groups[i+1]):
            for item in age_tags[j]:
                if item!=0:
                    nonzeros.append(item)
                else:
                    zero_count += 1
        conf = binom_confidence(
               zero_count/float(zero_count+len(nonzeros)), 
               zero_count+len(nonzeros))
        ll.append(nonzeros);
        print \
    "%d-%d || %8.1f-%.1f || %2d" % \
            (age_groups[i],age_groups[i+1], 
            (zero_count/float(max((1,len(nonzeros)+\
                                   zero_count)))-conf)*100, 
            (zero_count/float(max((1,len(nonzeros)+\
                                   zero_count)))+conf)*100, 
            median(nonzeros))
    p = kruskal(*(asarray(ll[i]) for i in 
                 xrange(len(age_groups)-1)))[1]
    print "Kruskal-Wallis p-value: %.2e" % p

~~Btw, that part where I use eval to convert my lists into function arguments could hardly be any uglier. I’m sure there must be a better way of doing that?~~ (Thanks Klaas!)
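The star in the `kruskal(*...)` call of the listing above is Python's argument unpacking: it turns a sequence of lists into separate positional arguments, so no eval is needed. A minimal Python 3 sketch of the idea follows. To keep it free of dependencies, a hand-rolled Kruskal-Wallis H statistic stands in for SciPy's kruskal. The helpers `average_ranks` and `kruskal_h` are hypothetical, and the sketch leaves out the tie correction and the p-value.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next value is tied with the current one
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        avg = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def kruskal_h(*groups):
    """Uncorrected Kruskal-Wallis H; assumes every group is non-empty."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    offset = 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        total += rank_sum ** 2 / len(g)
        offset += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)


groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # made-up data, one list per group
h = kruskal_h(*groups)  # same as kruskal_h(groups[0], groups[1], groups[2])
print("H = %.1f" % h)   # H = 7.2
```

Because the function accepts `*groups`, the same call works for any number of age groups without building a string to evaluate.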

Last.fm's API, Python, and tagging behaviour

Update: (2008/08/25) I fixed the bug pointed out by thisfred. And I noticed that what I thought was the percentage of non-taggers was actually the ratio of non-taggers vs taggers... I changed that now.

My colleagues completely redesigned the Last.fm API. Inspired by their efforts and all the amazing things the Last.fm community has already built with the old API I decided that I wanted to try doing something with the API as well. The first thing that came to my mind was to use the public API to show that younger people have a smaller tagging vocabulary than older people. I couldn't figure out how to get a user's age from the new API so I used the old one. Anyway, here are the results and I also included the Python script I used. (Btw, any feedback on my Python coding is very welcome, I'm still very much a Python newbie.)

I crawled about 2000 users starting with RJ as seed. The first column is the age group. The second column is the fraction of users who haven't used any tags, followed in parentheses by the number of users who haven't used tags / the number of users who have. The last number is the median number of unique tags used by those users who have applied tags.


14-19: zeros: 0.44 (120/155), median tags: 6
19-22: zeros: 0.42 (153/215), median tags: 6
22-25: zeros: 0.38 (114/184), median tags: 9
25-30: zeros: 0.43 (141/188), median tags: 10
30-60: zeros: 0.31 (79/179), median tags: 11

The numbers show that older users tag more and apply more unique tags.
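The difference between the fraction of non-taggers and the ratio of non-taggers vs taggers mentioned in the update is easy to check with the counts of the first row. This is a small Python 3 sketch; `non_tagger_stats` is a hypothetical helper, not part of the script below.

```python
def non_tagger_stats(zero_count, tagger_count):
    """Return (fraction of non-taggers, ratio of non-taggers to taggers)."""
    total = zero_count + tagger_count
    fraction = zero_count / total if total else 0.0
    ratio = zero_count / tagger_count if tagger_count else float("inf")
    return fraction, ratio


# counts from the 14-19 row: 120 non-taggers, 155 taggers
fraction, ratio = non_tagger_stats(120, 155)
print("fraction: %.2f, ratio: %.2f" % (fraction, ratio))
# fraction: 0.44, ratio: 0.77
```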

from xml.dom import minidom
from urllib import quote, urlopen
from time import sleep
from numpy import median
from collections import defaultdict

seed = 'RJ' # start with Last.fm's CTO
MAX_RETRIES_URL_OPEN = 5

def get_xml(url):
    for i in xrange(MAX_RETRIES_URL_OPEN):
        try:
            sleep(1) # be nice!
            return minidom.parse(urlopen(url))
        except IOError:
            print "(%d/%d) Failed trying to get: %s." % \
                (i+1, MAX_RETRIES_URL_OPEN, url)
    # give up instead of silently returning None
    raise IOError("Could not get: %s" % url)

def get_friends(user, friends, ignore_friends):
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
            + quote(user) + u'/friends.xml'
    xmldoc = get_xml(url)
    xmlusers = xmldoc.getElementsByTagName("user")
    for xmluser in xmlusers: # don't shadow the user argument
        u = xmluser.getAttribute("username")
        if u not in ignore_friends:
            friends.add(u)
    print "%d/%d" % (len(friends), len(ignore_friends))
    return friends    

def get_age(user):
    ''' returns zero if user has not set his or her age '''
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
            + quote(user) + u'/profile.xml'
    xmlage = get_xml(url).getElementsByTagName("age")
    if len(xmlage)==0: return 0
    return int(xmlage[0].firstChild.nodeValue)

def get_tags(user):
    url = u'http://ws.audioscrobbler.com/1.0/user/' \
        + quote(user) + u'/tags.xml'
    return len(get_xml(url).getElementsByTagName("tag"))
        
def print_stats(age_tags):
    age_groups = (14,19,22,25,30,60)
    for i in xrange(0,len(age_groups)-1):
        nonzeros = []
        zero_count = 0
        for j in xrange(age_groups[i],age_groups[i+1]):
            for item in age_tags[j]:
                if item!=0:
                    nonzeros.append(item)
                else:
                    zero_count += 1
        print \
    "%d-%d: zeros: %.2f (%d/%d), median tags: %d" % \
            (age_groups[i],age_groups[i+1], 
            zero_count/max((1,float(len(nonzeros)+ \
                             zero_count))),
            zero_count, len(nonzeros), median(nonzeros))


users_notvisited = set([seed])
users_visited = set()

while len(users_notvisited)>0 and \
    len(users_notvisited) + len(users_visited)<2000:
    user = users_notvisited.pop()
    if user not in users_visited:
        users_notvisited = \
            get_friends(user, users_notvisited, \
            users_visited)
        users_visited.add(user)

users = users_notvisited.union(users_visited)

age_tags = defaultdict(list)
i = 0
for user in users:
    i += 1
    print "%d/%d" % (i, len(users))
    age_tags[get_age(user)].append(get_tags(user))
    if i % 5 == 0:
        print_stats(age_tags)

Thursday, 26 June 2008

Matlab, Python, and a Video

I've been using Matlab extensively for probably almost 10 years. I have written more lines of code in Matlab than in any other language. I always have at least one Matlab application window open. I've probably generated at least a few million Matlab figures (one of my favorite Matlab functions is close all). I've written three small toolboxes in Matlab (and all of them have actually been used by people other than me). I've told anyone who was willing to listen that I couldn't have gotten even a fraction of my work done without Matlab. In fact, 3 times in a row I convinced the places I've been working at that I needed a (non-academic) license for Matlab and several of its toolboxes. I even had a Matlab sticker on my old laptop for a long time. I frequently visited the Matlab news group and I'm subscribed to several Matlab related blogs. If I had needed to take a single tool with me to a remote island it would have been Matlab. I guess it's fair to say I was in love with Matlab.

However, I always felt that it wasn't a perfect relationship. Matlab is expensive. Matlab is not pre-installed on the Linux machines I remotely connect to. In fact, installing Matlab on Linux is a pain (compared to how easy it is to install it on Windows). Furthermore, not everyone has access to Matlab making it harder to share code. Finally, Matlab can be rather useless when it comes to things that are not best described in terms of matrices that fit into memory, and I can't easily run Matlab code on Hadoop.

I had been playing with Python out of curiosity (wondering why everyone liked it so much) but I guess I was too happy with Matlab to seriously consider alternatives. But then Klaas showed me how to use Python with Hadoop. Within a very short time I started to use Python more and more for things I usually would have done in Matlab. Now I write more Python code a day than Matlab code. I still use Matlab on a daily basis, but if I had to choose between Matlab and Python, it would be a very easy choice. SciPy and related modules are wonderful. If I redid my PhD thesis, it wouldn't include a single line of Matlab code and instead lots of Python code :-)

Btw, James pointed me to the following visualization showing activities and shared code of Python developers over time. This is by far the best information visualization I have seen in a very long time. I really like the idea and implementation. I wonder if something similar could be done for a piece of music where the coders are replaced with instruments, and the files are replaced with sounds.

code_swarm - Python from Michael Ogawa on Vimeo.

Wednesday, 25 June 2008

ISMIR'08 Student Travel Award

It's wonderful to see Sun Microsystems sponsoring student travel awards for this year's ISMIR. Submission deadline for applications is July 4th - very soon!

I highly recommend applying for an award even if it might seem like a bureaucratic burden. Sure, any student who gets an award will still need to find additional sources of funding. However, it's always easier to find smaller amounts of money, and as a researcher it is not unusual to spend a lot of time writing project proposals asking for grants. Student travel awards are a great way to start practicing! And writing one page yourself and asking your professor to write a recommendation is actually not a lot of effort. Btw, professors deal with recommendations very frequently, they shouldn't complain if you ask them to write you one :-)

I remember when I received a student travel award for the ACM KDD 2003: It was lots of fun because students who won the award also got a chance to participate in the organization. And helping in the organization of such a huge conference was a great experience.

Tuesday, 24 June 2008

Late Night Thoughts

Last night I went to bed and fell asleep listening to a podcast that Paul recommended. I didn't stay awake for long but I remember that the interviewer seemed skeptical about some of the ideas of open research.

I guess it's natural to be skeptical when something is radically different from what we are used to. I wonder what Newton would have said if someone had told him to publish or perish and that there are plenty of good journals that publish articles within a year of submission (including the review process). It took Newton almost 22 years to publish his findings, and that was not so unusual back in 1687.

Using those two data points (22 years in 1687 and 1 year in 2008) a simple linear model would suggest that a submit-publish cycle might take less than 3 months in 2020. While I can't see how that linear model could be a good fit, I can easily see how pushing a publication out within 3 months could be achieved with tools similar to what we know as blogs today.
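The arithmetic behind that estimate, as a small Python 3 sketch (the helper `linear_extrapolate` is a made-up name): a straight line through the two data points is simply extended to 2020.

```python
def linear_extrapolate(x0, y0, x1, y1, x):
    """Value at x of the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y1 + slope * (x - x1)


# 22 years in 1687, 1 year in 2008, extrapolated to 2020
years = linear_extrapolate(1687, 22.0, 2008, 1.0, 2020)
print("%.2f years = %.1f months" % (years, years * 12))
# 0.21 years = 2.6 months
```

The same line would hit zero shortly after 2023, which is another hint that a linear model can't be the right one.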

Furthermore, looking at historic data it's also not too hard to see that, contrary to what some might expect, we will see fewer disputes with respect to who discovered what first.

Btw, tonight I'll try to stay awake as long as possible with this podcast about earworms.

Thursday, 19 June 2008

More MIR related PhDs

I just updated the list of MIR PhDs. The most notable update is the thesis by Adam Berenzweig. Unfortunately his thesis is not publicly available for download, but it seems like he has found an answer to why we've been observing hubs using certain similarity measures for music. Quoting from his abstract: "A practical problem with this technique, known as the hub phenomenon, is explored, and we conclude that it is related to the curse of dimensionality."

I'm sure there are still plenty of dissertations missing from the list and I'll be happy to add any that are sent to me. (I'll also be happy to update any broken links or missing information...)

Btw, the list is slowly approaching 100 entries now. The Matlab script I wrote to generate the html files and statistics now takes almost 5 seconds to complete.

Friday, 13 June 2008

Myths about Last.fm tags

Today I was pointed to the following: "Last.fm has thousands of tags, unfortunately they are all pretty bad." (A statement made in this video of a very interesting talk about autotagging and applications that can be built using tags, around minute 51.)

I think this needs some clarification: Last.fm has a lot more than just a few thousand tags. The 1,000,000th unique tag applied by a Last.fm user was earthbeat about half a year ago.

Related links: fun stats on Last.fm tags, Last.fm's multi-tag search.

Tuesday, 10 June 2008

Machine Learning Rant

This rant is inspired by this wonderful blog post which I found through Greg Linden's blog.

Most people who've worked with me might know that I'm very skeptical about using machine learning algorithms. In most of my work I've avoided using them as a solution to a problem.

My problem with machine learning algorithms is the way they are used. I think a beautiful example to illustrate how they are misused is genre classification. Countless papers have been published claiming around 80% classification accuracy. There have even been a number of papers indicating that this 80% is close to the disagreement level between humans (i.e. the machine's 80% is as good as the genre classification performance of any human).

Anyone who has seriously looked at such trained genre classifiers in more detail will have wondered why the measured accuracy is almost perfect and yet the results on a new data set are often not so satisfactory. The simple explanation is that instead of genre classification accuracy most researchers have been measuring artist classification accuracy, because training and test sets often included pieces from the same artists and most pieces of an artist belong to the same genre. (I've been arguing for the use of an artist filter since 2005, and yet I still see lots of papers published which ignore the issue completely...)
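To make the idea of an artist filter concrete, here is a minimal Python 3 sketch, assuming each track is a (track_id, artist, genre) tuple. The function name `artist_filtered_split`, the `test_fraction` and `seed` parameters, and the example data are all made up for illustration. The only point is that the split is made over artists, so no artist ends up in both the training and the test set.

```python
import random
from collections import defaultdict


def artist_filtered_split(tracks, test_fraction=0.3, seed=0):
    """Split (track_id, artist, genre) tuples so that all tracks of an
    artist land on the same side of the train/test split."""
    by_artist = defaultdict(list)
    for track in tracks:
        by_artist[track[1]].append(track)
    artists = sorted(by_artist)
    random.Random(seed).shuffle(artists)  # reproducible shuffle
    n_test = max(1, int(round(len(artists) * test_fraction)))
    test = [t for a in artists[:n_test] for t in by_artist[a]]
    train = [t for a in artists[n_test:] for t in by_artist[a]]
    return train, test


tracks = [
    ("t1", "Artist A", "rock"), ("t2", "Artist A", "rock"),
    ("t3", "Artist B", "jazz"), ("t4", "Artist B", "jazz"),
    ("t5", "Artist C", "rock"), ("t6", "Artist D", "jazz"),
]
train, test = artist_filtered_split(tracks)
train_artists = {artist for _, artist, _ in train}
test_artists = {artist for _, artist, _ in test}
assert not (train_artists & test_artists)  # no artist on both sides
```

This sketch does not balance genres between the two sets. With few artists per genre a stratified variant would be needed.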

Anyway, the point is that people using machine learning algorithms often consider their problem to be solved if their model performs well on their test set. I often have the impression that no effort is made to understand what the machine has learned.

Most of the time when I explain that I'm skeptical about machine learning I'm confronted with raised eyebrows. How can someone seriously challenge best practice methods? Surely anyone who is skeptical about machine learning has not understood what it is about?

Despite all the responses I've received so far, today after having read the wonderful blog post mentioned above, I feel like there are lots of people out there (many of whom are surely a lot smarter than me) who are skeptical about the use of machine learning algorithms. The next time I'm trying to explain why I think that blindly using and trusting a machine learned model is not a solution I'll point to Google's ranking of search results :-)

Having said that, I think there is a lot of potential for machine learning algorithms that generate human readable explanations for the models they generate. In particular, in the same way any human data analyst uses experience, common sense, and data to justify any decision when building a model, I'd like to see machine learning algorithms do the same. In addition, it would be nice if (like a human learner) the algorithm could point to possible limitations which cannot be foreseen based solely on the given data.

I guess I should also add that all of this is just a matter of definitions. By machine learning I mean black boxes which people use without bothering to understand what happens inside them. In contrast to my definition, many consider statistical data mining to be an important part of machine learning (which I sometimes don't, because it requires specific domain knowledge, as well as human learning, understanding, and judgment). Furthermore, I have no doubts that Google applies machine learning algorithms all the time, and that combining machine learning with human learning is a very natural and easy step, contrary to what my rant above might suggest.

Monday, 9 June 2008

Quaero

While trying to catch up with my emails I stumbled upon Geoffrey Peeters' mail in the Music-IR list. He still has 2 open positions in the Quaero project, which is a huge project I know almost nothing about. A quick Google search revealed that Quaero is French, it's worth 200M€, will last 5 years, has 24 partners, and is Latin for "I'm searching".

I guess only a tiny fraction of Quaero will be dealing with music and MIR research. However, a tiny fraction of 200M€ is still huge, and the topics Geoffrey mentions are very interesting: genre/mood classification, music similarity, chorus detection, ... Quaero also seems to have a strong emphasis on gathering and annotating data that can be used to evaluate and improve algorithms.

Btw, there is some more information about Quaero in this presentation.

I found it interesting that the administrative work on Quaero started in August 2005. Operational work (research etc) will start November 2008. I wonder how much overhead costs they have and how much flexibility they still have?

On slide #7 the concrete innovations are mentioned (and 5 examples are given). They mention detecting soccer goals in a recording of a soccer match, identifying the songs in a sound track, and using automatic translation to enable people to search in different languages. I wonder if they had to finalize the innovative outcomes of Quaero back in 2005?

Overall it seems like part of Quaero's strategy is to enter markets which have already been covered by many startups as well as large players such as Google. In particular, it almost seems like Quaero is intended to help Exalead survive a bit longer? (Exalead is a French search engine startup. I briefly tried their search and it seemed a bit slow. When searching for "music information retrieval" they fitted 1.25 proper search results on my 15.4" screen; the rest was ads and other information I don't care about... I'm not surprised they aren't doing well - not even in France.)

Anyway, IRCAM is leading Quaero's music technology, so I expect we'll see lots of great outcomes.

Sunday, 1 June 2008

Open Research

In response to what Paul just blogged about...

I'm curious how in the long run blogging about ongoing research will transform communication within the research community. I've already been curiously following blogs by some PhD students such as Mark or Yves who are very open about their ongoing work and ideas.

I don't think blogs can replace publishing papers at research conferences or journals. However, I wouldn't be surprised to see more and more references to blog posts in conference papers in the future.

Blogs would be the perfect communication platform for researchers if:

there were a guarantee that a blog post will be around forever (i.e., that researchers 20 years from now can go back and look at it)
it were not possible to alter any information published on a blog (or at least it were possible to detect if something has been altered); this includes not being able to change the date when an idea or result was first published

It seems that one way to overcome both limitations would be to have authorities frequently crawl, store, and index blog posts related to research. Another option might be to have something like a "research mode" on popular blogging service providers such as blogger.com: if the blogger opts into this mode, then the researcher won't ever be able to change his or her blog posts (including deletion) and the blog posts will be indexed by search engines.
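Detecting alterations does not require storing full copies everywhere. As a rough Python 3 sketch, and purely as an assumption about how such a crawler might work, a third party could record a cryptographic digest of each post together with its publication date. Any later change to the text or the date would then produce a different digest. The function `post_fingerprint` and the example strings are hypothetical.

```python
import hashlib


def post_fingerprint(published, title, body):
    """SHA-256 digest over the publication date, title, and text of a post."""
    record = "\n".join((published, title, body)).encode("utf-8")
    return hashlib.sha256(record).hexdigest()


original = post_fingerprint("2008-06-01", "Open Research",
                            "First version of the idea.")
edited = post_fingerprint("2008-06-01", "Open Research",
                          "Changed version of the idea.")
backdated = post_fingerprint("2008-05-01", "Open Research",
                             "First version of the idea.")
print(original != edited, original != backdated)  # True True
```

The digest only proves that something changed. Who keeps the digests, and why they should be trusted, is the harder part of the problem.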

However, even as publishing preliminary research results on blogs becomes an accepted standard I wouldn't be surprised if some unfortunate researchers without ideas of their own consider an idea published on a blog to be not published at all and try to publish it at a conference with their own names on it without referencing the source. I'm sure though, that such attempts would ultimately fail as the blogging research community would point them out quickly.

I wouldn't be surprised if a few researchers starting to use blogs to communicate ongoing results will trigger a snowball effect. For example, if Paul starts blogging about ongoing research that someone else is currently working on, wouldn't that other researcher feel urged to publicly state that he or she is also working on the same topic? Otherwise, by the time this other person publishes results, everyone might think that those ideas were just copied from Paul's blog?

Looking at how research has developed over the past centuries, the direction we have consistently been heading in seems very obvious: more openness and getting results out faster. Research blogs seem like a very natural next step in the evolution of communication in the science community.