Sunday 27 May 2007

Music Similarity: G1C Implementation

I’ve been planning to do this for over a year: The MA (Music Analysis) Toolbox for Matlab now finally includes the G1C implementation which I described in my thesis (btw, the code probably also runs in the freely available Scilab in case you don’t have Matlab). The code is packaged as required for the MIREX’06 evaluation, where the implementation was overall fastest and scored highest (but not significantly better than other submissions).
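In case you want the core idea without digging through the Matlab code: the spectral part of G1C summarizes a song's MFCC frames with a single Gaussian and compares two songs via a closed-form symmetrized Kullback-Leibler divergence. Below is a minimal Python sketch of that distance — the function name and the toy data are mine for illustration, and this is not the toolbox implementation (which also includes fluctuation-pattern features and careful numerics):

```python
import numpy as np

def skl_gaussian(m1, c1, m2, c2):
    """Symmetrized KL divergence between two full-covariance Gaussians
    (means m1, m2; covariances c1, c2), up to a constant factor."""
    ic1, ic2 = np.linalg.inv(c1), np.linalg.inv(c2)
    d = m1 - m2
    # 0.5 * [tr(C1^-1 C2) + tr(C2^-1 C1) + d^T (C1^-1 + C2^-1) d] - k
    return 0.5 * (np.trace(ic1 @ c2) + np.trace(ic2 @ c1)
                  + d @ (ic1 + ic2) @ d) - len(m1)

# toy usage: model each "song" by the Gaussian of its MFCC-like frames
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, (500, 4))   # 500 frames, 4 coefficients
b = rng.normal(0.5, 1.0, (500, 4))   # shifted distribution
d_ab = skl_gaussian(a.mean(0), np.cov(a.T), b.mean(0), np.cov(b.T))
d_aa = skl_gaussian(a.mean(0), np.cov(a.T), a.mean(0), np.cov(a.T))
```

The distance of a song to itself comes out as zero, and songs with different frame distributions get a positive distance.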

The code might be useful for those who are new to the field and just want a quick start. Btw, last October I held a presentation on music similarity which might also be helpful for starters. The best documentation and explanation of the code I can offer is my thesis.

I also hope the implementation is somehow useful for those interested in comparing their work on computational models of music similarity to work by others. I believe the best option to do so is to conduct perceptual tests similar to those I conducted for my thesis and those done for MIREX’06 (btw, I wrote some comments about the MIREX’06 evaluation here).

A much easier approach to evaluating many different algorithms is to use a genre classification scenario (assuming that pieces from the same genre are generally more similar to each other than pieces from different genres). However, this doesn't replace perceptual tests; it just helps pre-select the algorithms (and their parameters). Btw, I think it would even be interesting for those working directly on genre classification to compare G1C (combined with an NN classifier) against their genre classification algorithms.
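As a sketch of how such an evaluation works (the function name and toy distance matrix below are made up for illustration): compute the full distance matrix with your similarity measure, then measure leave-one-out nearest-neighbour genre classification accuracy:

```python
import numpy as np

def loo_nn_accuracy(dist, genres):
    """Leave-one-out nearest-neighbour genre classification accuracy.

    dist: (n, n) symmetric distance matrix from a similarity measure;
    genres: length-n list of genre labels.
    """
    dist = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(dist, np.inf)       # never pick the query song itself
    nn = dist.argmin(axis=1)             # nearest neighbour of each song
    return float(np.mean([genres[i] == genres[j]
                          for i, j in enumerate(nn)]))

# toy example: two tight genre clusters, so every nearest neighbour
# has the right genre
dist = [[0., 1., 9., 9.],
        [1., 0., 9., 9.],
        [9., 9., 0., 1.],
        [9., 9., 1., 0.]]
acc = loo_nn_accuracy(dist, ['rock', 'rock', 'jazz', 'jazz'])  # → 1.0
```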

There are lots of things to be careful about when running evaluations based on genre classes (or other tags associated with music). Most of all I think everyone should be using an artist filter: The test set and the training set shouldn’t contain music from the same artists. Some previous work reported accuracies of up to 80% for genre classification. I wouldn’t be surprised to see some of those numbers drop to 30% if an artist filter had been applied.
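The artist filter is easy to add to a leave-one-out scheme: when classifying a song, exclude all songs by the same artist from the candidate neighbours. A small Python sketch (function name and toy data are illustrative only, not from the toolbox):

```python
def loo_nn_accuracy_artist_filter(dist, genres, artists):
    """Leave-one-out NN genre accuracy where songs by the query's own
    artist are excluded as candidate neighbours."""
    n = len(genres)
    correct = 0
    for i in range(n):
        # artist filter: drop the query itself and all same-artist songs
        cand = [j for j in range(n) if artists[j] != artists[i]]
        j = min(cand, key=lambda j: dist[i][j])
        correct += (genres[i] == genres[j])
    return correct / n

# toy example: each song's closest match is by the same artist, so an
# unfiltered NN classifier scores 100% by "recognizing" the artist;
# once same-artist songs are filtered out, accuracy collapses
dist = [[0., 1., 5., 4.],
        [1., 0., 4., 5.],
        [5., 4., 0., 1.],
        [4., 5., 1., 0.]]
genres  = ['rock', 'rock', 'jazz', 'jazz']
artists = ['A', 'A', 'B', 'B']
acc = loo_nn_accuracy_artist_filter(dist, genres, artists)  # → 0.0
```

On this toy data the filtered accuracy drops from 100% to 0%, which is an exaggerated version of exactly the effect described above.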

I first noticed the impact of an artist filter when I was doing some work on playlist generation. In particular, I noticed that songs from the same artist appeared very frequently in the top 20 most similar lists for each song, which makes sense (because usually pieces by the same artists are somehow similar). However, some algorithms which were better than others in identifying songs from the same artists did not necessarily perform better in finding similar songs from other artists. I reported the differences in the evaluation at ISMIR’05, discussed them again in my MIREX'05 submission, and later in my thesis. An artist filter was also used for the MIREX’06 evaluation. Btw, I’m thankful to Jean-Julien Aucouturier (who was one of the reviewers of that ISMIR’05 paper) for some very useful comments on that. His thesis is highly relevant for anyone working on computational models of music similarity.
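That same-artist effect is easy to quantify: for each song, look at its top-k most similar list and count how often the query song's own artist shows up. A short illustrative sketch (again, names and toy data are mine, not from the toolbox):

```python
import numpy as np

def same_artist_fraction_topk(dist, artists, k=20):
    """Average fraction of each song's k nearest neighbours that are
    by the same artist as the query song."""
    dist = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(dist, np.inf)          # exclude the query song
    topk = np.argsort(dist, axis=1)[:, :k]  # k nearest per song
    arts = np.asarray(artists)
    return float(np.mean([(arts[row] == arts[i]).mean()
                          for i, row in enumerate(topk)]))

# toy example: every song's single nearest neighbour is by its own artist
dist = [[0., 1., 5., 4.],
        [1., 0., 4., 5.],
        [5., 4., 0., 1.],
        [4., 5., 1., 0.]]
frac = same_artist_fraction_topk(dist, ['A', 'A', 'B', 'B'], k=1)  # → 1.0
```

A measure that pushes this fraction up is good at artist identification, but as noted above, that does not automatically make it better at finding similar songs by *other* artists.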

Another thing to consider when running evaluations based on genre classes is to use different music collections with different taxonomies to measure overfitting. For example, one collection could be the Magnatune ISMIR 2004 training set and one could be the researcher’s private collection. It can easily happen that a similarity algorithm is overfitted to a specific music collection (I demonstrated this in my thesis using a very small collection). Although I was careful to avoid overfitting, G1C is slightly overfitted to the Magnatune collection. Thus, even if G1C outperforms an algorithm on Magnatune, the other algorithm might still be much better in general.

There’s some room for improvement in this G1C implementation in terms of numerical issues, and some parts could be coded a lot more efficiently. However, I’d recommend trying something very different. Btw, I recently noticed how much easier it is to find something that works much better when you have lots of great data. I highly recommend using Last.fm’s tag data for evaluations; there’s even an API.


Anonymous said...

It is good to hear that interested people do not have to reconstruct G1C from your thesis anymore :) However, the last paragraph of this blog entry caught my interest even more. I wonder what exactly it is that you mean by "I'd recommend trying something very different", and I am also very curious about the details of the alternative approach based on "lots of great data" that you mention. Moreover, I kind of understood from your previous post that you are looking at ways to use content-based techniques to improve similarities obtained from humans (e.g. by using collaborative filtering). Hopefully, your ISMIR paper(s) will elaborate a bit more on some of these things :)

Elias said...

Klaas, what I meant by "I'd recommend trying something very different" is that I don't think it would make sense to just try to tweak the parameters of G1C. Instead, for example, I think that what you've been doing makes a lot of sense (finding better ways to combine different features etc.).

What I meant with "great data" are the tags that users associate with music (and which are available through Last.fm's API). Songs have been tagged with emotions, genres, styles, instruments, and many other very useful words.

I doubt there are any tags for the Magnatune music collection, but I'm sure there are lots of tags for Brian Whitman's USPOP 2002, which I believe Dan Ellis is still distributing (in the form of MFCCs).

Regarding the alternative approach I've mentioned... maybe we'll publish it at ISMIR next year. But I'd prefer to see someone else in the MIR community come up with something better, because it's still far from perfect.

Btw, if you are interested in doing some work on combining collaborative filtering data with content-based techniques you might want to check out Last.fm's data (up to 2005), which is available here. And you might also want to check the API to obtain more recent data.

Anonymous said...

The link to your presentation is broken; just remove the extra character at the end and it should work. Thanks for the great code!