
I’ve been planning to do this for over a year: The
MA (Music Analysis) Toolbox for Matlab now finally includes the G1C implementation which I described in my
thesis (btw, the code probably also runs in the freely available
Scilab in case you don’t have Matlab). The code is packaged as required for the
MIREX’06 evaluation, where the implementation was the fastest overall and scored highest (though not significantly better than other submissions).
The code might be useful for those who are new to the field and just want a quick start. Btw, last October I held a
presentation on music similarity which might also be helpful for starters. The best documentation and explanation of the code I can offer is my thesis.
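To give a rough idea of what working with the code looks like, here is a minimal quick-start sketch. The function names below are only illustrative, not necessarily the ones in the release; please check the toolbox documentation for the exact interfaces:

    % minimal quick-start sketch (function names are illustrative;
    % see the toolbox documentation for the exact interfaces)
    [wav_a, fs] = wavread('song_a.wav');   % load two mono PCM files
    [wav_b, fs] = wavread('song_b.wav');

    g1c_a = ma_g1c(wav_a, fs);             % extract G1C features (single Gaussian
    g1c_b = ma_g1c(wav_b, fs);             % of MFCCs plus fluctuation pattern summaries)

    d = ma_g1c_dist(g1c_a, g1c_b);         % distance between songs: smaller = more similar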
I also hope the implementation is useful in some way for those interested in comparing their work on computational models of music similarity to the work of others. I believe the best option for doing so is to conduct perceptual tests similar to those I conducted for my thesis and those done for MIREX’06 (btw, I wrote some comments about the MIREX’06 evaluation
here).
A much easier approach to evaluating many different algorithms is to use a genre classification scenario (assuming that pieces from the same genre are generally more similar to each other than pieces from different genres). However, this doesn’t replace perceptual tests; it just helps pre-select the algorithms (and their parameters). Btw, I think it would even be interesting for those working directly on genre classification to compare G1C (combined with a nearest-neighbor classifier) against their genre classification algorithms.
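To make this concrete, here is a minimal sketch of such a leave-one-out nearest-neighbor evaluation, assuming a precomputed n-by-n distance matrix D (e.g. G1C distances) and a vector genre of numeric genre IDs:

    % leave-one-out nearest-neighbor genre classification
    % D: n-by-n distance matrix, genre: n-by-1 vector of numeric genre IDs
    n = size(D, 1);
    D(logical(eye(n))) = inf;        % a song must not be its own nearest neighbor
    correct = 0;
    for i = 1:n
        [tmp, nn] = min(D(i, :));    % index of the nearest neighbor
        correct = correct + (genre(nn) == genre(i));
    end
    accuracy = correct / n;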
There are lots of things to be careful about when running evaluations based on genre classes (or other tags associated with music). Most of all, I think everyone should be using an artist filter: the test set and the training set shouldn’t contain music from the same artists. Some previous work reported accuracies of up to 80% for genre classification. I wouldn’t be surprised to see some of those numbers drop to 30% if an artist filter were applied.
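Adding an artist filter to the leave-one-out sketch above only takes one extra line; the assumption here is a vector artist of numeric artist IDs:

    % the same leave-one-out evaluation with an artist filter
    % artist: n-by-1 vector of numeric artist IDs
    n = size(D, 1);
    correct = 0;
    for i = 1:n
        d = D(i, :);
        d(artist == artist(i)) = inf;   % exclude all songs by the same artist
                                        % (this also excludes the query itself)
        [tmp, nn] = min(d);
        correct = correct + (genre(nn) == genre(i));
    end
    accuracy_filtered = correct / n;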
I first noticed the impact of an artist filter when I was doing some work on
playlist generation. In particular, I noticed that songs from the same artist appeared very frequently in the top 20 most similar lists for each song (a quick way to check this is sketched below), which makes sense: pieces by the same artist are usually somewhat similar to each other. However, algorithms that were better than others at identifying songs from the same artist did not necessarily perform better at finding similar songs by other artists. I reported the differences in the evaluation at
ISMIR’05, discussed them again in my
MIREX'05 submission, and later in my thesis. An artist filter was also used for the MIREX’06 evaluation. Btw, I’m thankful to
Jean-Julien Aucouturier (who was one of the reviewers of that ISMIR’05 paper) for some very useful comments on that. His
thesis is highly relevant for anyone working on computational models of music similarity.
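Checking for this effect in your own collection is easy. A minimal sketch, again assuming a distance matrix D and a vector artist of numeric artist IDs:

    % fraction of same-artist songs in the top 20 most similar lists
    n = size(D, 1);
    D(logical(eye(n))) = inf;            % ignore self-similarity
    same = 0;
    for i = 1:n
        [tmp, idx] = sort(D(i, :));      % ascending: most similar first
        same = same + sum(artist(idx(1:20)) == artist(i));
    end
    same_artist_ratio = same / (n * 20); % e.g. 0.25 = a quarter are same-artist hits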
Another thing to consider when running evaluations based on genre classes is to use different music collections with different taxonomies to measure overfitting. For example, one collection could be the
Magnatune ISMIR 2004 training set and another could be the researcher’s private collection. It can easily happen that a similarity algorithm is overfitted to a specific music collection (I demonstrated this in my thesis using a very small collection). Although I was careful to avoid overfitting, G1C is slightly overfitted to the Magnatune collection. Thus, even if G1C outperforms another algorithm on Magnatune, that algorithm might still be much better in general.
There’s some room for improvement in this G1C implementation in terms of numerical issues, and some parts could be coded a lot more efficiently. However, I’d recommend trying something very different instead. Btw, I recently noticed how much easier it is to find something that works much better when you have lots of great data. I highly recommend using Last.fm’s tag data for evaluations; there’s even an
API.
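For example, getting an artist’s top tags is just one HTTP request away. A minimal sketch (the URL scheme is from memory, so please double-check it against the API documentation):

    % sketch: fetch an artist's top tags via the Audioscrobbler web services
    % (URL scheme from memory; check the API documentation before relying on it)
    name = 'Metallica';
    url = ['http://ws.audioscrobbler.com/1.0/artist/' name '/toptags.xml'];
    xml = urlread(url);                         % download the XML as one string
    tags = regexp(xml, '<name>([^<]*)</name>', 'tokens');
    for k = 1:min(5, length(tags))
        disp(tags{k}{1});                       % print the top 5 tag names
    end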