Monday, 17 September 2007

MIREX Results Online!

The MIREX results just got posted by Stephen Downie. Interestingly, the organizers scored highest in a number of categories. To be honest, if I were a participant in a task like genre classification I'd be a bit suspicious. (Knowing the distribution of the genres beforehand can be a huge advantage when designing an algorithm.)

Congratulations to Tim Pohle and Dominik Schnitzer (two very clever PhD students I once worked with in Vienna), who scored highest in the audio similarity task. I wouldn't be surprised if they also had one of the fastest implementations. Tim also scored second highest in the same task last year. And Dominik recently made the results of his Master's thesis available (open-source playlist generation).

Congratulations also to Joan Serrà and Emilia Gómez (a former SIMAC colleague), who scored highest in the cover song identification task.

And congratulations to everyone who participated, and to the organizers for managing to complete all the tasks before ISMIR!


Kris West said...


"(Knowing the distribution of the genres beforehand can be a huge advantage when designing an algorithm.)"

I totally agree. This is why the genres, the genre hierarchy (used in scoring confusions, and available to anyone who can use it for learning) and the number of examples for each genre were posted to the genre classification page of the MIREX 2007 wiki some time ago.



Elias said...

@ Kris: OK, so even if the number of pieces per genre was known before the deadline: someone working at IMIRSEL could still have had a big advantage by knowing, e.g., what kind of "Classical" music was in the collection. Piano, orchestra, string quartets? How does "Classical" differ from the music in the "Baroque" category? Knowing all this could have helped in designing a specifically overfitted classifier that outperforms others on that collection. And all of this was known to IMIRSEL, so if IMIRSEL scores highest, I think it's only fair to be skeptical.

When MTG organized the first ISMIR evaluation in 2004 they clearly stated that they didn't participate in the tasks to avoid conflicts of interest. As long as MIREX isn't more transparent, I think clearly avoiding any such conflicts is the only way to go. It's kind of like not using the same data for training and testing.

As far as I'm concerned, anyone who had direct contact with the test data before the evaluation should not be listed in the same rankings as everyone else.

Btw, can I assume that no artist with pieces in the test set also had pieces in the training set? The genre classification and artist identification rankings are highly correlated (they are basically identical)...

Kris West said...

Hi Paul & Elias,


As always, you've acted in a gentlemanly fashion.


You have your answer on the tuning of the system. It is most likely not the answer you hoped for, but nevertheless it is the truth. The system was not tuned - in fact it was not tested on any dataset (from the competition or otherwise) beyond making sure it was writing feature values into its feature files, and it was in fact cobbled together in one evening.

The exact process was: Andreas Ehmann produced a feature extraction itinerary (mean and variance of MFCCs, deltas and accelerations, zero-crossings, flux and centroid). There was a problem with this itinerary (it was opening audio files and extracting features, but never closing the process and writing out to feature files, eventually throwing an exception due to too many open files). I took a look at it, fixed the data-flow bug, deleted the acceleration coefficients, added spectral entropy, RMS and roll-off (no great advances here; I've just found the covariance of the RMS and spectral entropy with other features useful of late) and finally swapped the mean and variance module for a mean and covariance module.

To be safe, Andy then applied two classifiers to the features in the classification itinerary (Weka kNN and polynomial SMO, default settings) and the itinerary was run on the datasets. The SMO version ended up being very close in style to the submission from Mandel and Ellis in 2005 - which is of course where it draws its inspiration from (using the mean and covariance of the MFCCs and other features - a refinement that you used in your thesis, Elias). To sum up, the itinerary was not tuned, was not tested, and was not trained on the test set or anything else you might accuse us of - it was the result of a few hours' tinkering. No one expected it to do well or to 'win', and that was not the intention - it was just supposed to NOT suck. To be honest, I am personally somewhat embarrassed by its performance - but I think it indicates that perhaps a hefty shift in technique would be required to *significantly* improve on current techniques. Further, the same code was run unmodified in all 4 tasks and will of course be released with M2K (all modules already exist and the itineraries will be added). If an independent run of the classification itinerary in question is required (on 3 tasks - I'm sure Elias takes no issue with the Mood results), I hope we can rely on Paul to provide the test conditions.
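For readers unfamiliar with the approach being described, the idea is to summarize each track's frame-level features by their mean and covariance and feed that single vector to an off-the-shelf classifier. A rough sketch of that recipe, using scikit-learn and synthetic data purely as a hypothetical stand-in for the actual M2K/Weka itinerary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def track_vector(frames):
    """Summarize frame-level features (n_frames x n_dims) as the
    mean plus the upper triangle of the covariance matrix."""
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    iu = np.triu_indices(cov.shape[0])
    return np.concatenate([mean, cov[iu]])

# Synthetic stand-in for per-frame features (MFCCs, flux, centroid, etc.):
# two made-up "genres" whose frame statistics differ slightly.
def make_tracks(offset, n=40, n_frames=200, dims=5):
    return [rng.normal(offset, 1.0, size=(n_frames, dims)) for _ in range(n)]

X = np.array([track_vector(t) for t in make_tracks(0.0) + make_tracks(0.6)])
y = np.array([0] * 40 + [1] * 40)

# A polynomial-kernel SVM, loosely analogous to Weka's SMO with default settings.
clf = SVC(kernel="poly")
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

The appeal of the mean-plus-covariance summary (as in Mandel and Ellis 2005) is that it captures how features co-vary across a track while producing a fixed-length vector regardless of track duration.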

Finally, Elias, this brings me to your behaviour. You've made unfounded accusations not out of any genuine concern, but in the gleeful recognition that you could hurt the IMIRSEL group - whom you regularly treat as a spoilt only child would when someone else is playing with their toys. Your frequently childish and impetuous behaviour over the last 2 years has never reflected well on you, your (brick-lane) employers, or the rest of the MIR research community. I have had to deal with your childish tirades in both private and public emails regularly in the course of each MIREX, and I'm fed up with it. You seem to have no detachment or impartiality at all, and have behaved in a fashion that no other member of the MIR community has ever felt the need to emulate.

My apologies to members of the evalfest and music-ir lists for my language - and that Elias felt the need to spill this over onto the lists. Should this thread continue in an inappropriate fashion, please remove myself and any other offenders from the music-ir and evalfest lists. I will endeavour not to respond to further blatant baiting.

Finally, I hope this doesn't detract from another incredible performance in these tasks: I suspect the differences in performance between the top few submissions in each task are not significant (we need a fair test for that - Stephen is unsure about the assumptions used in McNemar's test - suggestions?), and therefore in my book the clear 'winner' is George Tzanetakis and Marsyas - I'm not convinced I could even have decoded the MP3s in the time it took to run the Marsyas submission through the entirety of each task (kudos to George, Graham & Luis and the rest of the Marsyas team)!

Kris West

P.S. Genre splits were artist filtered and Artist ID splits were album filtered.
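The artist filtering mentioned here means that no artist's tracks appear in both the training and test side of any split - exactly the leak Elias asked about. In today's terms this is group-aware cross-validation; a small hypothetical sketch with made-up track and artist labels, using scikit-learn's GroupKFold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)

# Hypothetical collection: 30 tracks by 10 artists (3 tracks each).
artists = np.repeat(np.arange(10), 3)   # group label per track
X = rng.normal(size=(30, 4))            # stand-in feature vectors
y = rng.integers(0, 2, size=30)         # stand-in genre labels

# GroupKFold guarantees that no artist's tracks are split across
# the train and test sides of a fold -- the "artist filter".
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=artists):
    assert set(artists[train_idx]).isdisjoint(set(artists[test_idx]))
print("no artist appears in both train and test of any fold")
```

Without such filtering, a classifier can score well on genre simply by recognizing artists (production style, mastering, vocal timbre), which is why artist-filtered genre results are typically lower but more honest.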

Kris West said...

Below is a response originally sent to the music-ir and evalfest lists by Cameron Jones of the IMIRSEL team (who I hope doesn't mind me reproducing it):

The recent discussions on Elias Pampalk's and Paul Lamere's blogs on
whether or not it is fair for the IMIRSEL team to participate in the
MIREX tasks have upset several members of IMIRSEL, who always strive
for the utmost in fairness, thoroughness, and accuracy. As a member of
IMIRSEL, I can think of at least 4 reasons why IMIRSEL's submission is
legitimate and above the bar. It is my hope to dispel any hint of
impropriety that may have been cast by the recent discussions.
1. The features selected were based on the features used by other
labs in previous MIREXes, and other publicly available research.
2. The feature extractors we used were not developed by someone with
direct access to the data.
3. The classifiers we used were standard WEKA implementations.
4. IMIRSEL is the "Systems EVALUATION Lab", not the "Systems
Development Lab", and therefore does not engage in large-scale MIR
system building activities of any sustained length, meaning our
submission could not have been "tuned" as has been claimed.

1. Features were based on previous MIREX submissions.

The feature sets used in the IMIRSEL submission were based on a set of
features developed and used by Mandel and Ellis in MIREX 2005. Mandel
and Ellis had a submission which performed well in both the Audio
Artist Identification and Audio Genre Classification tasks in MIREX
2005. The IMIRSEL submission used several additional psychoacoustic
features based on the dissertation research of Kris West. To me, this
kind of submission embodies the goal of MIREX: the publication of
algorithms and the meaningful comparison of their performance, thus
allowing MIR researchers to make informed, justifiable decisions about
what algorithms to use. IMIRSEL did not systematically compare
possible feature sets against the data. We built one working feature
set and reported the findings. The EXACT SAME feature extractor was
used on ALL of our submissions (for all tasks in which IMIRSEL
participated). Our decision to use this feature set was based on
publicly knowable data and utilized no insider knowledge.

2. Feature extractors were developed by Kris West

The feature extractors used in the IMIRSEL submission were not
developed by anyone with direct access to this year's submission data.
In total, 2 feature extractors were written using M2K. The first
feature extractor, developed by Andreas Ehmann (a member of the
IMIRSEL lab), had a bug and crashed our servers, and thus did not
generate any meaningful features which could be used. Kris West (an
associate member of IMIRSEL, but not resident in Illinois) developed
the second feature extractor for the IMIRSEL submission. Kris did not
have any knowledge of, or direct access to, IMIRSEL's databases when
building the extractors, beyond what was available on the MIREX Wiki
pages (public knowledge of the task definitions). Although Kris is
affiliated with IMIRSEL, he is pursuing his own independent research
agenda, and is not active in the day-to-day operations of the lab, nor
in decisions about data management, beyond those posted to the public
MIREX Wiki.

3. Classifiers were not "tuned"

The classifiers we used were all standard Weka packages. We used
Weka's kNN and polynomial SMO classifiers. The Poly SMO classifier was
used with default parameters. The kNN submission was likewise run with
minimal configuration; I believe K was set to 9 because, when we used
the default value of 10, it crashed on one of the splits of the N-fold
validation. The belief that we may have iteratively tuned and
optimized our submission is simply wrong.

4. Our passion is evaluation.

Overall, IMIRSEL is about evaluation, not algorithm development. While
it is true that IMIRSEL is responsible for the development of M2K, we
do not usually spend our days thinking about how to develop new
algorithms, approaches, or feature sets. What we do spend a lot of
time working on is improving the design of MIREX, the design and
selection of tasks, the evaluation metrics we use in MIREX, the
validity of the results we have. We spend a lot of our time looking
over past MIREX results data and interpreting it, looking for patterns
and anomalies, and overall trying to make sure that MIREX is being
executed to the best of our abilities.

Because of this, we did not have an "in house" submission lying around
that we could have submitted. We were not working on our submission
for months beforehand, carefully selecting the feature sets, tuning
the classifier parameters, etc. We were too busy preparing for, and
then running, MIREX. Rather, our submission this year was an attempt
to demonstrate to the community the power and flexibility of some new
M2K modules which integrate existing music and data mining toolkits,
like Weka and Marsyas. M2K presents a robust, high-speed development
environment for end-to-end MIR algorithm creation. IMIRSEL's
submission was supposed to be an also-ran, developed in response to a
challenge from Professor Downie to see "what could be hacked together
in M2K, QUICKLY!!!! We do not have a lot of time to fuss." In reality,
the IMIRSEL submission was built in one evening.

Finally, as has been stated repeatedly, MIREX is not a competition and
there are no "winners". So, rather than wasting time arguing about
what is fair or not, we should be using this opportunity to learn
something. Why is it that IMIRSEL's algorithm performed as well as it
did (which is not, keep in mind, necessarily a statistically
significant performance difference from the next highest scores)?

M. Cameron Jones

Graduate Research Assistant
International Music Information Retrieval Systems Evaluation Lab

PhD Student
Graduate School of Library and Information Science

Anonymous said...

Can someone explain to me what MIREX does and what its goal is? I don't understand what it is supposed to accomplish. Dumb it down for me.