Tuesday, 18 September 2007

Overfitting and MIREX

IMIRSEL (the organizer of MIREX) hasn't officially responded yet to the conflict of interest inherent in organizing a non-transparent evaluation while at the same time participating in it. What I've heard from others is that they don't see any problem with it.

Btw, has anyone else noticed that they won in every classification category where overfitting is a big issue? However, in a closely related category (mood classification) where overfitting isn't an issue (thanks to the human component in the evaluation) they were outperformed by several others.

Furthermore, IMIRSEL never put their name down on the list of potential participants. Given the lack of transparency of the respective MIREX tasks, I think this is something every participant should have known before submitting their work. Btw, so far it isn't even known who the researchers are who actually did the work. AFAIK, no entry in the history of ISMIR evaluations has ever been submitted without its authors being named.

Btw, as of now, IMIRSEL are the only ones in the genre classification task who haven't yet published an abstract (describing what their algorithm does and how it was optimized). (They also haven't submitted one for the other tasks they won.)

Regarding anonymous MIREX submissions, I just remembered that the ISMIR 2004 evaluation hosted by MTG allowed anonymous submissions... and some authors did choose to do so. (However, as I already mentioned in the comments of this post: MTG clearly stated that they did not participate in the tasks they organized, to avoid any conflict of interest.)


Kris West said...


You have your answer on the tuning of the system. It is most likely not the answer you hoped for, but nevertheless it is the truth. The system was not tuned - in fact it was not tested on any dataset (from the competition or otherwise) beyond making sure it was outputting feature values into its feature files, and was in fact cobbled together in one evening.

The exact process was: Andreas Ehmann produced a feature extraction itinerary (mean and variance of MFCCs, deltas and accelerations, zero-crossings, flux and centroid). There was a problem with this itinerary (it was opening audio files and extracting features, but never closing the process and writing out to feature files - eventually throwing an exception due to too many open files). I took a look at it, fixed the data-flow bug, deleted the acceleration coefficients, added the spectral entropy, RMS and roll-off (no great advances here, I've just found the covariance of the RMS and spectral entropy with other features useful of late) and finally swapped the mean and variances for a mean and covariance module.

To be safe, Andy then applied two classifiers to the features in the classification itinerary (Weka kNN and polynomial SMO - default settings) and the itinerary got run on the datasets. The SMO version ended up being very close in style to the submission from Mandel and Ellis in 2005 - which is of course where it draws its inspiration from (using the mean and covariance of the MFCCs and other features - a refinement that you used in your thesis, Elias). To sum up, the itinerary was not tuned, was not tested, was not trained on the test set, or anything else you might accuse us of - it was the result of a few hours' tinkering. No one expected it to do well or to 'win' and that was not the intention - it was just supposed to NOT suck. To be honest, I am personally somewhat embarrassed at its performance - but I think it indicates that perhaps a hefty shift in technique would be required to *significantly* improve on current techniques. Further, the same code was run unmodified in all 4 tasks and will of course be released with M2K (all modules already exist and the itineraries will be added). If an independent run of the classification itinerary in question is required (on 3 tasks - I'm sure Elias takes no issue with the Mood results), I hope we can rely on Paul to provide the test conditions.
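For anyone wanting to picture the general shape of this pipeline, here is a minimal sketch of the summarization idea described above: frame-level spectral features collapsed into their per-song mean and covariance. This is a hypothetical numpy illustration with simplified stand-in features (zero-crossings, centroid, RMS), not the actual M2K itinerary:

```python
import numpy as np

def frame_features(signal, sr=22050, frame=1024, hop=512):
    """Per-frame zero-crossing rate, spectral centroid and RMS.
    Simplified stand-ins for the richer feature set described above."""
    feats = []
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    for start in range(0, len(signal) - frame, hop):
        w = signal[start:start + frame]
        zcr = np.mean(np.abs(np.diff(np.sign(w))) > 0)
        mag = np.abs(np.fft.rfft(w))
        centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
        rms = np.sqrt(np.mean(w ** 2))
        feats.append([zcr, centroid, rms])
    return np.array(feats)

def summarize(feats):
    """Collapse the frame-level trajectory into one song-level vector:
    the per-dimension means plus the upper triangle of the covariance
    matrix, mirroring the 'mean and covariance module' mentioned above."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False)
    iu = np.triu_indices(cov.shape[0])
    return np.concatenate([mu, cov[iu]])

# Toy usage on a synthetic one-second "song"
rng = np.random.default_rng(0)
sig = rng.standard_normal(22050)
vec = summarize(frame_features(sig))  # 3 means + 6 covariance entries
```

The real submission additionally used MFCCs, deltas, flux, spectral entropy and roll-off; the summarization step is the same regardless of which frame-level features go in.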

Finally, Elias this brings me to your behaviour. You've made unfounded accusations not out of any genuine concern but in the gleeful recognition that you could hurt the IMIRSEL group - whom you regularly act towards as a spoilt only child would when someone else is playing with their toys. Your frequently childish and impetuous behaviour over the last 2 years has never reflected well on yourself, your (brick-lane) employers or the rest of the MIR research community. I have had to deal with your childish tirades in both private and public emails regularly in the course of each MIREX and I'm fed up with it. You seem to have no detachment or impartiality at all and have behaved in a fashion that no other member of the MIR community has ever felt the need to emulate.

My apologies to members of the Evalfest and music-ir lists for my language - and that Elias felt the need to spill this over onto the lists. Should this thread continue in an inappropriate fashion, please remove myself and any other offenders from the music-ir and Evalfest lists. I will endeavour not to respond to further blatant baiting.

Finally, I hope this doesn't detract from another incredible performance in these tasks: I suspect the differences in performance between the top few submissions in each task are not significant (we need a fair test for that - Stephen is unsure about the assumptions used in the McNemar's test - suggestions?) and therefore in my book the clear 'winner' is George Tzanetakis and Marsyas - I'm not convinced I could have decoded the MP3s in the time it took to run the Marsyas submission through the entirety of each task (kudos to George, Graham & Luis and the rest of the Marsyas team)!
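For reference, since the assumptions of McNemar's test are in question: it compares two classifiers evaluated on the same test set using only the discordant counts (items one system got right and the other got wrong), and its chi-square approximation is only trustworthy when those counts aren't tiny. A stdlib-only sketch, not whatever test was ultimately adopted for MIREX:

```python
import math

def mcnemar(b, c):
    """McNemar's test with continuity correction.
    b = items only system A classified correctly,
    c = items only system B classified correctly.
    Returns the chi-square statistic and a two-sided p-value
    (chi-square approximation with 1 degree of freedom, reliable
    only when b + c is reasonably large)."""
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df, via erfc
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# e.g. 40 tracks only system A got right vs 25 only system B got right
chi2, p = mcnemar(40, 25)
```

When b + c is small (roughly under 25), an exact binomial version of the test is usually recommended instead of the chi-square approximation - that is the main assumption to check.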

Kris West

P.S. > I did put my name down on the list of potential participants. And yes, IMIRSEL did officially respond to your taunts. Finally, you clearly have no idea how much work they have to put into these evaluations, or you would understand why the abstract is late. A poster was produced first (as it had to be printed) and will be displayed at ISMIR.

Kris West said...

Below is a response originally sent to the music-ir and evalfest lists by Cameron Jones of the IMIRSEL team (who I hope doesn't mind me reproducing it):

The recent discussions on Elias Pampalk's
and Paul Lamere's (http://blogs.sun.com/plamere/entry/is_mirex_fair)
blogs about whether or not it is fair for the IMIRSEL team to participate
in the MIREX tasks have upset several members of IMIRSEL, who always
strive for the utmost in fairness, thoroughness, and accuracy. As a
member of IMIRSEL, I can think of at least 4 reasons why IMIRSEL's
submission is legitimate and above-the-bar. It is my hope to dispel
any hint of impropriety that may have been cast by the recent discussions.

1. The features selected were based on the features used by other
labs in previous MIREXes, and other publicly available research.
2. The feature extractors we used were not developed by someone with
direct access to the data.
3. The classifiers we used were standard WEKA implementations.
4. IMIRSEL is the "Systems EVALUATION Lab", not the "Systems
Development Lab", and therefore does not engage in large-scale MIR
system-building activities of any sustained length. This means our
submission could not have been "tuned" as has been claimed.

1. Features were based on previous MIREX submissions.

The feature sets used in the IMIRSEL submission were based on a set of
features developed and used by Mandel and Ellis in MIREX 2005. Mandel
and Ellis had a submission which performed well in both the Audio
Artist Identification and Audio Genre Classification tasks in MIREX
2005. The IMIRSEL submission used several additional psychoacoustic
features based on the dissertation research of Kris West. To me, this
kind of submission embodies the goal of MIREX: the publication of
algorithms and the meaningful comparison of their performance, thus
allowing MIR researchers to make informed, justifiable decisions about
what algorithms to use. IMIRSEL did not systematically compare
possible feature sets against the data. We built one working feature
set and reported the findings. The EXACT SAME feature extractor was
used on ALL of our submissions (for all tasks in which IMIRSEL
participated). Our decision to use this feature set was based on
publicly available information and utilized no insider knowledge.

2. Feature extractors were developed by Kris West

The feature extractors used in the IMIRSEL submission were not
developed by anyone with direct access to this year's submission data.
In total, 2 feature extractors were written using M2K. The first
feature extractor, developed by Andreas Ehmann (a member of the
IMIRSEL lab), had a bug and crashed our servers, and thus did not
generate any meaningful features which could be used. Kris West (an
associate member of IMIRSEL, but not resident in Illinois) developed
the second feature extractor for the IMIRSEL submission. Kris did not
have any knowledge of, or direct access to IMIRSEL's databases when
building the extractors, beyond what was available on the MIREX Wiki
pages (public knowledge of the task definitions). Although Kris is
affiliated with IMIRSEL, he is pursuing his own independent research
agenda, and is not active in the day-to-day operations of the lab, nor
in decisions about data management, beyond those posted to the public Wiki.

3. Classifiers were not "tuned"

The classifiers we used were all standard Weka packages: Weka's KNN
and polynomial SMO classifiers. The Poly SMO classifier was used
with default parameters. The KNN submission was likewise run with
minimal configuration; I believe K was set to 9 because when we used
the default value of 10, it crashed on one of the splits of the N-fold
validation. The belief that we may have iteratively tuned and
optimized our submission is just wrong.
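As a concrete (and hypothetical) picture of what a minimally configured KNN step amounts to, here is a plain-numpy nearest-neighbour classifier with K=9 over song-level feature vectors; the actual submission used Weka's implementation, not this sketch:

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=9):
    """Majority vote among the k nearest training vectors
    (Euclidean distance), i.e. the scheme described above with K=9."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

# Toy two-genre example with well-separated clusters of 9-dim vectors
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 0.5, size=(20, 9))
X1 = rng.normal(3.0, 0.5, size=(20, 9))
train_X = np.vstack([X0, X1])
train_y = np.array([0] * 20 + [1] * 20)
test_X = np.array([[0.0] * 9, [3.0] * 9])
print(knn_predict(train_X, train_y, test_X))  # prints [0 1]
```

With no distance weighting and no parameter search, the only free choice is k itself - which is the sense in which the submission was "not tuned".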

4. Our passion is evaluation.

Overall, IMIRSEL is about evaluation, not algorithm development. While
it is true that IMIRSEL is responsible for the development of M2K, we
do not usually spend our days thinking about how to develop new
algorithms, approaches, or feature sets. What we do spend a lot of
time working on is improving the design of MIREX, the design and
selection of tasks, the evaluation metrics we use in MIREX, and the
validity of the results we obtain. We spend a lot of our time looking
over past MIREX results data and interpreting it, looking for patterns
and anomalies, and overall trying to make sure that MIREX is being
executed to the best of our abilities.

Because of this, we did not have an "in house" submission lying around
that we could have submitted. We were not working on our submission
for months beforehand, carefully selecting the feature sets, tuning
the classifier parameters, etc. We were too busy preparing for, and
then running MIREX. Rather, our submission this year was an attempt to
demonstrate to the community the power and flexibility of some new M2K
modules which integrate existing music and data mining toolkits, like
Weka and Marsyas. M2K presents a robust, high-speed development
environment for end-to-end MIR algorithm creation. IMIRSEL's
submission was supposed to be an also-ran, developed in response to a
challenge from Professor Downie to see "what could be hacked together
in M2K, QUICKLY!!!! We do not have a lot of time to fuss.". In
reality, the IMIRSEL submission was built in one evening.

Finally, as has been stated repeatedly, MIREX is not a competition and
there are no "winners". So, rather than wasting time arguing about
what is fair or not, we should be using this opportunity to learn
something. Why is it that IMIRSEL's algorithm performed as well as it
did (which is not, keep in mind, necessarily a statistically
significant performance difference from the next highest scores)?

M. Cameron Jones

Graduate Research Assistant
International Music Information Retrieval Systems Evaluation Lab

PhD Student
Graduate School of Library and Information Science