I’ve added the paper to my list of publications; there are links to the PDF and to the demonstration videos.
The paper was 6 pages long when we first submitted it and was accepted as a 4-page version in the proceedings. (For the final version we had to shorten almost everything a bit. The biggest part we dropped was a comparison of the interface to a simplified version which was closer to the simple Google search interface. When reading the reviews it’s important to keep in mind that the reviewers were reading a rather different version of the paper than the one which is online now.) I’ve added my remarks to the reviews in italics.
I really like receiving feedback on my work, and usually conference and journal reviews are a wonderful source of it. However, one thing I missed in the reviews the program chairs sent me was the final paper length each reviewer recommended (there was a field for that in the review form). Maybe next year they could improve this.
Btw, I would like to thank reviewers 1 and 2 as they have helped improve the quality of the paper. Reviewer 3 seems to have forgotten to write something. Reviewer 4 has helped a bit, but pissed me off a bit more. However, others have told me that there is nothing wrong with his or her review. I guess my view on this is not very objective ;-)
+++ Strong Accept, ++ Accept, + Weak Accept
--- Strong Reject, -- Reject, - Weak Reject
Reviewer 1: Detailed comments
Overall: Cool interface, and a decent user study. In general, I would prefer to see a more task-specific evaluation (how long does it take a user to find music they like? do they succeed?) and, for specific features, how often they are used. But the self-reporting survey is a good start.
It’s always nice when a reviewer starts the review with something positive :-)
Regarding “how long does it take a user to find music they like?”: I think an interface to explore music is closer to a computer game than a tool to get work done. For games, measuring how much fun the users are having is way more important than measuring how long it takes to get something done (which is one of the main criteria for evaluating tools). Nevertheless, I agree with the reviewer: the evaluation we provided is not as thorough as it could be. (Although this has been by far the most extensive evaluation of a user interface I’ve ever conducted.)
My biggest criticism is that there isn't a true baseline; users are only comparing two different versions of the author's system, and not comparing against, say a simple web search, or a competing tool like collaborative filtering.
This comparison the reviewer is referring to has been removed because we didn’t have enough space.
Regarding baselines: comparing our system against state-of-the-art recommendation engines such as the one from Last.fm wouldn’t have been a fair comparison either. We thought that since the main contributions of our paper are the new interface features, a useful evaluation would have been to remove those new elements and see what the users think. I’d be very interested to get some more advice on how to better evaluate user interfaces to explore music.
- “music” and “review” as constraints for finding relevant web docs: for many ambiguous band names, this doesn't perform very well. For instance, for the band “Texas”, the query “texas music review” brings up many irrelevant pages. this is a hard problem, and for research systems probably not too important to worry about, but it may be worth mentioning.
Excellent point. My only excuse is that we didn’t have enough space to discuss everything.
- how do you generate the vocabulary lists? if it's just an ad-hoc manual process, please mention that, and perhaps suggest other ways to make it more principled and to evaluate changes to these lists.
Good question. I think we’re a bit more specific on this in the 4-page version, but maybe it got squeezed out at the end. Automatically generating such vocabularies would be really interesting, but I wouldn’t know how to do that (without using, e.g., Last.fm tag data).
- Similarly, why did you choose 4 vocabularies vs some other number?
I think we didn’t have space to explain this in the 4 page version.
The number of vocabularies is more or less random. The vocabularies are based on previous work (we used them in an ISMIR’06 paper, and before that in an ECDL’05 paper). However, there’s an upper limit on the number of vocabularies that would make sense to use, and the users seemed fine with the 4 we used. (Of course it would be really nice to adapt it to different languages as well.)
- not all readers will know what “tf-idf” is. please explain or provide a reference.
We added a more explicit reference in the final version. I find things like these really hard to notice. I talk about tf-idf all the time and just started assuming that the whole world talks about tf-idf all the time, too. :-)
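For anyone who hasn’t run into the term before, here’s a minimal tf-idf sketch in Python. The toy corpus and term names are my own illustration, not anything from the paper:

```python
import math

# Toy corpus: each document is a bag of words (e.g. web pages about an artist).
docs = [
    ["guitar", "rock", "guitar", "band"],
    ["jazz", "piano", "band"],
    ["rock", "band", "tour"],
]

def tf_idf(term, doc, docs):
    # Term frequency: how often the term occurs in this document.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that are rare across the corpus
    # get a higher weight; ubiquitous terms get a weight near zero.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "band" appears in every document, so its idf (and weight) is 0;
# "guitar" is frequent in doc 0 but rare overall, so it scores high there.
print(tf_idf("band", docs[0], docs))        # 0.0
print(tf_idf("guitar", docs[0], docs) > 0)  # True
```

In practice there are many weighting variants (log-scaled tf, smoothed idf, length normalization), but this is the core idea.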
- table 3: unnecessarily confusing to introduce the “L” and “R”; just call them Easy and Hard, and explain that you grouped the top/bottom 3 points in a 7-point scale. and instead of “?”, label it “No Answer” or something
Again, something that’s hard to notice if you are too deep into the material. Thanks to this reviewer’s comment the respective table should be more understandable.
- rather than the self-reported results about which optional features the user found useful, i think a better eval would be a count of actually how many times the user used them.
Actually, the usage of features wasn’t something the users reported themselves. I was sitting next to them and taking notes while they were using the interface. Thinking about it now, I realize that maybe even in the final version this might not be clear enough :-/
However, automatically counting how often functions were used would have been better. Unfortunately, I didn’t store the log files for all users because I had some technical problems (and I thought that making notes while watching them use the interface would be sufficient). Next time...
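Counting feature usage is easy if every UI action is funnelled through a small logger. A sketch of what I mean (the feature names are invented for illustration, not MusicSun’s actual events):

```python
from collections import Counter

class UsageLog:
    """Counts how often each interface feature is triggered in a session."""

    def __init__(self):
        self.counts = Counter()

    def record(self, feature):
        # Call this from every UI event handler.
        self.counts[feature] += 1

    def report(self):
        # Features sorted from most- to least-used.
        return self.counts.most_common()

# Simulated session: the user plays two songs, adjusts a weight slider,
# and opens the descriptive-words view once.
log = UsageLog()
for event in ["play", "adjust_weight", "play", "show_words"]:
    log.record(event)

print(log.report())  # [('play', 2), ('adjust_weight', 1), ('show_words', 1)]
```

Writing `log.counts` to disk at the end of each session would have given exactly the per-feature counts the reviewer asked for.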
Reviewer 2: Detailed comments
This is a well written paper... and it demonstrates a nice system for showing users new songs.
Again, it’s really nice when a reviewer starts a review with something positive.
This paper largely walks through one design, a very nice looking design, but how does this system compare to other systems? I found the evaluation in this paper weak.
How does the part of this work that combines recommendations compare to the work of Fagin (Combining Fuzzy Information from Multiple Systems)? Would that be a better approach?
I’m not familiar with Fagin. (But fuzzy combinations sound interesting.) Regarding the reviewer’s criticism of the evaluation, I guess it is in line with Reviewer 1’s.
UPDATE: Klaas has posted a link to a really nice introduction to aggregation operators in the comments.
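I haven’t read Fagin’s paper, but the basic idea behind aggregation operators is easy to sketch: given per-source scores for an item, different operators combine them in different ways. The source names and scores below are made up for illustration:

```python
# Scores for one song from three hypothetical sources, each in [0, 1].
scores = {"audio_similarity": 0.8, "web_text": 0.6, "collaborative": 0.9}

values = list(scores.values())

# Fuzzy AND (a t-norm): the item is only as good as its weakest source.
t_norm = min(values)

# Fuzzy OR (a t-conorm): one strong source is enough.
t_conorm = max(values)

# Weighted average: a compromise, with per-source importance weights.
weights = {"audio_similarity": 0.5, "web_text": 0.2, "collaborative": 0.3}
weighted = sum(weights[k] * v for k, v in scores.items())

print(t_norm, t_conorm, round(weighted, 2))  # 0.6 0.9 0.79
```

Which operator is “better” depends on the task; that’s exactly the kind of question the reviewer is raising.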
I really wanted to know which similarity approach worked best. This paper doesn't address that issue.
This was beyond the scope of our paper. But it surely would have been very interesting to evaluate the different individual similarities we used :-/
Testing UI design is hard.. one needs a task and then lots of users. Can you do this?
Yes, it is hard :-)
And no, it doesn’t seem like we did a good job :-/
Reviewer 3: Detailed comments
Unfortunately this reviewer didn't explain why he or she gave us such high scores.
Reviewer 4: Detailed comments
The paper addresses an interesting issue, recommendation systems and interfaces to support them. I found the idea of using multiple information sources very interesting, and potentially useful.
Again, it’s always nice when a reviewer finds something positive to start with. However, the idea of using multiple information sources for recommendations isn’t new, and I don’t think my co-author and I can take the credit for it. And I don’t understand how someone can say that combining different sources of information is only “potentially” useful. Even if I close both of my eyes I can clearly see that there’s no way around that :-)
The major problem that I have with the paper is the experimental design: I am not quite sure what is being evaluated. Is it the recommendation system interface or the underlying software used to create the recommendations? If it is the former, which I think it is, then the design of the experiment seems to confound many issues.
I think it’s difficult to separate the two. It’s not really possible to evaluate the user interface without considering limitations of the underlying recommendation system. The way the user interface deals with these limitations and presents them to the user is a very critical aspect of systems using state-of-the-art content-based algorithms. This is something we explicitly dealt with in the long version of the paper and briefly mention in the short version (e.g. the indicators for how reliable the system thinks the recommendations are). Furthermore, the recommendation system and the interface are very closely linked to each other (e.g. the way the users are given the option to adjust the aspect of similarity they are most interested in).
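That coupling, letting the user adjust which aspect of similarity matters, boils down to a weighted combination of per-aspect similarities. A sketch of the idea (the aspect names and numbers are illustrative, not MusicSun’s actual implementation):

```python
def combined_similarity(aspect_scores, user_weights):
    """Combine per-aspect similarities using slider weights set in the UI."""
    total = sum(user_weights.values())
    if total == 0:
        return 0.0
    # Normalize by the weight sum so the result stays in [0, 1].
    return sum(aspect_scores[a] * w for a, w in user_weights.items()) / total

# Similarity of one candidate song to the seed song, per aspect.
aspects = {"timbre": 0.7, "rhythm": 0.4, "web_terms": 0.9}

# The user has slid "web_terms" up and "rhythm" down.
weights = {"timbre": 1.0, "rhythm": 0.2, "web_terms": 2.0}

print(round(combined_similarity(aspects, weights), 3))  # 0.806
```

Because the UI sets the weights and the backend computes the aspect scores, evaluating one in isolation from the other is genuinely hard, which was my point above.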
But then again, as Reviewer 1 and 2 have already pointed out, there are limits to the evaluation we present in our paper.
For example, the authors do not control for user expertise, nor do they control for system issues (e.g., the database not being large enough to provide a user with the song he or she is seeking).
We gathered and analyzed lots of statistics on the users’ expertise (in terms of using computers, music interfaces, and musical knowledge), music taste, and general music listening & discovery habits, but didn’t include everything because we ran out of space. Nevertheless, we allocated some space in the 4-page version to describe the participants in more detail.
Moreover, conclusions like, and I'm paraphrasing, "the users say they would use it again" are, for the most part, without any normative value.
We never claimed that such a conclusion is normative. Of course we measured user satisfaction in different ways (direct and indirect), but most of the evaluation part of our paper deals with parts of the interface we thought users would like/understand/use but (surprisingly) didn’t. We believe the contribution of our evaluation is to point out a number of directions for future work.
Under what circumstance would they use it (e.g., if they were paid to evaluate it)? It is a stretch to conclude they would use the system if it were part of Amazon-- part of Amazon in what way; in comparison to what; etc.?
I’m really confused by the reviewer’s remarks. Just because the users said (when asked) that they would like to use it doesn’t mean that they really would, and we never drew this conclusion. Instead we’ve pointed out several limitations of the interface (in the longer version even more). We never tried to market MusicSun as a finished system, but rather as a prototype from which there is something to learn.
In addition, the paper is rife with typos and stylistic problems (e.g., citations are not a part of speech), and the reference section relies quite heavily on the authors' own work.
It would have been nice if the reviewer had explicitly mentioned that this is not why he voted for a weak reject. Furthermore, there are nicer ways of putting this. Neither my co-author nor I am a native speaker. It would have been more helpful if the reviewer had pointed out some of the typos.
Regarding the self-citations: we cited everything we thought was relevant to understanding the work we presented. Most of the interface is built on techniques we previously used; we didn’t have room to describe everything, so we referenced it instead. It would have been more helpful if the reviewer had pointed us to references that are missing or unnecessary.
Btw, check out these links my colleague Norman Casagrande pointed me to (both from the legendary PhD Comics series):
- Paper review worksheet
- Addressing reviewers comments