Jeff Hammerbacher of Facebook’s data and analytics team gave an interesting talk, which can be viewed here (in case that link is broken, here is a direct link). I’d highly recommend it to anyone curious about what happens behind the scenes at websites dealing with a lot of interesting data.
Some of what he talks about is related to what happens at Last.fm; for example, we also use Hadoop.
I found the part around minutes 33-38 most interesting, where he talks about the skills needed to work with their data. Being able to write code is a must, and most of the people on his team write more code than they did in their previous positions. He also notes that very simple statistical tools, such as curve fitting and understanding statistical significance, can solve most of their learning-from-data challenges, and that visualizing data is very important (e.g. to identify and understand outliers).
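To give a flavour of how far "understanding statistical significance" gets you: checking whether an observed improvement between two variants is real is often just a two-proportion z-test. Here is a minimal sketch (my own illustration, not anything from the talk); the sample counts are made up.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z statistic, p-value). Uses the pooled proportion
    under the null hypothesis that both rates are equal.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: probability of a |z| this large by chance.
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Hypothetical example: did a change lift a click rate from 10% to 13%?
z, p = two_proportion_ztest(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p < 0.05, so the lift looks real
```

Nothing fancier than an introductory statistics course, which is exactly the point being made.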
Obviously music recommendation is a much more complex problem than any of the challenges Facebook is facing. Scrobbles, tags, skipping behaviour, etc. require very different treatment than the data Facebook gathers. Or maybe not?
Some of the most interesting things I’ve had the pleasure of working on at Last.fm involved only very basic statistical techniques: non-linear curve fitting and measuring the significance of improvements. However, while the “machine learning” parts could hardly be any simpler, the complexity of dealing with terabytes is something else entirely.
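For the curious, the curve-fitting side really can be a handful of lines. A sketch of non-linear least-squares fitting with SciPy, using a made-up exponential-decay model and synthetic data (none of this is Last.fm code or data):

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b, c):
    """Hypothetical model: exponential decay towards a floor c."""
    return a * np.exp(-b * x) + c

# Synthetic data generated from known parameters, so we can check the fit.
true_a, true_b, true_c = 2.5, 1.3, 0.5
x = np.linspace(0, 4, 50)
y = decay(x, true_a, true_b, true_c)

# Fit the model; p0 is a rough initial guess for the parameters.
params, covariance = curve_fit(decay, x, y, p0=(2.0, 1.0, 0.5))
a_fit, b_fit, c_fit = params
print(f"a={a_fit:.3f}, b={b_fit:.3f}, c={c_fit:.3f}")
```

The hard part in practice is not this call but getting terabytes of raw data into the clean `x` and `y` arrays it expects.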
Btw, at Last.fm we are hiring someone to work on data and analytics, and we also have a position related to data warehousing. Both positions involve challenges very closely related to the stuff Jeff talks about. Except that the data we have is a lot more interesting! It made me a bit sad to hear that one of the things they were actually looking at is communication streams between universities… you’d think they’d have far more interesting insights to gain? ;-)