Monday 25 August 2008

Getting Last.fm Tags for MP3s with Python

Paul (who is already distributing a large chunk of Last.fm tags) and I are planning to include a few slides in our ISMIR tutorial on how to obtain tag data.

Below is some Python code that takes an MP3 file as input and outputs a list of Last.fm tags (for both artist and track). The MP3s don't need correct ID3 tags, but they must be full-length (clips won't work).

The Python code uses Norman's command-line fingerprinting client to find the correct artist and track name. The path to the executable needs to be set in the code (FP_CLIENT_PATH). Norman's client supports Win32, OS X (Intel), and 32-bit Linux.
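To see what the fingerprinting step extracts, the URL-matching part can be tried on its own. This is only a sketch: it is written for Python 3 (the listing further down is Python 2), the regular expression is the one used in getArtistTrack, and the sample line is an assumed example of the client's output, not captured from a real run. The names parse_artist_track and URL_PATTERN are made up for this illustration.

```python
import re
from urllib.parse import quote

# Same pattern as in getArtistTrack: artist and track are the two path
# segments on either side of "/_/" inside the <url> element.
URL_PATTERN = re.compile(r'<url>.*/([^/]+)/_/(.+)<')

def parse_artist_track(line):
    """Return (artist, track), URL-quoted, or None if the line has no match."""
    mo = URL_PATTERN.search(line)
    if mo is None:
        return None
    return quote(mo.group(1)), quote(mo.group(2))

# Hypothetical output line, for illustration only.
sample = '<url>http://www.last.fm/music/Radiohead/_/Karma+Police</url>'
print(parse_artist_track(sample))  # ('Radiohead', 'Karma%2BPolice')
```

Note that the '+' taken from the URL is quoted again (as %2B), which mirrors what the original code does with urllib.quote.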

The output is written to a file. For each MP3 file passed as argument there are up to two rows in the output file: one for the artist tags, and one for the track tags. Each row has the format: "<mp3filename> <encoded artist or artist/track name> <tag> <score> [<tag> <score> ...]". Tabs are used as delimiters.
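For anyone who wants to read the output file back in, a row in this format could be parsed roughly as follows. This is a minimal Python 3 sketch, not part of the original script: parse_tag_row is a hypothetical helper, the sample rows are invented, and scores are left as strings rather than converted to numbers.

```python
def parse_tag_row(line):
    """Split one tab-delimited row into (mp3, item, [(tag, score), ...])."""
    fields = line.rstrip('\n').split('\t')
    mp3_file_name, item = fields[0], fields[1]
    rest = fields[2:]
    # A row without any tags ends in a single empty field.
    if rest == ['']:
        rest = []
    # Remaining fields alternate: tag, score, tag, score, ...
    pairs = list(zip(rest[0::2], rest[1::2]))
    return mp3_file_name, item, pairs

# Invented sample rows, for illustration only.
row = 'a.mp3\tRadiohead\trock\t100\talternative\t87\n'
print(parse_tag_row(row))
```

To process a whole file, the helper can be applied to each line, e.g. `[parse_tag_row(l) for l in open(outFile)]`.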

The data from the Last.fm API is available under the Creative Commons Attribution-Non-Commercial-Share Alike License.

Btw, special thanks to Eric Casteleijn for various Python recommendations (lxml etc.). (Which reminds me that I still need to fix the other Python code I posted.) As usual, any feedback is much appreciated.

import subprocess, sys, re, time, urllib
from lxml import etree

FP_CLIENT_PATH = '"C:\\fpclient\\lastfmfpclient.exe"'
MAX_RETRIES_URL_OPEN = 5

def getArtistTrack(mp3FileName): # ret: (artist, track) or None
    # quote the file name so that paths with spaces survive
    command = FP_CLIENT_PATH + ' "' + mp3FileName + '"'
    pipe = subprocess.Popen(command, \
        stdout=subprocess.PIPE).stdout
    for line in pipe:
        # artist and track are the URL segments around "/_/"
        mo = re.search('<url>.*/([^/]+)/_/(.+)<',line)
        if mo:
            return urllib.quote(mo.group(1)), \
                urllib.quote(mo.group(2))
    print "ERROR: failed to get artist/track for: " + \
        mp3FileName
    return None

def crawlTags(url): # ret: [(tag, count), ...]
    for i in xrange(MAX_RETRIES_URL_OPEN):
        tagCounts = []
        time.sleep(1) # be nice!
        try:
            root = etree.parse(
                urllib.urlopen(url)).getroot()
        except IOError:
            print "(%d/%d) Failed trying to get: %s." % \
                (i + 1, MAX_RETRIES_URL_OPEN, url)
        else:
            for tag in root.iter('tag'):
                tagCounts.append(
                    (tag.find('name').text, \
                    tag.find('count').text))
            return tagCounts
    return [] # all retries failed: no tags for this item

def tags(prefix, items, outStream): # crawl and write
    for mp3FileName, item in items:
        url = prefix + item + '/toptags.xml'
        print url
        tagCounts = crawlTags(url)
        outStream.write('%s\t%s\t%s\n' %
            (mp3FileName, item, '\t'.join(
                tag + '\t' + str(count)
                for tag, count in tagCounts)))

def main():
    if len(sys.argv)<3:
        print 'USAGE: python getTags.py ' + \
            '<outFile> <f1.mp3> [<f2.mp3> ...]'
        sys.exit(2)
    outFile = sys.argv[1]
    mp3FileNames = sys.argv[2:]
    artists = set()
    artistTracks = set()
    for mp3FileName in mp3FileNames:
        print 'Fingerprinting: ' + mp3FileName
        result = getArtistTrack(mp3FileName)
        if result is None: # fingerprinting failed: skip file
            continue
        artist, track = result
        artists.add((mp3FileName, artist))
        artistTracks.add((mp3FileName,
            artist + '/' + track))

    print 'start crawling tags'
    o = open(outFile,'w')
    tags('http://ws.audioscrobbler.com/1.0/artist/', \
        artists, o)
    tags('http://ws.audioscrobbler.com/1.0/track/', \
        artistTracks, o)
    o.close()

if __name__ == "__main__":
    main()
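The tag-extraction step inside crawlTags can also be tried without the network or lxml. The following is a rough Python 3 sketch using only the standard library. It assumes, as the listing above does, that each tag element has name and count children; the sample XML is invented for illustration and is not a real API response, and parse_top_tags is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def parse_top_tags(xml_text):
    """Return [(tag, count), ...] from a toptags-style XML document."""
    root = ET.fromstring(xml_text)
    tag_counts = []
    for tag in root.iter('tag'):
        name = tag.findtext('name')
        count = tag.findtext('count')
        # Skip entries that lack either child instead of failing.
        if name is None or count is None:
            continue
        tag_counts.append((name, count))
    return tag_counts

# Invented sample document, for illustration only.
sample_xml = (
    '<toptags>'
    '<tag><name>rock</name><count>100</count></tag>'
    '<tag><name>indie</name><count>42</count></tag>'
    '<tag><name>incomplete</name></tag>'
    '</toptags>'
)
print(parse_top_tags(sample_xml))  # [('rock', '100'), ('indie', '42')]
```

Counts are kept as strings here, just as the original code takes them from the XML text.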
