How to Read Remote MP3 File Info Without Downloading Entire File
I'd like to share a technique I came up with when I needed to index a lot of MP3 podcasts located on remote servers. Downloading the entire files would have taken weeks, since there were tens of thousands of them.
The example is in Python, but any other programming language could be used to perform these steps.
In a nutshell, this is what needed to be done:
- Read ID3 tags
- Get podcast duration
- Get file size
- Get MP3 bitrate
I managed to solve this by reading only the first 40 KB of a file and, if required, the last 128 bytes for ID3v1 tags.
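For reference, an ID3v1 tag is a fixed 128-byte block at the very end of the file that starts with the marker `TAG`, which is why the last 128 bytes are enough. Below is a minimal sketch of decoding such a block by hand. The helper name `parse_id3v1` is made up for illustration, and the Latin-1 decoding is an assumption, since ID3v1 does not declare an encoding. In this article Mutagen does this work instead.

```python
import struct


def parse_id3v1(block):
    """Decode a 128-byte ID3v1 block; return None if it is not a tag.

    Hypothetical helper, for illustration only.
    """
    if len(block) != 128 or block[:3] != b"TAG":
        return None
    # Layout: "TAG", title(30), artist(30), album(30), year(4),
    # comment(30), genre(1) = 128 bytes
    _, title, artist, album, year, _comment, genre = struct.unpack(
        "<3s30s30s30s4s30sB", block)

    def text(raw):
        # Fields are padded with NUL bytes or spaces.
        # Assumption: text is Latin-1 encoded.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "genre": genre,  # numeric genre index
    }
```

If the last 128 bytes do not start with `TAG`, the file simply has no ID3v1 tag and the function returns `None`.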
The process is divided into 6 logical steps:
- Download the first 40 KB
- Save them to a temporary file
- Read the real file length from the HTTP response
- Resize the temporary file to that length
- Seek to 128 bytes before the end of the temporary file and write the last 128 bytes of the remote file there, fetched with a Range request
- Finally, read all the information: ID3 info, bitrate, duration
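Reading the real file length depends on how the server answers. A `206 Partial Content` response carries a `Content-Range` header such as `bytes 0-40000/3796992`, where the number after the slash is the size of the complete file. A small sketch of extracting it, using a hypothetical helper named `total_size_from_content_range`:

```python
def total_size_from_content_range(value):
    """Return the total size from a Content-Range value, or None if unknown.

    Hypothetical helper, for illustration only.
    """
    if not value or "/" not in value:
        return None
    total = value.rsplit("/", 1)[1].strip()
    # A server may send "*" when the complete length is unknown
    return int(total) if total.isdigit() else None


print(total_size_from_content_range("bytes 0-40000/3796992"))  # 3796992
```

With a plain `200` response there is no `Content-Range`, and the `Content-Length` header already holds the full size.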
To read the ID3 info and calculate the bitrate and duration, I used a nice library called Mutagen.
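Duration is the one value that depends on the real file length. For a constant-bitrate MP3 without a VBR header, it can only be estimated as the number of audio bytes times 8, divided by the bitrate, which is presumably why the temporary file gets resized in step 4. A rough sketch of that arithmetic, using a hypothetical helper `estimate_duration`, the example file size from this article, and an assumed bitrate of 128 kbit/s:

```python
def estimate_duration(file_size_bytes, bitrate_bps, tag_bytes=0):
    """Estimate the duration in seconds of a constant-bitrate MP3.

    Hypothetical helper, for illustration only. tag_bytes is the space
    taken by ID3 tags, which carries no audio.
    """
    audio_bytes = file_size_bytes - tag_bytes
    return audio_bytes * 8 / bitrate_bps


# Full-size file: 3,796,992 bytes at 128 kbit/s
print(estimate_duration(3796992, 128000))  # 237.312

# Only the first 40,000 bytes, not resized
print(estimate_duration(40000, 128000))    # 2.5
```

Without resizing, a four-minute episode would look like it lasts two and a half seconds.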
First I will illustrate how to download the data we need and create a temporary MP3 file containing just that data.
```python
import http.client
from urllib.parse import urljoin, urlparse

USER_AGENT = 'podcast-fetcher-example'
MAX_BYTES = 40000  # how much of the start of the file to fetch


def openConnection(parts):
    """Open an HTTP or HTTPS connection matching the URL scheme."""
    if parts.scheme == 'https':
        return http.client.HTTPSConnection(parts.netloc, timeout=30)
    return http.client.HTTPConnection(parts.netloc, timeout=30)


def downloadMp3(mp3url, filename='temp.mp3', redirects=5):
    parts = urlparse(mp3url)
    # Keep the query string, some hosts need it
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    headers = {
        'User-Agent': USER_AGENT,
        'Accept': 'audio/mpeg, */*',
        'Range': 'bytes=0-%d' % (MAX_BYTES - 1),
    }
    conn = None
    try:
        conn = openConnection(parts)
        conn.request('GET', path, headers=headers)
        response = conn.getresponse()
        status = response.status

        # Handle redirects by retrying with the new location
        if status in (301, 302, 303, 307, 308):
            newurl = response.getheader('location')
            if not newurl or redirects <= 0:
                return False
            return downloadMp3(urljoin(mp3url, newurl), filename, redirects - 1)

        # To read the real file size
        content_len = 0
        if status == 200:
            # Range is not supported and the server sends the whole file;
            # we still read only the first MAX_BYTES of it
            if response.getheader('content-length') is not None:
                content_len = int(response.getheader('content-length'))
        elif status == 206:
            # Range response, e.g. 'content-range: bytes 0-39999/3796992'
            field = response.getheader('content-range', '')
            total = field[field.find('/') + 1:]
            if total.isdigit():
                content_len = int(total)
        else:
            # We are not handling errors here
            return False

        # Size is not big enough to calculate the bitrate
        if content_len < MAX_BYTES:
            return False

        data = response.read(MAX_BYTES)
        conn.close()  # stop here, do not download the rest

        # Create the temporary file ('wb' empties it if it already exists)
        with open(filename, 'wb') as f:
            f.write(data)
            # If partial requests are supported, read the last 128 bytes for ID3v1
            if status == 206:
                conn = openConnection(parts)
                # Range: bytes=-128 asks for the last 128 bytes
                headers['Range'] = 'bytes=-128'
                conn.request('GET', path, headers=headers)
                response2 = conn.getresponse()
                tail = response2.read(128) if response2.status == 206 else b''
                if len(tail) == 128:
                    # Put the block where it sits in the real file
                    f.seek(content_len - 128)
                    f.write(tail)
            # Resize the file to the real length
            f.truncate(content_len)
    except Exception as msg:
        print('Error downloading info: ' + str(msg))
        return False
    finally:
        if conn is not None:
            conn.close()
    return True
```
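The resizing step can be checked in isolation, without any network access. The sketch below uses a throwaway temporary file and made-up sizes. The helper name `make_padded_file` is hypothetical. It shows that writing after a seek past the end leaves a gap (zero-filled on most systems), and that `truncate(size)` can grow a file without writing any data.

```python
import os
import tempfile


def make_padded_file(path, head, total_size, tail=b""):
    """Write head at the start and tail at the end of a file of total_size bytes.

    Hypothetical helper, for illustration only.
    """
    with open(path, "wb") as f:
        f.write(head)
        if tail:
            # Seeking past the current end is allowed; the gap is
            # filled in once something is written after it
            f.seek(total_size - len(tail))
            f.write(tail)
        # Make the file exactly total_size bytes long
        f.truncate(total_size)


path = os.path.join(tempfile.mkdtemp(), "demo.mp3")
make_padded_file(path, b"\xff" * 1000, 5000, tail=b"TAG" + b"\x00" * 125)
print(os.path.getsize(path))  # 5000
```

Note that a seek on its own does not change the file size. Only a write or a `truncate` call does.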
Having all the data we need, we read the meta info with Mutagen.
Note that Mutagen returns tag values as Unicode strings.
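```python
import os

from mutagen.easyid3 import EasyID3
from mutagen.mp3 import MP3

if downloadMp3(url):
    audio = MP3("temp.mp3", ID3=EasyID3)
    # Tag values are lists of Unicode strings; take the first one, if any.
    # Call .encode(...) on them only if byte strings are needed.
    title = audio.get("title", [""])[0]
    artist = audio.get("artist", [""])[0]
    album = audio.get("album", [""])[0]
    bitrate = audio.info.bitrate            # bits per second
    duration = round(audio.info.length)     # seconds
    # The temporary file was resized to the real length
    filesize = os.path.getsize("temp.mp3")  # bytes
```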
Et voilà, we have all the info we needed while spending only a little more than 40 KB of bandwidth.