I’m going to change the way file caching works in get_enclosures. Right now, when a file is downloaded, the current timestamp is saved against it. Before any file is downloaded, the script checks whether a timestamp already exists for it, and if so, skips the download. I’ve realized this is too simplistic: while thinking about something else that I thought would be cool, I hit on a use case that breaks it. But first, a digression.
With all this talk of the “iPod platform”, it’s worth mentioning that for over two years now I’ve been saving off the MP3 files from the WREK streaming archives for specific shows. I would then burn them to CD and listen to them offline. I did this with custom scripts and Windows scheduled tasks. It occurred to me that this could easily reuse all of the new infrastructure. It would be quite simple to create a cron task that writes out an RSS feed with enclosures for the various programs on that station. Then the get_enclosures script could just download them while it was doing its thing anyway.
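As a rough sketch of what such a cron task might write out, here is a minimal RSS 2.0 feed with a single enclosure item. It is in Python purely for illustration and is not the script I run. The show title, the example.org URL and the `build_feed` name are all made up, and the enclosure length is left at 0 on the assumption that the size isn’t known ahead of time.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

def build_feed(show_title, archive_url, air_time):
    """Return an RSS 2.0 document (as a string) with one enclosure item."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = show_title
    ET.SubElement(channel, "link").text = archive_url
    ET.SubElement(channel, "description").text = "Weekly archive of " + show_title

    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = f"{show_title}, {air_time:%Y-%m-%d}"
    # pubDate is written in the RFC 822 style that RSS 2.0 uses.
    ET.SubElement(item, "pubDate").text = format_datetime(air_time)
    # Assumption: the file size is unknown in advance, so length is 0.
    ET.SubElement(item, "enclosure", url=archive_url, length="0", type="audio/mpeg")
    return ET.tostring(rss, encoding="unicode")

# Hypothetical show and URL, for illustration only.
feed_xml = build_feed(
    "Example Show",
    "http://example.org/archive/fri2000.mp3",
    datetime(2004, 10, 1, 20, 0, tzinfo=timezone.utc),
)
print(feed_xml)
```

Run weekly, the task would write the same enclosure URL each time and only the item title and pubDate would change.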
Here’s where the mechanism described in paragraph 1 falls apart: the URL for the MP3 archive of a given half-hour of programming is the same every week. With the existing mechanism, that URL would be downloaded once and only once, the first time the script ran. Every subsequent run would see it as a URL that has already been downloaded and skip it. Damn, so close yet so far.
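To make the failure concrete, here is a small Python sketch of the existing rule. It is an illustration only, not the actual get_enclosures code: the dictionary cache, the function names and the example.org URL are all assumptions.

```python
import time

def should_download_old(cache, url):
    """Existing rule: download only if no timestamp is stored for the URL."""
    return url not in cache

def record_download_old(cache, url):
    """Store the current time against the URL once it has been fetched."""
    cache[url] = time.time()

cache = {}
url = "http://example.org/archive/fri2000.mp3"  # hypothetical weekly URL

week1 = should_download_old(cache, url)  # first run: nothing cached yet
record_download_old(cache, url)
week2 = should_download_old(cache, url)  # a week later: same URL, new audio

print(week1, week2)  # True False
```

The second week’s file is never fetched, even though its contents have changed.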
Here’s how that can be fixed, in a way that perhaps makes things more robust in all cases. The RSS 2.0 spec defines an optional pubDate element for each item. I’ve altered the caching mechanism to store this value rather than the current timestamp. Then, when deciding whether to get a file, the script compares the pubDate of the item in the current feed against the one in the cache. If the feed’s date is newer than the cached one, it gets the file again. This handles the WREK situation, where the file name and URL are reused every week while the contents of the file are rewritten with the new week’s stream. As long as the pubDate is set to the correct value for that week when the RSS 2.0 feed is assembled, everything will work out. Conceivably, this also allows redownloading a file that was edited and republished with everything else the same but the pubDate updated to the new publish time.

Because these are textual times, I wrote a simple function that compares two RFC 822 dates and finds out which is the earlier. For each download URL, the date from the item is what gets stored, compared and used. There are better, more robust ways, such as using Date::Manip, but I don’t want to require people to install any more modules than they already have to. In fact, I might think about getting rid of the dependence on XML::Simple.
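The pubDate check can be sketched like this. It is in Python for illustration rather than the Perl the script is written in, and it leans on the standard library’s date parser instead of a hand-rolled comparison. The dictionary cache, the function names and the example.org URL are assumptions, as is treating a date with no usable time zone as UTC.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def parse_pubdate(text):
    """Parse an RFC 822 date string into a timezone-aware datetime."""
    dt = parsedate_to_datetime(text)
    if dt.tzinfo is None:
        # Assumption: treat dates without a usable zone as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt

def should_download(cache, url, feed_pubdate):
    """Download if the URL is unseen or the feed's pubDate is newer."""
    cached_pubdate = cache.get(url)
    if cached_pubdate is None:
        return True
    return parse_pubdate(feed_pubdate) > parse_pubdate(cached_pubdate)

cache = {}
url = "http://example.org/archive/fri2000.mp3"  # hypothetical weekly URL
week1_date = "Fri, 01 Oct 2004 20:00:00 GMT"
week2_date = "Fri, 08 Oct 2004 20:00:00 GMT"

first = should_download(cache, url, week1_date)   # unseen URL
cache[url] = week1_date                           # store the item's pubDate
repeat = should_download(cache, url, week1_date)  # same feed, rerun
second = should_download(cache, url, week2_date)  # next week's feed

print(first, repeat, second)  # True False True
```

Rerunning against an unchanged feed still skips the file, while the next week’s pubDate triggers a fresh download of the same URL.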
This updated mechanism will be part of the 0.2 release. I will also pick a WREK show or three to prepare these experimental feeds for. If the folks behind those shows like it and want to run it themselves, I’ll let them have it and they can put it on their own site.