Looking at the server logs for this weblog a few minutes ago, I noticed the Everest crawler from Vulcan (which namechecks owner Paul Allen on that page) downloading pages. I also noticed that the requests were distressingly frequent, fetching pages of 30K and larger every few seconds. I filled out their feedback form, letting them know that I have denied them at the server level for as long as they impose this high a resource burden. I always thought one request per minute was the most frequent ethical rate for crawler accesses, and I consider hitting more often than once every ten seconds to be clearly abusive.
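For anyone wondering what a server-level deny looks like, here is a minimal sketch. It assumes Apache with mod_setenvif (I'm not reproducing my actual configuration here), and the UserAgent substring is just illustrative:

```apache
# Sketch only: assumes Apache httpd with mod_setenvif loaded.
# Flag requests whose User-Agent contains the offending crawler's name.
SetEnvIfNoCase User-Agent "Everest" abusive_crawler

# Refuse any request flagged above (Apache 1.3/2.x-style access directives).
<Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=abusive_crawler
</Location>
```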
Am I alone in being concerned about this? More and more I see crawlers that hit at this level. Does every one of these crawler authors think they are the only one out there? When several dozen crawlers all hit your site every few seconds, it becomes a big issue for an average citizen. I get a little pissed off when I have to increase the size of my iron just to service your fricking abusive swarms of robots. Uncool, dudes, uncool.
Back when I last did crawler programming with Perl’s LWP::RobotUA module, its default was to enforce that you couldn’t hit the same host more often than once per minute. Has this completely dropped off the radar? I think anyone building a robot or a crawler or even a crawler module should institute this minimum as a default. As crawlers you are guests on these servers, so be good ones. Nothing sucks worse than a project whose value depends on the resources of others but is then a shitty steward of them. I just did one of these projects that involves consuming RSS feeds, and when I can I use modification times to avoid fetching feeds that haven’t changed. I try not to fetch too often, even though my project’s responsiveness would be improved by checking everyone’s feed more often. We’ve all got to coexist, so sometimes you have to bust out the golden rule.
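For the record, here’s a minimal sketch of both habits using LWP::RobotUA. The agent name, contact address, feed URL, and local filename are placeholders, not anything from my actual project:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder agent name and contact address -- identify your bot honestly,
# so site owners can find you the way I found Everest.
my $ua = LWP::RobotUA->new(
    agent => 'example-feed-reader/0.1',
    from  => 'you@example.com',
);

# Minutes to wait between requests to the same server. One minute is
# already the module's default; set it explicitly so nobody "fixes" it.
$ua->delay(1);

# mirror() does a conditional GET: it sends If-Modified-Since based on
# the local file's timestamp, so an unchanged feed costs a 304 and no body.
# LWP::RobotUA also fetches and obeys robots.txt before the request.
my $res = $ua->mirror('http://example.com/index.rss', 'index.rss');
print $res->status_line, "\n";
```

That 304 path is the whole point: polling someone’s feed on a schedule only stays cheap for them if unchanged content never gets re-downloaded.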
Update: A member of the Everest team was stand-up enough to leave an apologetic comment on this post, for which I thank them. It should also be noted that they were ethical enough to send an identifiable UserAgent, which is what allowed me to find them. Since I wrote this, I have had two other crawlers with nothing but the default Java UserAgent string hit at a much higher rate than Everest. Uncool, uncool.