What Happened to Crawler Etiquette?

Looking at the server logs for this weblog a few minutes ago, I noticed the Everest crawler from Vulcan (which namechecks owner Paul Allen on that page) downloading pages. I also noticed that the requests were distressingly frequent, fetching pages of 30K and larger every few seconds. I filled out their feedback form, letting them know that I have denied them at the server level for as long as they run at this high a resource burden. I always thought once per minute was the most frequent ethical limit on crawler accesses, and I consider hitting more often than once every ten seconds to be clearly abusive.

Am I alone in being concerned about this? More and more I see crawlers that hit at this level. Does every one of these crawler authors think they are the only one out there? When you have several dozen crawlers all hitting your site every few seconds, it becomes a big issue for an average citizen. I get a little pissed off when I have to increase the size of my iron just to service your fricking abusive swarms of robots. Uncool, dudes, uncool.

Back when I last did crawler programming with Perl’s RobotUA module, its default was to enforce that you couldn’t hit the same domain more often than once per minute. Has this completely dropped off the radar? I think anyone building a robot or a crawler or even a crawler module should institute this minimum as a default. As crawlers you are guests on these servers, so be good ones. Nothing sucks worse than a project whose value depends on the resources of others but is then a shitty steward of them. I just did one of these projects that involves consuming RSS feeds, and when I can I use modification times to skip fetching feeds that haven’t changed. I try not to fetch too often, even though my project’s responsiveness would be improved by checking everyone’s feed more often. We’ve all got to coexist, so sometimes you have to bust out the golden rule.
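For what it’s worth, here is a minimal sketch of the kind of polite fetcher I mean, using LWP::RobotUA’s built-in per-host delay plus a conditional GET on a feed. The bot name, contact address, and feed URL are made-up placeholders, and real code would persist the last-modified timestamp between runs:

    #!/usr/bin/perl
    # Minimal sketch of a polite feed checker. Placeholder names throughout.
    use strict;
    use warnings;
    use LWP::RobotUA;
    use HTTP::Date qw(time2str);

    # LWP::RobotUA honors robots.txt and waits between requests to the same
    # host. Identify yourself honestly in the agent string.
    my $ua = LWP::RobotUA->new('example-feed-checker/0.1', 'you@example.com');
    $ua->delay(1);    # delay is in MINUTES; 1 is already the default

    my $feed_url      = 'http://example.com/index.rss';   # hypothetical feed
    my $last_modified = 0;    # persist this between runs in real code

    # Conditional GET: send If-Modified-Since so unchanged feeds cost almost nothing.
    my $res = $ua->get(
        $feed_url,
        $last_modified ? ( 'If-Modified-Since' => time2str($last_modified) ) : (),
    );

    if ($res->code == 304) {
        print "Feed unchanged, skipped the download\n";
    }
    elsif ($res->is_success) {
        $last_modified = $res->last_modified || time;
        print "Fetched ", length($res->content), " bytes\n";
    }
    else {
        warn "Fetch failed: ", $res->status_line, "\n";
    }

Note that delay() is measured in minutes, not seconds, so the out-of-the-box behavior is already the once-a-minute pace I’m asking for here.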

Update: A member of the Everest team was stand-up enough to leave an apologetic comment on this post, for which I thank them. It should also be noted that they were ethical enough to use an identifiable User-Agent string that allowed me to find them. Since I wrote this, I have had two other crawlers with nothing but the default Java user-agent string hit at a much higher rate than Everest ever did. Uncool, uncool.

Published by dave

Dave Slusher is a blogger, podcaster, computer programmer, author, science fiction fan and father. Member of the Podcast Hall of Fame class of 2022.

5 thoughts on “What Happened to Crawler Etiquette?”

  1. Nope, I hate them too. And I have to add, I don’t care about smaller services if they don’t treat me nicely – they can’t expect me to give them my server’s ‘attention’ all the time.

    When I have some more time on my hands I was actually going to start banning several of them – because they do not play nice and I get no value in return from them – except being hit and having my bandwidth used up.

    There are some annoying ones out there that do stupid fetching as well.

  2. Dave,

    The behavior of the Everest crawler is regrettable. Yours is the second complaint we have received. We agree it is improper and we have shut down the crawler until we can fix this bug, which recently emerged.

    We apologize for the inconvenience and wasteful use of your resources. We will not request opening your site back up to us until we’re 100% confident this will not happen again.

  3. It’s not as though there aren’t a million other sites to request a page from while allowing a courteous pause to any given site. Everest dropped by here 29/Nov/2005:23:02:46 -0600 but just picked up robots.txt and /

  4. I have several ‘Java’ IPs that visit often and chew through the site as fast as they can request pages. I guess I need to cook up a little script to deal with these pests.

  5. Found the silver bullet! Michael Hampton has written a plugin called Bad Behavior. It’s designed to stop evil bots in their tracks. http://www.ioerror.us/software/bad-behavior/
    I just activated it. Let me know if it stiffs the AmigoFish checker & I’ll whitelist it. I doubt you wrote a badly behaved bot, though. Obeying robots.txt and not lying about what you are should get you in the door unless you are a known abusive bot.
    AFAIK, it doesn’t specifically target aggressive page requests, but the bots that most annoy me do other bad things that should get them the boot.
