• I use the excellent BB plugin from Michael Hampton, as no doubt a significant number of you do as well. The plugin has served me very well over the last year or so and I obviously recommend it very highly.

    My particular BB log table has shown over time that the two most widespread problems (from BB’s point of view) are the following (a rough sketch of both checks appears just after the list):

    1. Required header ‘Accept’ missing.
    2. Header ‘Pragma’ without ‘Cache-Control’ prohibited for HTTP/1.1 requests.
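
    In header terms, these two checks come down to something like the sketch below. To be clear, this is not Bad Behavior’s actual code, just a rough PHP illustration of the conditions involved (PHP exposes the incoming request headers via $_SERVER):

        <?php
        // Illustrative sketch only; not Bad Behavior's real implementation.
        $errors = array();

        // (1) Virtually every real browser sends an Accept header, so its
        // absence is a strong hint of a broken proxy or badly written bot.
        if (!isset($_SERVER['HTTP_ACCEPT'])) {
            $errors[] = "Required header 'Accept' missing";
        }

        // (2) Pragma is the old HTTP/1.0 cache directive; the argument is that
        // an HTTP/1.1 client sending it should send Cache-Control as well
        // (RFC 2616, section 14.32).
        if ($_SERVER['SERVER_PROTOCOL'] === 'HTTP/1.1'
            && isset($_SERVER['HTTP_PRAGMA'])
            && !isset($_SERVER['HTTP_CACHE_CONTROL'])) {
            $errors[] = "Header 'Pragma' without 'Cache-Control' prohibited for HTTP/1.1 requests";
        }
        ?>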

    Maybe this is not the case for some of you but anyway, if I could just get to the point.

    It’s my understanding that (1) is typically caused by misconfigured personal firewalls, proxies, download accelerators and privacy software. For instance, I have found that the customers (or staff) of Time Warner Telecom are by far the biggest culprits.

    The issue which seems somewhat contentious is (2), and I’ll cite PubSub as an example. Their crawler sends an HTTP/1.1 request to our site, only to be blocked by BB with a “Header ‘Pragma’ without ‘Cache-Control’ prohibited for HTTP/1.1 requests” error.

    I have reported the issue to them, but it seems that they have a much more “liberal” interpretation of RFC 2616 than Michael Hampton does. They’re basically saying that the RFC states an HTTP/1.1 request only “should”, not “must”, send the “Cache-Control” field if it sends the (deprecated) “Pragma” field. The upshot seems to be that even though the RFC says “Cache-Control” should be present when “Pragma” is present, they don’t consider their crawler “faulty”. Why? Because the RFC doesn’t say “must”, that’s why.

    Michael Hampton sees it from a different point of view: if they’re just going to send the “Pragma” field without “Cache-Control”, then they should be using HTTP/1.0, not HTTP/1.1 as they currently do.

    Of course, BB provides the mechanism to whitelist user agents and IP addresses, so that’s one solution. The other, I guess, is to do nothing and just forget about the potential traffic PubSub may send my way. I guess the question is: Why do I really need traffic from PubSub anyway?

    A very long post, I know.

    My question to all of you BB users out there is how do you handle the issue of (2) above? Who have you found to be the culprits? Have you tried contacting their webmasters and what responses have you got?

  • Well, it might be nice if BB had checkbox options to enable or disable certain tests, maybe citing for each test an example hacker, spammer, bad bot, etc. Especially when it comes to bots you NEED to crawl your site, it sucks to whitelist them one by one. At some point it’s better to let a trickle through, let an anti-spam plugin catch anything spammy, and let the bots that should be browsing your site browse away.

    I just saw Fark blocked from retrieving a posted link to a site, which means that site isn’t getting the traffic it wants from new articles… wonder how many other handmade ‘agents’ get blocked like that.

    -d

    Thread Starter: Pizdin Dim (@pizdin_dim)

    “let the bots that should be browsing your site browse away”

    I like that. But which bots are they? There are so many of them. I guess it’s also up to each webmaster to decide which bots should be browsing their site and which shouldn’t.

    In my case, I’m very happy with the fact that BB only seems to produce around 5% false positives. I really can’t ask for better than that. In my original post, one of the things I tried to highlight is how organisations such as PubSub are basically uninterested in modifying their bot to comply with the RFC specs, falling back on the old excuse:

    “should means we don’t have to comply with the recommendations and besides lots of other bots don’t comply so therefore you should modify your anti-spam software because it’s too strict”

    In the end, that’s just an excuse and a rather poor one at that.

    I come from the other camp: even ONE false positive is enough for me to not want to use it. 5% of my hits would be COMPLETELY unacceptable.

    Yeah, the reality is there are more people writing bots, seeing “should” rather than “must”, and doing the minimum necessary. It IS up to the anti-spam, anti-bot software writers to deal with those minimum-necessary cases: at the least, have a way to ‘enable’ stricter checking for those people who want it, and work with a less stringent check otherwise. Just IMHO. 😉
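
    To make that concrete, a per-test strictness toggle might look roughly like the sketch below. This is purely hypothetical, not a description of anything BB currently offers; it just shows the idea of only enforcing the Pragma rule when strict checking is switched on:

        <?php
        // Hypothetical per-test strictness toggle (illustration only).
        $options = array(
            'strict_pragma_check' => false,  // imagined "checkbox": off by default
        );

        function pragma_rule_violated(array $server) {
            return $server['SERVER_PROTOCOL'] === 'HTTP/1.1'
                && isset($server['HTTP_PRAGMA'])
                && !isset($server['HTTP_CACHE_CONTROL']);
        }

        if ($options['strict_pragma_check'] && pragma_rule_violated($_SERVER)) {
            header('HTTP/1.1 403 Forbidden');
            exit("Header 'Pragma' without 'Cache-Control' prohibited for HTTP/1.1 requests");
        }
        // Otherwise the request is let through, leaving anything spammy for a
        // content-level anti-spam plugin to catch.
        ?>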

    -d

    I’m as lenient as possible with human browsers, much less so for bots. After all, automated processes are what Bad Behavior is designed to stop.

    With respect to PubSub, they blame it on use of the libcurl library, and so far have not fixed their bot. This may be true, but I see plenty of other bots using libcurl with no problem whatsoever.

    From what I understand, libcurl also refuses to fix the problem from their end.

    The moral of the story is: don’t use libcurl to build a bot.

    🙂 heh.

    The problem is that people use curl figuring it’s a fix-all: they don’t need to implement the rules themselves, as libcurl should do that for them. If libcurl won’t play by the rules, lots of app(lets) are going to fail.
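
    That said, an application using PHP’s cURL bindings can take control of the offending headers itself rather than waiting on libcurl. A minimal, purely illustrative sketch (the URL and user agent are placeholders):

        <?php
        // Illustrative sketch: explicit header control in a cURL-based fetcher.
        $ch = curl_init('http://example.com/feed');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleBot/1.0 (+http://example.com/bot)');
        curl_setopt($ch, CURLOPT_HTTPHEADER, array(
            'Accept: */*',              // avoids the "Accept missing" complaint
            'Pragma:',                  // a bare "Pragma:" suppresses any default Pragma header
            'Cache-Control: no-cache',  // or pair Pragma with the HTTP/1.1 directive instead
        ));
        $body = curl_exec($ch);
        curl_close($ch);
        ?>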

    Then again, I have my own http/xml engine I use in CG-Amazon and CG-FeedRead specifically to get around both high-level library problems (like this), and low-level protocol issues (people trying to use url-fopen, and finding it disabled on their host…).
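
    For anyone curious what getting around url-fopen looks like at the lowest level, here is a bare-bones sketch (not the CG-Amazon/CG-FeedRead engine itself) of fetching a page over a raw socket with well-formed HTTP/1.1 headers; the host and path are placeholders:

        <?php
        // Illustration only: a raw-socket fetch avoids allow_url_fopen and
        // high-level libraries, and lets the caller choose every header sent.
        $host = 'example.com';  // placeholder
        $fp = fsockopen($host, 80, $errno, $errstr, 10);
        if (!$fp) {
            die("Connection failed: $errstr ($errno)");
        }

        $request  = "GET /feed HTTP/1.1\r\n";
        $request .= "Host: $host\r\n";
        $request .= "Accept: */*\r\n";             // don't omit Accept
        $request .= "Cache-Control: no-cache\r\n"; // HTTP/1.1 directive, no bare Pragma
        $request .= "User-Agent: ExampleBot/1.0\r\n";
        $request .= "Connection: close\r\n\r\n";
        fwrite($fp, $request);

        $response = '';
        while (!feof($fp)) {
            $response .= fgets($fp, 4096);
        }
        fclose($fp);
        // $response now holds the status line, headers and body; parsing
        // (including any chunked encoding) is left out of this sketch.
        ?>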

    -d

  • The topic ‘Bad Behaviour and PubSub’ is closed to new replies.