site overwhelmed (shut down) by spiders and feeders (20 posts)

  1. webdev2
    Posted 8 years ago #

    I have four sites, each with their own installed code base, on one hosting account. My sites go down, all at the same time, from time to time. This last time my host was able to tell me "your site was aggressively retrieved by several indexing/searching sites at one time, Yahoo! DE Slurp, Yahoo! Slurp, msnbot, msnbot-media, Sphere Scout, Feedfetcher-Google, Baiduspider+, Moreoverbot, ScoutJet, Googlebot"

    I would restart the server and it would immediately go down again because of a 15-connection limit on MySQL.

    According to Google Analytics I don't have much traffic at all, so I am at a loss about how to prevent or solve this problem.

    I don't know which pages they are looking at/indexing.

    I read somewhere that you can offload your rss feed to feedburner but I don't know if that would solve this issue or not.

    Any ideas are much appreciated!

  2. rawalex
    Posted 8 years ago #

    Umm, get better hosting. 15 connections is a really, really small number.

  3. hotkee
    Posted 8 years ago #

    and 4 sites is too many on one host.

  4. Otto
    Posted 8 years ago #

    WP-Super-Cache. Why hit the database at all?

  5. Lester Chan
    Posted 8 years ago #

    4 sites is not really a lot for one host unless those 4 sites have each more than 1K unique per day on them.

  6. whooami
    Posted 8 years ago #

    "4 sites is not really a lot for one host unless those 4 sites have each more than 1K unique per day on them"

    huh? Assuming you mean host == server, 4k uniques isn't that much for a well-equipped box.

    The thing is that the OP is probably on shared hosting, and it's oversold, of course.

    Traffic is traffic, regardless of what's behind the IP, and like was pointed out, 15 consecutive mysql connections is nothing.

    And even if host == one hosting acct serving multiple domains: 4000 uniques is fine as long as you are within any bandwidth limitations ... it's the bandwidth limitation that's important. 30G a month is just about right for that, IF there isn't a lot of media being served.

  7. Lester Chan
    Posted 8 years ago #

    @whooami oops, my bad, I meant using the same account on the shared server. I have to agree on the oversold part, it is definitely oversold.

  8. whooami
    Posted 8 years ago #

    :) I think we all agree, time for this person to get a better host. Tell them to wank off, ask for your money back, and move on, webdev2 :)

    Or use wp-supercache, I s'pose.

    But that's probably a band-aid on a larger problem.

  9. webdev2
    Posted 8 years ago #

    thanks Otto. I use WP-Cache Manager now. Is WP-Super-Cache better?

  10. Otto
    Posted 8 years ago #

    WP-Super-Cache is far better, especially if most of your readers don't log into your site.

    Also, WP-Super-Cache incorporates the older WP-Cache functionality right into it. WP-Cache is no longer supported, I believe.

  11. webdev2
    Posted 8 years ago #

    Many thanks Otto!

  12. webdev2
    Posted 8 years ago #

    Otto - Under:
    Rejected User Agents

    Do I need to leave these in or remove them if I want to still be found and indexed by everyone who wants to:

  13. palamedes
    Posted 8 years ago #

    I would suggest following robots.txt as well ..

    Then parse your logs; any bot that isn't following it, route to localhost.

  14. webdev2
    Posted 8 years ago #

    palamedes - I appreciate your help but I have no idea what that means.

  15. palamedes
    Posted 8 years ago #

    Ah sorry..


    The robots.txt file is a file you can put on your site that instructs the various web search bots what they can and can't crawl. Moreover, you can put in a Crawl-delay that says "only crawl this often". (Google ignores Crawl-delay, but you can log into their Webmaster Tools and set the crawl rate there.)

    The robots.txt file on my site looks like this:

    User-agent: *
    Crawl-delay: 240
    Disallow: /mint/
    Disallow: /uploads/
    Disallow: /trap/

    The lines that are important for you are the Crawl-delay and the Disallows. Basically it tells robots not to crawl anything in the mint, uploads, or trap directories, and to wait 240 seconds between requests to my site.

    The crawl-delay will do a lot to help keep the bots that follow robots.txt at bay.

    ANY robot that doesn't follow that file, or falls into the trap of hitting a directory specifically disallowed, will show up in my logs. My log scraper then routes them to localhost:

    route add -host {incoming.annoying.bots.ip} gw 127.0.0.1

    What this does is basically tell their bot to go talk to itself. Usually the bot will hang there, holding its TCP connection open until it times out (5 minutes or so), and it costs you nothing ~ it's a way to slow 'em down, or at the very least give them a "go away" message.

    Search your logs for any IP that is pounding on your site and route it..
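    To make the log-scraping step concrete, here is a minimal sketch. The sample log lines and the 203.0.113.x / 198.51.100.x addresses are made up for illustration; with a real server you would point awk at your actual access log instead:

    ```shell
    # Fabricated access-log sample (first field = client IP), standing in
    # for a real Apache/Nginx access.log.
    printf '%s\n' \
      '203.0.113.9 - - [01/Jan/2017:00:00:01] "GET /feed HTTP/1.1" 200' \
      '203.0.113.9 - - [01/Jan/2017:00:00:02] "GET /feed HTTP/1.1" 200' \
      '198.51.100.4 - - [01/Jan/2017:00:00:03] "GET / HTTP/1.1" 200' > access.log

    # Count requests per IP, heaviest hitters first -- these are the
    # candidates to route away (route add -host <ip> gw 127.0.0.1, as root).
    awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5
    ```

    In this sample, 203.0.113.9 tops the list with two hits, so that is the address you would consider routing.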

  16. whooami
    Posted 8 years ago #

    I prefer:

    ip rule add blackhole from 203.0.113.12

    where 203.0.113.12 stands in for the icky IP
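    A fuller sketch of that approach, with 203.0.113.9 standing in for the abusive address (these need root, so treat them as a fragment to adapt, not something to paste blindly):

    ```shell
    # Drop everything coming from the abusive IP at the routing-policy layer.
    ip rule add blackhole from 203.0.113.9

    # Verify the rule is installed, and remove it once the bot gives up.
    ip rule list
    ip rule del blackhole from 203.0.113.9
    ```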

  17. palamedes
    Posted 8 years ago #

    Well that's just an iptables block though..

    My way actually hangs them while their TCP socket times out.. *glee*

  18. whooami
    Posted 8 years ago #

    No, that's a routing command, and has nothing to do with iptables.

    man ip


  19. palamedes
    Posted 8 years ago #

    Ah cool.. Shows you what I know (I've always used route.. heh )

  20. webdev2
    Posted 8 years ago #

    many thanks guys!! very helpful info.

    so far the Super Cache is working well. I'll tackle the IP measures if this stops meeting my objectives.

    Many thanks again

Topic Closed

This topic has been closed to new replies.
