Thousands of 404 Errors With Googlebot Only
I have a serious problem which is directly related to the 404 problem that I mentioned in a previous post.
Just recently I began to use WP Super Cache. To my knowledge, everything is set up properly, and your site “Is My Blog Working” seems to verify this fact. Furthermore, when I log out from my blog, pages load quite quickly compared to when I am logged in.
I am currently running WordPress as a feature of one of my domains, and all of WP’s files and folders are located in a folder called “Blog” which is located at the top level of the root directory for the domain in question.
I am using a recent version of Apache, along with recent versions of PHP and MySQL.
In order to keep tabs on what is happening with the server, the four domains, and the other services that I run on the machine, I use a Unix binary called “MultiTail”, which displays the Apache access log on-screen in a live, automatically scrolling window. I have the different status codes color-coded in the MultiTail config file, so that different lines in the on-screen log display in different colors; in the case of 404s and 500s, I use red.
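For reference, this kind of color-coding is done with a colorscheme in the MultiTail config file. The scheme name and the regex below are only placeholders for my actual setup; the exact pattern will depend on your log format:

```
# In multitail.conf: a colorscheme (the name "apache" is arbitrary)
# that paints lines containing a 404 or 500 status code red.
# The regex assumes the status code appears surrounded by spaces,
# as in the common/combined log format.
colorscheme:apache
cs_re:red: (404|500) 

# Then invoke multitail with that scheme, e.g.:
# multitail -cS apache /path/to/apache/access.log
```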
It was this setup that helped me recently identify a problem with my blog. More accurately, it is a problem with thousands of 404 errors being generated by Googlebot. There is absolutely nothing wrong with my blog, or with the four domains that I run on the server; of this I am almost 100% certain.
The problem is this: Googlebot repeatedly attempts to access directories on my server, specifically in the “Blog” folder, which do not exist. My hunch, and it is only a hunch, is that Googlebot somehow got wind of the static HTML files (and, I assume, static folders) that WP Super Cache creates on my hard drive, where your plug-in stores those static files until the cache is purged and they are refreshed with new ones. I don’t know where that location is, or even whether I am right about this. As I said, it is only a hunch on my part.
I have my Apache access log set up so that, in addition to other info, it displays the path of each HTTP request. Here is a sample of what is happening with Googlebot. All of these are real examples; where you see “tag-name-here”, it is an actual tag, and where you see “post-name-here”, it is the actual name of a post:
200 /Blog/2011/07/26/post-name-here/
200 /Blog/2011/07/26/tag-name-here/
404 /Blog/2011/07/26/<
404 /Blog/2011/07/26/post-name-here/<
404 /Blog/page/32/<
404 /Blog/tag/tag-name-here/<
404 /Blog/tag/tag-name-here/page/2<
404 /Blog/tag/tag-name-here/tag-name-here/<
404 /Blog/wp-content/<
As you can see, whenever the server sends out a 404 to Googlebot, there is a “<” at the end of the log entry. However, whenever Googlebot makes a successful request, there is no “<” at the end of the line, and the server sends a 200 status code.
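To quantify the problem outside MultiTail, a quick shell pipeline can count the suspect entries. This is only a sketch against a sample file in the simplified two-column “status path” format shown above; in practice you would point the grep at your real access log, whose path and format will differ:

```shell
# Build a small sample in the simplified "status path" format shown above.
# In practice, grep your real Apache access log instead.
cat > /tmp/gbot_sample.log <<'EOF'
200 /Blog/2011/07/26/post-name-here/
404 /Blog/2011/07/26/<
404 /Blog/tag/tag-name-here/<
404 /Blog/wp-content/<
EOF

# Count the 404 lines whose requested path ends in a stray "<".
grep -c '^404 .*<$' /tmp/gbot_sample.log   # → 3
```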
The actual Apache error log on my hard drive doesn’t even show the above paths. All it shows is something like this:
[Wed Jul 27 19:13:48 2011] [error] [client 66.249.71.236] File does not exist: /Applications/MAMP/htdocs/Content
Please note that this problem occurs only with my WP blog, not with the rest of the site or with any of the other sites; and it seems to happen only with Googlebot, not with any of the many other bots that hit my server every hour.
In one case, as you can see from the example above, Googlebot attempted to dive into my “wp-content” folder and drilled down to the OpenID plug-in. A 404 was returned in that case.
So, do you have any idea why some Googlebot requests are successful, while others are not?
Is this problem related to WP Super Cache as I suspect?
Could it be, as I suggested earlier, that Googlebot is trying to access static files which no longer exist because WP Super Cache has already purged them?
I have the cache expiry time set to 1800 seconds.
Also, per instructions that I believe I read somewhere on your site, I cleared the “Rejected User Agents” field, so it is now empty.
Please note that I want to continue to allow Googlebot to spider our site, and my blog. I do not want to block Googlebot. I just want to stop all of these 404 errors.
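One possible stopgap, while the root cause is tracked down, might be to 301-redirect any request whose path ends in that stray “<” back to the clean URL. This is only a sketch of an .htaccess rule for the document root; it assumes mod_rewrite is enabled and that Apache sees the decoded “<” (crawlers request it encoded as %3C), so it should be tested on a staging copy first:

```
# Hypothetical .htaccess rule (document root), assuming mod_rewrite:
# permanently redirect /Blog/anything< to /Blog/anything.
RewriteEngine On
RewriteRule ^(.*)<$ /$1 [R=301,L]
```

For what it’s worth, a trailing “<” on crawled URLs often points to malformed anchor markup somewhere in the generated HTML (for example, an unclosed href that a crawler then follows literally), so it may also be worth validating the source of a few cached pages.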
Please understand that this problem is of great concern to me, and here’s why:
When I visit the Google Analytics page for this domain, I now see literally thousands of 404 errors because of this Googlebot/blog problem, and I am concerned that it is going to affect our PageRank if I don’t correct it as soon as possible.
If you have any ideas, I’d really appreciate some feedback.
I really don’t wish to reveal the inner workings of my server publicly, so if you have any sensitive questions along those lines, I’d appreciate it if we could discuss them privately.
Thanks so much!