Thousands of 404 Errors With Googlebot Only
I have a serious problem which is directly related to the 404 problem that I mentioned in a previous post.
Just recently I began to use WP Super Cache. To my knowledge, everything is set up properly, and your site “Is My Blog Working” seems to verify this fact. Furthermore, when I log out from my blog, pages load quite quickly compared to when I am logged in.
I am currently running WordPress as a feature of one of my domains, and all of WP’s files and folders are located in a folder called “Blog” which is located at the top level of the root directory for the domain in question.
I am using a recent version of Apache, along with recent versions of PHP and MySQL.
In order to keep tabs on what is happening with the server, the four domains, and the other services that I run on the machine, I use a Unix binary called “MultiTail”, which displays the Apache access log on-screen in a live, automatically scrolling window. I have the different status codes color-coded in the MultiTail config file, so that different lines in the on-screen log display in different colors; in the case of 404s and 500s, I use red.
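For reference, this kind of color-coding is done with a colorscheme in the MultiTail config file. The scheme name and the regex below are only placeholders for my actual setup; the exact pattern will depend on your log format:

```
# In multitail.conf: a colorscheme (the name "apache" is arbitrary)
# that paints lines containing a 404 or 500 status code red.
# The regex assumes the status code appears surrounded by spaces,
# as in the common/combined log format.
colorscheme:apache
cs_re:red: (404|500) 

# Then invoke multitail with that scheme, e.g.:
# multitail -cS apache /path/to/apache/access.log
```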
It was this setup that helped me recently identify a problem with my blog. More accurately, it is a problem with thousands of 404 errors being generated by Googlebot. There is absolutely nothing wrong with my blog, or with the four domains that I run on the server; of this I am almost 100% certain.
The problem is this: Googlebot repeatedly attempts to access directories on my server, specifically in the “Blog” folder, which do not exist. My hunch, and it is only a hunch, is that Googlebot somehow got wind of the static HTML files (and, I assume, static folders) that WP Super Cache creates on my hard drive, where your plug-in stores those static files until the cache is purged and they are refreshed with new ones. I don’t know where that location is, or even whether I am right about this. As I said, it is only a hunch on my part.
I have my Apache access log set up so that, in addition to other info, it displays the path of each HTTP request. Here is a sample of what is happening with Googlebot. All of these are real examples; where you see “tag-name-here”, it is an actual tag, and where you see “post-name-here”, it is the actual name of a post:
200 /Blog/2011/07/26/post-name-here/
200 /Blog/2011/07/26/tag-name-here/
404 /Blog/2011/07/26/<
404 /Blog/2011/07/26/post-name-here/<
404 /Blog/page/32/<
404 /Blog/tag/tag-name-here/<
404 /Blog/tag/tag-name-here/page/2<
404 /Blog/tag/tag-name-here/tag-name-here/<
404 /Blog/wp-content/<
As you can see, whenever the server sends out a 404 to Googlebot, there is a “<” at the end of the log entry. However, whenever Googlebot makes a successful request, there is no “<” at the end of the line, and the server sends a 200 status code.
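To quantify the problem outside MultiTail, a quick shell pipeline can count the suspect entries. This is only a sketch against a sample file in the simplified two-column “status path” format shown above; in practice you would point the grep at your real access log, whose path and format will differ:

```shell
# Build a small sample in the simplified "status path" format shown above.
# In practice, grep your real Apache access log instead.
cat > /tmp/gbot_sample.log <<'EOF'
200 /Blog/2011/07/26/post-name-here/
404 /Blog/2011/07/26/<
404 /Blog/tag/tag-name-here/<
404 /Blog/wp-content/<
EOF

# Count the 404 lines whose requested path ends in a stray "<".
grep -c '^404 .*<$' /tmp/gbot_sample.log   # → 3
```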
The actual Apache error log on my hard drive doesn’t even show the above paths. All it shows is something like this:
[Wed Jul 27 19:13:48 2011] [error] [client 66.249.71.236] File does not exist: /Applications/MAMP/htdocs/Content
Please note that this problem occurs only with my WP blog, not with the rest of the site or with any of the other sites; and it seems to happen only with Googlebot, not with any of the many other bots that hit my server every hour.
In one case, as you can see from the example above, Googlebot attempted to dive into my “wp-content” folder and drilled down to the OpenID plug-in. A 404 was returned in that case.
So, do you have any idea why some Googlebot requests are successful, while others are not?
Is this problem related to WP Super Cache as I suspect?
Could it be, as I suggested earlier, that Googlebot is trying to access static files which no longer exist because WP Super Cache has already purged them?
I have the cache expiry time set to 1800 seconds.
Also, per instructions that I believe I read somewhere on your site, I cleared the “Rejected User Agents” field, so it is now empty.
Please note that I want to continue to allow Googlebot to spider our site, and my blog. I do not want to block Googlebot. I just want to stop all of these 404 errors.
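One possible stopgap, while the root cause is tracked down, might be to 301-redirect any request whose path ends in that stray “<” back to the clean URL. This is only a sketch of an .htaccess rule for the document root; it assumes mod_rewrite is enabled and that Apache sees the decoded “<” (crawlers request it encoded as %3C), so it should be tested on a staging copy first:

```
# Hypothetical .htaccess rule (document root), assuming mod_rewrite:
# permanently redirect /Blog/anything< to /Blog/anything.
RewriteEngine On
RewriteRule ^(.*)<$ /$1 [R=301,L]
```

For what it’s worth, a trailing “<” on crawled URLs often points to malformed anchor markup somewhere in the generated HTML (for example, an unclosed href that a crawler then follows literally), so it may also be worth validating the source of a few cached pages.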
Please understand that this problem is of great concern to me, and here’s why:
When I visit the Google Analytics page for this domain, I now see literally thousands of 404 errors because of this Googlebot/blog problem, and I am concerned that it is going to affect our PageRank if I don’t correct it as soon as possible.
If you have any ideas, I’d really appreciate some feedback.
I really don’t wish to reveal the inner workings of my server publicly, so if you have any sensitive questions along those lines, I’d appreciate it if we could discuss them privately.
Thanks so much!