MANY 404 errors – but ONLY from googlebot

  • Resolved thefitrv

    (@thefitrv)


    Here’s the deal. My 404 log is filling up with requests for things like this:
    /wp-content/cache/autoptimize/js/autoptimize_3262629d5d3427a82464844a0d29c75a.js
    /wp-content/cache/autoptimize/js/autoptimize_0c00ec59d9b582cb71bcc0d9e2a1e600.js
    /wp-content/cache/autoptimize/js/autoptimize_78f5f9ff7310c387db0c7ec7b18d057b.js

    and so on, and so on. So numerous as to make the 404 logs quickly meaningless.

    These requests come ONLY from googlebot. And none of them have a referrer.

    Yes, the site does use WP Super Cache. But those links do not currently exist on my site. The site and the cache are in perfect sync. It is not possible to get these errors by looking at the live site.

    Here is how this is happening. Google is crawling my site and saving the pages. Sometime after Google crawls the page, I update the site, and the cache and autoptimize cache refresh. The site is perfect.

    However, some time later, Google comes back and tries to crawl all the links on the page it saved two days ago. These links (as above) no longer exist, and I get gobs of 404 errors – only from Googlebot, because only Googlebot saved the page.

    Previously, I worked around this with robots.txt by preventing Googlebot from crawling wp-content/cache. This eliminates these errors.

    However, with the new “mobile friendly” edict from Google, if I do not allow wp-content/cache to be crawled, the site appears mobile-un-friendly.

    Do you have any suggestions for eliminating these 404 errors? Google only visits so many times a day, and now half of those seem to be 404s.

    Thanks

    https://wordpress.org/plugins/autoptimize/

Viewing 11 replies - 1 through 11 (of 11 total)
  • Plugin Author Frank Goossens

    (@futtta)

    Now that’s a nice juicy problem to sink our teeth into, thefitrv 😉

    The problem indeed seems to be that Google is crawling asynchronously (HTML first, CSS and/or JS some time later). If “some time” is days or even weeks, and in the meantime you purged Autoptimize’s cache (for example after changing the config), then Google will indeed get a 404.

    There are 2 approaches to solving this:
    1. making sure the files don’t get removed: when updating the AO-config, press “Save changes” instead of “Save changes and empty cache”. Disadvantage: the size of your AO-cache will go up.
    2. making sure Google does not get to see URLs to AO-files: add a rewrite-rule to append ?ao_noptimize=1 if UA=GoogleBot (which, if you have the querystring-option checked, should also prevent WP Super Cache from returning the page from cache). Disadvantages: Google will not get the optimized version and you’re going to litter Google’s search results with ?ao_noptimize.
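
    A minimal sketch of what the rewrite-rule in approach 2 could look like, assuming Apache with mod_rewrite, a site-root .htaccess, and placement above the WordPress and WP Super Cache rewrite blocks. It uses an external redirect (which is why the parameter would show up in search results). Treat it as an untested illustration rather than a recommended configuration:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Only act on requests whose user agent contains "Googlebot" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# Skip requests that already carry the parameter, to avoid a redirect loop
RewriteCond %{QUERY_STRING} !(^|&)ao_noptimize=1(&|$)
# Leave physical files (images, existing JS/CSS, robots.txt) alone
RewriteCond %{REQUEST_FILENAME} !-f
# Redirect to the same URL with ao_noptimize=1 added; QSA keeps any existing query string
RewriteRule ^ %{REQUEST_URI}?ao_noptimize=1 [QSA,R=302,L]
</IfModule>
```

    The 302 is an assumption made here to keep the redirect temporary; the thread does not say which status code was intended.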

    A 3rd alternative could be to create js/.htaccess and have it redirect 404s to a yet-to-be-created js/found.php, which would issue an HTTP 200 while returning a random (?) autoptimize_xyz.js file (and do the same for CSS files). This will require development, but it really seems to be the best solution. So are you into development? 🙂
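
    The Apache side of that 3rd alternative might be sketched as below, for a js/.htaccess inside the Autoptimize cache directory. This assumes mod_rewrite is available and WordPress lives at the web root; found.php is the hypothetical handler mentioned above and still has to be written (it would need to send the 200 status and the JavaScript content type itself):

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Only when the requested cache file no longer exists on disk
RewriteCond %{REQUEST_FILENAME} !-f
# Hand stale autoptimize_<hash>.js requests to the (hypothetical) fallback script
RewriteRule ^autoptimize_[0-9a-f]+\.js$ /wp-content/cache/autoptimize/js/found.php [L]
</IfModule>
```

    A matching file under css/ would use the same rule with the .css extension.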

    frank

    Thread Starter thefitrv

    (@thefitrv)

    Thanks for the quick reply!
    I don’t do as much development as I used to, but I could probably give #3 a shot.

    Ideally, it would work only in /wp-content/cache/autoptimize/, and also only for Googlebot. (In case there ever is a problem with truly out-of-sync autoptimize files requested by a real user, I would still want to know of that).

    There is a separate .htaccess file in the /wp-content/cache/autoptimize/ directory. Is it OK to experiment with that? That will work for experiments, but it’s not a permanent solution.

    It won’t work permanently because, apparently, every time I update a post or page, WP-Super-Cache blasts the whole wp-content/cache directory. That’s how my autoptimize files turn over so quickly in the first place. I can see new directories and new .htaccess file created at the time I update a post. But I can experiment with that file and let you know what I come up with.

    Plugin Author Frank Goossens

    (@futtta)

    Great!

    wp-content/cache/autoptimize/.htaccess would be the right place for the 404 to be handled, but you can make the change to wp-content/plugins/autoptimize/classes/autoptimizeCache.php, which recreates the .htaccess if non-existing.

    weird that WPSC zaps your entire wp-content/cache dir, doesn’t do that for me, but that might depend on WPSC-settings?

    frank

    Thread Starter thefitrv

    (@thefitrv)

    It has to do with the WP Super Cache setting “Clear all cache files when a post or page is published or updated.”

    I have that enabled because we often need other pages re-cached when we update. With it enabled, EVERYTHING is deleted from wp-content/cache. If it is not enabled, then only the updated pages get purged.

    Do you think this is a bug in WP Super Cache? It seems like a plugin should only delete its own files, not another plugin’s, even if those files do reside in wp-content/cache.

    Plugin Author Frank Goossens

    (@futtta)

    no, I don’t consider that a bug in WPSC really, it’s a feature 🙂

    you could use AO’s API to change the place where AO’s cache is kept, away from the overzealous WPSC, cfr. the example code in the FAQ?

    Thread Starter thefitrv

    (@thefitrv)

    Well, that would certainly be easier…
    Although I’d still get the 404s when I made changes to .js or .css that required an AO cache rebuild. Those changes are much less frequent though.

    Maybe I’ll do both?

    Either way, thanks for the tips! I’ve got plenty to work with now.

    Plugin Author Frank Goossens

    (@futtta)

    I was looking into a 404-fallback solution as well yesterday and can share proof-of-concept code if you’re interested. Given its reliance on .htaccess it is pretty Apache-specific, but as you’re on Apache that shouldn’t bother you too much, now would it?

    Thread Starter thefitrv

    (@thefitrv)

    Interesting. If you’ve got something already, I’d certainly be willing to have a look and maybe test it out.

    I’ll email you, rather than share details here.

    Plugin Author Frank Goossens

    (@futtta)

    update; thefitrv & I exchanged ideas and he ended up editing .htaccess (and autoptimizeCache.php, which writes the .htaccess if removed) to send a “410 gone”-response to GoogleBot instead of a 404 to avoid the Google-404’s cluttering his 404-reporting.
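
    The exact rules were not posted in this thread, so the following is only a sketch of what a “410 for GoogleBot” rule could look like in wp-content/cache/autoptimize/.htaccess, assuming Apache with mod_rewrite and the autoptimize_<hash>.js/.css file naming seen in the 404 log above:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Only when the requested cache file no longer exists on disk
RewriteCond %{REQUEST_FILENAME} !-f
# Only for Googlebot, so stale requests from real visitors still show up as 404s
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# Answer "410 Gone" for missing Autoptimize JS/CSS files (the G flag sends the 410)
RewriteRule ^(js|css)/autoptimize_[0-9a-f]+\.(js|css)$ - [G]
</IfModule>
```

    Because the .htaccess is recreated when missing, the same lines would also have to be added to the template in autoptimizeCache.php, as described above.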

    frank

    Frank, could you share the solution? What should the syntax of that .htaccess be? I am still undecided between sending 410s and redirecting to an existing CSS/JS file in the cache.
    I have tons of 404s for the cache objects.

    Plugin Author Frank Goossens

    (@futtta)

    this one might help Grzegorz

    have fun,
    frank
