Sitemap contains urls which are blocked by robots.txt. (The SEO Framework plugin support forum)

  • Resolved Terence

    (@pubdirltd)


    Hi Sybre,

    I submitted a sitemap with 53 URLs for a new site, but Google Webmaster Tools told me that all 53 were blocked by robots.txt. So I checked, and this is what I found, which looks rather odd to me.

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
    Disallow: /wp-includes/
    Disallow: /*?*
    
    Sitemap: http://wpexpert.support/sitemap.xml

    What do you think?

    https://wordpress.org/plugins/autodescription/

Viewing 15 replies - 1 through 15 (of 22 total)
  • Plugin Author Sybre Waaijer

    (@cybr)

    Hi Terence,

    That’s odd; there’s nothing wrong with your URLs, the pages they output, or the robots.txt.

    The SEO Framework only adds the following lines:

    Disallow: /wp-includes/
    Disallow: /*?* #filterable, disallows queries
    
    Sitemap: http://example.com/sitemap.xml #listens to options

    Have you recently changed the permalink structure, or the way AnsPress is configured? It may be that AnsPress previously added query strings through some configuration.
    Whatever the case, I recommend blocking URLs with query strings, because how they show up on Google is unpredictable and might cause users to leave. This is what the /*?* line does.
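
    To make the effect of the /*?* rule concrete, here is a minimal Python sketch of wildcard matching. It assumes Google-style semantics (`*` matches any run of characters, a trailing `$` anchors the end, and a pattern otherwise matches as a prefix). The function name `rule_matches` and the sample paths are hypothetical and only serve as illustration; other crawlers may interpret wildcards differently.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumed semantics (Google-style): '*' matches any run of characters,
    a trailing '$' anchors the end, everything else is a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start only, which gives prefix matching.
    return re.match(regex, path) is not None


# Hypothetical paths (path plus query string), for illustration only.
for path in (
    "/questions/question/purchased-theme/",
    "/?s=theme",
    "/wp-content/uploads/logo.png?ver=2",
):
    verdict = "matched" if rule_matches("/*?*", path) else "not matched"
    print(f"{path}: {verdict} by /*?*")
```

    Under these assumptions, a plain permalink is not matched by the rule, while any path that carries a query string is, including static files that are requested with a version parameter.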

    I recommend asking Google to re-index your URLs. Give it a day, then search Google with the following query:
    site:wpexpert.support
    If all is well, Google will list all indexable URLs with their descriptions.
    If not, I recommend adding a random page just to see whether it gets indexed at all.

    I have just submitted the request for you. (https://www.google.com/webmasters/tools/submit-url)

    I think Google Webmaster Tools also adds timestamps for when problems were found. I also think you can mark them as solved; the problems will reappear automatically if they persist.

    I hope this helps! Enjoy your day 🙂

    Terence

    (@pubdirltd)

    Hmmm, methinks there’s more here than immediately meets the eye.

    Yes, I had recently updated the version of AnsPress AND I had also switched off ‘Discourage search engines from indexing this site’

    So, all these could be in play.

    But if I take what you are saying as correct, I still don’t know where these two lines are coming from ~

    Allow: /wp-admin/admin-ajax.php
    Disallow: /wp-includes/

    To find out what’s screwing with my robots.txt I disconnected SEO Framework, and when I did that, my robots.txt became just this ~

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    And when I reactivate SEO Framework, it looks like this ~

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
    Disallow: /wp-includes/
    Disallow: /*?*
    
    Sitemap: http://wpexpert.support/sitemap.xml

    So that seems fairly conclusive, but I still don’t know what’s adding the first two lines.

    Also, as a point of interest, when I deactivated SEO Framework and used this URL ~ http://wpexpert.support/questions/question/purchased-theme/ ~ to do a fetch and render in Google’s Search Console, I got a status of “complete”. But when I reactivated it and did a fetch and render with the same URL, I got a status of “partial” ~ http://screencast.com/t/YdmD7BEcmU

    I don’t think I have got to the bottom of this yet. Not by a long way.

    Terence.

    It seems that

    Disallow: /wp-includes/

    is what’s causing the blocking ~ http://screencast.com/t/dkhDo1OvI

    And I did find one that’s being caused by

    Disallow: /*?*

    http://screencast.com/t/WTsT1atUH

    Having had Google re-scan the sitemap, it seems there are still 55 URLs which are blocked in SEO Framework ~ http://screencast.com/t/avECjendAU

    These can only be the pages I set to no-index as they have no SEO value, like terms and conditions, privacy policy, search pages etc.

    But it looks like, these days, Google is demanding to see everything and then makes up its own mind.

    Is this the beginning of the end for SEO, I wonder?

    Now I am totally confused.

    I removed all the no-index, no-follow, no local-search flags from all the admin URLs, and now the message reads “Sitemap contains 66 urls which are blocked by robots.txt.”

    VERY strange.

    Plugin Author Sybre Waaijer

    (@cybr)

    Hi Terence,

    There’s a lot to process here, so I’m going down the tree with blockquotes :).

    > Yes, I had recently updated the version of AnsPress AND I had also switched off ‘Discourage search engines from indexing this site’

    That’s good. Had you blocked indexing of the site before, while you were still building it?

    Google takes its time to re-crawl your website, so slowly but surely your links will become visible again to the public on Google Search.

    Keep the site:wpexpert.support search query in mind. It will show you what’s indexed.

    I already see 8 more links than 10 hours ago.

    I’m not really sure how frequently you may ask Google to re-index, but keep in mind that Google can ignore you. I have also just filed a re-index request with Bing.

    > But if I take what you are saying as correct, I still don’t know where these two lines are coming from ~

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    Those come from the WordPress 4.3 defaults. WordPress has a robots.txt filter in place (robots_txt), which The SEO Framework makes use of. If you place a manual robots.txt file in your root web folder, the filter stops working and the file in your root folder is output instead.

    The SEO Framework just adds to them. In fact, as of this moment it will conflict with other robots.txt plugins. A fix for that is unrelated to this issue, but it is planned for The SEO Framework 2.4.4.

    > It seems that
    >
    > Disallow: /wp-includes/
    >
    > is what’s causing the blocking ~ http://screencast.com/t/dkhDo1OvI

    You are correct, in a way: it blocks all files under /wp-includes/. I did this to keep Google from following redirects caused by older WPMUdev Domain Mapping versions; it’s my fault that I didn’t leave it out of this public version.

    However, this shouldn’t cause any problems, because it merely blocks the jQuery files on your site, which Googlebot doesn’t actually need in order to render the page.
    It doesn’t block actual pages.
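
    The point that these rules block files rather than pages can be checked with a small Python sketch. It uses only the three prefix rules from the robots.txt in this thread (the wildcard rule is left out so that plain prefix matching suffices) and assumes the precedence Google documents: the longest matching rule wins, and a tie goes to Allow. The helper name `is_allowed` and the sample paths are hypothetical.

```python
# Rules from the robots.txt in this thread, minus the wildcard rule,
# so that plain prefix matching is enough for this sketch.
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/wp-includes/"),
]


def is_allowed(path: str, rules=RULES) -> bool:
    """Apply the longest matching rule; ties go to 'allow'.

    Assumption: this is the precedence Google documents. Other crawlers
    may resolve conflicts differently. No matching rule means allowed.
    """
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical sample paths, for illustration only.
for path in (
    "/questions/question/purchased-theme/",
    "/wp-includes/js/jquery/jquery.js",
    "/wp-admin/admin-ajax.php",
    "/wp-admin/options-general.php",
):
    print(path, "->", "allowed" if is_allowed(path) else "blocked")
```

    With these rules, the question page stays crawlable and only the script under /wp-includes/ and the admin screen are blocked, while admin-ajax.php remains reachable because its Allow rule is the longer match.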

    > And I did find one that’s being caused by
    >
    > Disallow: /*?*
    >
    > http://screencast.com/t/WTsT1atUH

    So it works as intended! However, we’re now in the world of CDN cache busting, so I should make it more specific (to still allow .svg, .png, etc.).
    This will also be put in place for The SEO Framework 2.4.4.
    But once again, this doesn’t block actual pages.

    > Having had Google re-scan the sitemap, it seems there are still 55 URLs which are blocked in SEO Framework ~ http://screencast.com/t/avECjendAU

    This shouldn’t happen! Read the recap below on this one.

    > But it looks like, these days, Google is demanding to see everything and then makes up its own mind.

    That’s correct! Google will ignore some parts of the robots.txt file just to correctly render allowed pages.

    > Now I am totally confused.
    >
    > I removed all the no-index, no-follow, no local-search flags from all the admin URLs, and now Sitemap contains 66 urls which are blocked by robots.txt.
    >
    > VERY strange.

    That’s expected: the count of blocked URLs also includes files, for example those with query strings or those under /wp-includes/.

    To recap:

    1. Google should index every page in accordance with your robots.txt file.
    2. Google will use everything it needs to render each page correctly, even if it is blocked by the robots.txt file.
    3. Google will not index the pages or files which are blocked by the robots.txt file, although some special search queries, e.g. site:wpexpert.support, will still reveal them.
    4. Google might cache the robots.txt file, or any page for that matter.
    5. Because Google caches everything, some results may be outdated. The noodp and noydir directives help limit these mistakes, but not all of them.
    6. Google will revisit your pages when you ask it to.
    7. Google will limit its crawls, although you can ask it to crawl more often.

    And last but not least:
    8. Give Google a few days to correctly reindex everything.

    P.S.
    Use this filter to remove the rule that blocks query strings (/*?*):
    add_filter( 'the_seo_framework_robots_allow_queries', '__return_true' );

    Terence

    (@pubdirltd)

    Sybre, thank you for taking the time to give me such a clear and detailed answer.

    Yes the site was ‘hidden’ from Google before, but now the gates are open.

    site:wpexpert.support produces some telltale results. It seems Google had already visited the site and indexed stuff I didn’t want it to.

    No doubt a trip to the search console will put that right.

    Plugin Author Sybre Waaijer

    (@cybr)

    Hi @terence,

    When I search through Google for your site, it seems that Google has indexed everything well again :). It just took a few hours.

    Bing is a bit slower on that part, so that will take a while longer.

    Hope this is solved on your part!
    Do expect a robots_txt overhaul in the next update, which I actually started planning two days ago 🙂 This overhaul will make sure all your images are also correctly indexed!

    Thanks and have a great day. Happy holidays!

    Terence

    (@pubdirltd)

    What, even the 100 zero length base64 Gravatars? 8^)

    Plugin Author Sybre Waaijer

    (@cybr)

    Nope, that’s from a different domain, so a different robots.txt file 🙂

    The robots.txt is only for the (sub-)domain in question.

    So, for instance, if you load a file from a different domain, e.g. a CDN or Gravatar, the robots.txt file of that domain is read and applied to that file.
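
    As a small illustration of that per-host scoping, the Python sketch below derives which robots.txt governs a given resource URL. The function name `robots_url_for` and the Gravatar URL are hypothetical examples, not taken from the site in question.

```python
from urllib.parse import urlsplit


def robots_url_for(resource_url: str) -> str:
    """Return the robots.txt URL that applies to a resource.

    The file is looked up at the root of the resource's own scheme and
    host, so assets on another (sub-)domain follow that host's rules.
    """
    parts = urlsplit(resource_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# The second URL is a made-up avatar address, for illustration only.
for url in (
    "http://wpexpert.support/questions/question/purchased-theme/",
    "https://secure.gravatar.com/avatar/0123456789abcdef?s=64",
):
    print(url, "->", robots_url_for(url))
```

    A page on wpexpert.support and an avatar embedded in it therefore resolve to two different robots.txt files.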

    I hope this clears things up 🙂

    Terence

    (@pubdirltd)

    I knew that… 😉

    I just checked the search console and I don’t appear to be doing very well ~ http://screencast.com/t/Yct1nmuKlr0T

    Plugin Author Sybre Waaijer

    (@cybr)

    I believe it needs some time to sync everything to the webmaster console 🙂

    Google has correctly indexed your website, and Bing is slowly starting to add links.

    Please note that getting ranked “by tomorrow” is very rare; it more typically takes anywhere from a few weeks to three months.

    P.S. “Blocked by robots.txt” doesn’t necessarily mean that a URL is blocked by the robots.txt file; it could also mean that it’s blocked by the on-page robots meta tag. I checked those and everything’s fine 🙂
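
    One way to check the on-page robots meta tag yourself is sketched below in Python, using only the standard library. The class and function names and the sample HTML are hypothetical; this is not how the check in this thread was actually performed, and it ignores the X-Robots-Tag HTTP header, which can carry the same directives.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        self.directives += [
            d.strip().lower() for d in content.split(",") if d.strip()
        ]


def robots_meta_directives(html: str) -> list:
    """Return the robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up page source, for illustration only.
sample = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'
directives = robots_meta_directives(sample)
print("indexable" if "noindex" not in directives else "blocked by meta tag")
```

    A page whose source yields no `noindex` directive is not being held back by its meta tag, so any remaining block would have to come from robots.txt or elsewhere.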

    Terence

    (@pubdirltd)

    Maybe not tomorrow, but how about the day after?

    And I am not sure you are correct about it taking time to sync, but we will see… 8^)
