XML Sitemap & Google News feeds
Google Webmaster Tools: News Sitemap is HTML (9 posts)

  1. Loewenherz
    Posted 1 year ago #

    Google Webmaster Tools reports an error for the News Sitemap:
    "Your sitemap appears to be an HTML page. Please use a supported sitemap format instead."
    The code looks fine to me at http://www.reiki-land.de/sitemap-news.xml/
    Or is it a problem with the slash at the end?
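
A quick way to tell whether a URL serves a real sitemap or an HTML page is to parse the response body and look at the root element. The following is a minimal Python sketch, not part of the plugin; the function name `looks_like_sitemap` is made up for illustration, and it assumes that a valid sitemap has `urlset` or `sitemapindex` as its root element, as the sitemaps.org protocol describes.

```python
import xml.etree.ElementTree as ET

# Root elements used by the sitemap protocol (plain sitemap and sitemap index).
SITEMAP_ROOTS = {"urlset", "sitemapindex"}


def looks_like_sitemap(body: bytes) -> bool:
    """Return True if body parses as XML and its root is a sitemap element."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML at all (typical for most HTML pages).
        return False
    # Strip the "{namespace}" prefix that ElementTree puts before the tag name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in SITEMAP_ROOTS
```

Feeding it the body fetched from the sitemap URL, with and without the trailing slash, would show whether both variants return sitemap XML.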


  2. paralyys
    Posted 1 year ago #

    Same problem here. Using Polylang and WP 3.9.1.

  3. RavanH
    Plugin Author

    Posted 1 year ago #

    @Loewenherz - sorry I missed your post before. Are you still having issues? The sitemap (with or without slash, preferably without) looks good... It might have been a caching issue.

    @paralyys - can you share a link?

  4. paralyys
    Posted 1 year ago #

    Webmaster tools screenshot.

    The culprit was a "hardcoded" robots.txt:

    sitemap: http://xxx/sitemap.xml
    User-agent:  *
    # disallow all files in these directories
    Disallow: /cgi-bin/
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-content/
    Disallow: /archives/
    disallow: /*?*
    Disallow: *?replytocom
    Disallow: /wp-*
    Disallow: /author
    Disallow: /comments/feed/
    User-agent: Mediapartners-Google*
    Allow: /
  5. paralyys
    Posted 1 year ago #

    But now I have another problem. The WP-generated robots.txt is:

    # XML Sitemap & Google News Feeds version 4.3.2 - http://status301.net/wordpress-plugins/xml-sitemap-feed/
    Sitemap: http://xxx.xxx.com/sitemap.xml
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: */xmlrpc.php
    Disallow: */wp-*.php
    Disallow: */trackback/
    Disallow: *?wptheme=
    Disallow: *?comments=
    Disallow: *?replytocom
    Disallow: */comment-page-
    Disallow: *?s=
    Disallow: */wp-content/
    Allow: */wp-content/uploads/

    and Webmaster Tools says "Sitemap contains urls which are blocked by robots.txt" - screenshot

    I tried adding

    Allow: */sitemap-home.xml
    Allow: */sitemap-posttype-page.xml
    Allow: */sitemap-posttype-post.xml

    to robots.txt through settings but no dice.
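
To check by hand whether rules like these block a given path, the matching can be sketched in Python. This is a rough sketch under stated assumptions, not Google's actual implementation: it assumes the precedence Google documents for its crawler (`*` matches any run of characters, a trailing `$` anchors the end, the longest matching pattern wins, and Allow wins a tie). The names `RULES` and `is_allowed` are made up for illustration; the rules are copied from the generated robots.txt and the three added Allow lines above.

```python
import re

# (directive, pattern) pairs taken from the robots.txt quoted in this post.
RULES = [
    ("disallow", "/wp-admin/"),
    ("disallow", "/wp-includes/"),
    ("disallow", "*/xmlrpc.php"),
    ("disallow", "*/wp-*.php"),
    ("disallow", "*/trackback/"),
    ("disallow", "*?wptheme="),
    ("disallow", "*?comments="),
    ("disallow", "*?replytocom"),
    ("disallow", "*/comment-page-"),
    ("disallow", "*?s="),
    ("disallow", "*/wp-content/"),
    ("allow", "*/wp-content/uploads/"),
    ("allow", "*/sitemap-home.xml"),
    ("allow", "*/sitemap-posttype-page.xml"),
    ("allow", "*/sitemap-posttype-post.xml"),
]


def _pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if path (including any query string) may be crawled."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if _pattern_to_regex(pattern).match(path):
            allow = kind.lower() == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow
```

Under these assumptions, `is_allowed("/sitemap.xml", RULES)` and `is_allowed("/sitemap-posttype-post.xml", RULES)` both come out as allowed, while a path such as `/wp-login.php` is blocked by `*/wp-*.php`.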

    Sorry for the secrecy and thanks for the help.

  6. RavanH
    Plugin Author

    Posted 1 year ago #

    @paralyys - the rules you added should explicitly allow access to these sitemaps. I cannot see why access would be blocked by the current robots.txt rules. Are you sure the old robots.txt is not cached somewhere, for example in a server cache? Sometimes a logged-in user sees something different from what anonymous requests get. You can use a tool like http://web-sniffer.net (with the option "Raw" enabled) to see what Googlebot would see. Also, in Webmaster Tools you can find 'Fetch as Google'. Use this to test the robots.txt and the different sitemaps...

    If all else fails, you can contact me directly on http://status301.net/contact-en/ to send me the URL of your site privately.

  7. RavanH
    Plugin Author

    Posted 1 year ago #

    By the way, the static robots.txt does not look very different from the dynamic one... It still does not explain why your sitemap urls would be blocked.

  8. paralyys
    Posted 1 year ago #

    Hey, here's the Google-fetched sitemap.xml from Webmaster Tools http://pastebin.com/KtDtMZsa and the robots.txt http://pastebin.com/Tx3LuaZE

    (I just didn't want to leave the domain up here, the pastebins will decay in a week)
    Maybe the subdomain setup is to blame? I'm really a bit lost here.

  9. RavanH
    Plugin Author

    Posted 1 year ago #

    No, the subdomain is no problem. But I cannot figure out what is...

    Funny thing: the first time I tried to access your sitemap, I got redirected to the English about page. Only after accessing the Estonian pages could I visit the sitemap. You can reproduce this issue by testing your /sitemap.xml via http://web-sniffer.net (for example), where you can see the response is:

    <title>302 Found</title>
    <p>The document has moved <a href="http://xxx.xxx.xx">here</a>.</p>
    <address>Apache / DataZone Server at xxx.xxx.xx Port 80</address>

    instead of the requested sitemap...
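
The same check can be scripted. Below is a minimal Python sketch that sends a single request without cookies and without following redirects, so a 302 like the one above shows up directly. The function names `probe` and `diagnose` are made up for illustration, and sending the Googlebot User-Agent header only imitates that one header; it is an assumption that the server treats such a request the way it treats the real crawler.

```python
import http.client
from urllib.parse import urlsplit

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def probe(url, user_agent=GOOGLEBOT_UA, timeout=10):
    """Send one GET with no cookies; http.client never follows redirects."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location"), resp.getheader("Content-Type")
    finally:
        conn.close()


def diagnose(status, location, content_type):
    """Summarise whether the response can be a sitemap at all."""
    if 300 <= status < 400:
        return f"redirect ({status}) to {location}"
    if status != 200:
        return f"error: HTTP {status}"
    if content_type and "xml" in content_type.lower():
        return "ok"
    return f"unexpected content type: {content_type}"
```

Calling `diagnose(*probe(url))` on the /sitemap.xml URL from a fresh session would report the redirect instead of silently landing on the about page.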

    I wonder if it is a particular setting in Polylang or if you set up a redirect manually? Or is it maybe the fact that there are NO posts in the English language? WordPress is known to behave badly (returning 404 on feeds for example) when there are no posts.

    Try disabling the language slug in post/page URLs and make the home page URL default to the / and /en/ locations. And maybe test the auto-detect visitor language option. Let me know if/when that changes anything :)

Topic Closed

This topic has been closed to new replies.
