• Resolved bbole

    (@bbole)


    Hi
    This is more a question of how to approach llms.txt implementation than about the plugin per se.
    Does the plugin generate a .md version of every single blog article?
    If so, how do we prevent non-AI bots such as Googlebot from also indexing these pages and serving them as URLs in results? Is there a risk of a penalty for duplicate content? Should we block these .md versions for those bots, or leave them out of the ‘normal’ sitemap (in our case generated automatically by Rank Math)?

    Also, let’s assume that Answer Engines do pick up the .md content and use it in their answers. They also add a citation link: would that link be the .md link listed in llms.txt? If so, a human reader could land on the less friendly .md version of a landing page rather than the HTML one.

    Thank you for clarifying this 🙂

    Have a good day
    Elena

Viewing 11 replies - 1 through 11 (of 11 total)
  • Plugin Author Ryan Howard

    (@ryhowa)

    Hi Elena,

    Google and other bots do not index llms.txt files or serve them in SERPs. We have a feature in the next release to monitor if/when AI search crawlers are accessing llms.txt files.

    Thread Starter bbole

    (@bbole)

    Hola Ryan,

    Thanks a lot for responding! How do we know 100% that Google’s and other ‘traditional’ bots won’t index such files?

    Also regarding this plugin: does it generate a markdown version of each article too?

    Thank YOU

    Thread Starter bbole

    (@bbole)

    Also @ryhowa, I am asking because https://beebole.com/blog/sitemap_index.xml is submitted to GSC, Bing Webmaster Tools, etc., and the new llms sitemap is listed in it.

    Thread Starter bbole

    (@bbole)

    Why is such a sitemap-llms created at all? It’s not part of the procedure suggested by Jeremy Howard at https://llmstxt.org/

    The llms.txt file is definitely being indexed by Google. I added it to 20 new websites I recently launched, and on most of them, a site: search shows the llms.txt file in the results. So, the risk of duplicate content is definitely real.

    Another thing I noticed this morning: when I publish a post, I need to clear the plugin cache for it to appear in the Rank Math sitemap.

    When I deactivate the plugin, I still see the ‘Website LLMs.txt’ sitemap listed in Rank Math, but it leads to a 404 error. Even if I remove the plugin, it stays there.

    • This reply was modified 5 months, 3 weeks ago by beeexquise.
    Plugin Author Ryan Howard

    (@ryhowa)

    @bbole llms.txt for generative engine optimization is experimental. The plugin does not generate md per post as in Jeremy’s proposal and we won’t be extending to support md until we have strong evidence that doing so is worth dev time.

    @beeexquise – Send me an example please. If you don’t want to post publicly, you can provide it here: https://www.websitellm.com/listing/

    Thread Starter bbole

    (@bbole)

    Most people have their sitemap index submitted to Google Search Console, etc. After activating this plugin, a new sitemap for llms is added to that sitemap index, when it shouldn’t be. That’s why it’s picked up by GSC and other webmaster tools.

    I have deactivated the plugin for that reason.

    • This reply was modified 5 months, 3 weeks ago by bbole.
    Plugin Author Ryan Howard

    (@ryhowa)

    Discoverability and indexation are different.

    Thread Starter bbole

    (@bbole)

    Absolutely. Discoverability refers to a crawler finding a file (e.g., via links, sitemaps, or direct access), while indexation is what allows a page to appear in search engine results. The llms.txt specification is designed specifically for large language models (LLMs), not for traditional search engine indexing. Including these files in the sitemap index makes them more discoverable to Googlebot, which increases the likelihood that it will crawl and index them as regular web pages.

    Since these files (especially .md versions of blog posts) often have content identical or near-identical to their HTML versions, Google may flag them as duplicate content. Without proper canonicalization, it may interpret the .md files as alternate versions of the same page. This could dilute our SEO rankings, confuse search algorithms, confuse people who click on such results, or lead to penalties in Google Search Console.
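    For anyone who wants to test the canonicalization idea, here is a minimal sketch of the extra HTTP response headers a server could attach to a plain-text or Markdown copy of a post. Google documents both the `rel="canonical"` `Link` header and the `X-Robots-Tag` header for non-HTML files; the function name `headers_for_text_copy` and the example URL are hypothetical, and nothing here reflects what the plugin itself does.

```python
def headers_for_text_copy(html_url: str, noindex: bool = False) -> dict[str, str]:
    """Return extra response headers for a text/Markdown copy of a page.

    Hypothetical helper: it picks ONE of two signals rather than mixing them.
    - default: point search engines at the HTML page as the canonical URL
    - noindex=True: ask search engines not to index the copy at all
    """
    if noindex:
        return {"X-Robots-Tag": "noindex"}
    return {"Link": f'<{html_url}>; rel="canonical"'}


# Example values only; example.com is a placeholder domain.
canonical_headers = headers_for_text_copy("https://example.com/blog/my-post/")
noindex_headers = headers_for_text_copy("https://example.com/blog/my-post/", noindex=True)
print(canonical_headers)
print(noindex_headers)
```

    Whether AI crawlers honor either header is not something this thread establishes, so treat it as something to test rather than a guarantee.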

    By excluding llms.txt, llms-full.txt, and related .md files from the sitemap index, we reduce the risk of traditional search engines indexing them, thereby minimizing their discoverability in search results.
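    To illustrate what excluding these entries from a sitemap index means, here is a rough sketch that drops any `<sitemap>` entry whose URL contains "llms" from a sitemap index document. The file name `llms-sitemap.xml`, the `marker` parameter, and the function `drop_llms_sitemaps` are assumptions made for the example; the real entry name generated on a given site may differ.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Sample sitemap index; both URLs are made-up placeholders.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/llms-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def drop_llms_sitemaps(index_xml: str, marker: str = "llms") -> str:
    """Return the sitemap index with entries whose <loc> contains `marker` removed."""
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(index_xml)
    # Copy the list first so removing children does not disturb iteration.
    for entry in list(root.findall(f"{{{SITEMAP_NS}}}sitemap")):
        loc = entry.findtext(f"{{{SITEMAP_NS}}}loc", default="")
        if marker in loc.lower():
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")


filtered = drop_llms_sitemaps(SAMPLE_INDEX)
print(filtered)
```

    In practice the exclusion would happen inside the sitemap generator rather than by post-processing XML; the sketch only shows the intended result.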

    This aligns with the llms.txt proposal, which intends these files to be accessed directly by LLMs or AI agents (e.g., via https://example.com/llms.txt) rather than surfaced in Google’s search results. LLMs don’t rely on sitemaps for discovery.

    So these steps:
    – Excluding the files from sitemap indexes
    – Testing robots.txt rules that allow specific AI crawlers while disallowing traditional crawlers
    – Testing canonicals

    would reduce the likelihood of anything described above happening. To me, the first step is very straightforward: the plugin shouldn’t modify the sitemap index at all. Risk mitigation.
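    The robots.txt test could be sketched like this, using Python’s standard `urllib.robotparser` to check a draft rule set before deploying it. The rules themselves are only an example: Googlebot, Bingbot and GPTBot are real crawler tokens, but which crawlers to allow or block is an assumption here, and `allowed` is a hypothetical helper. Note also that a robots.txt `Disallow` stops crawling, not necessarily indexing, so a blocked URL can still show up in results if it is linked elsewhere.

```python
from urllib.robotparser import RobotFileParser

# Draft rules (illustrative only): block two traditional crawlers from the
# llms files, explicitly allow one AI crawler, leave everything else open.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /llms.txt
Disallow: /llms-full.txt

User-agent: GPTBot
Allow: /llms.txt
Allow: /llms-full.txt

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the draft rules."""
    # example.com is a placeholder host; only the path matters for matching.
    return parser.can_fetch(agent, f"https://example.com{path}")


for agent in ("Googlebot", "Bingbot", "GPTBot"):
    print(agent, "/llms.txt ->", allowed(agent, "/llms.txt"))
print("Googlebot /blog/post/ ->", allowed("Googlebot", "/blog/post/"))
```

    The standard-library parser uses simple prefix matching, so results for wildcard rules may differ from how individual crawlers interpret them.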

    Here is an example where llms.txt is indexed: for site:https://odysseatia.com, Google says: “To show you the most relevant results, we have omitted some entries that are very similar to the 2 current entries. If you want, you can repeat the search to include the omitted results.” When you display the omitted results, you can see the llms.txt sitemap. It now returns a 404 because I deleted everything, but the site is filtered down to only 2 pages, after Google had initially indexed all of them. I’m not saying this is caused by your plugin, but there is a strong chance it is, given the duplicate content: it happened after I added the plugin and set it to include the full content of posts in llms.txt.

    I had 2 out of 20 sites that got hit with the 2-page filter, and 5 or 6 others showed the message: “To show you the most relevant results, we have omitted some entries that are very similar to the 2 current entries. If you want, you can repeat the search to include the omitted results.” In reality, there are more like 10 or 15 similar pages, not just 2 in those cases.

    Plugin Author Ryan Howard

    (@ryhowa)

    Okay, so Google does index llms.txt.

    In the next release, we’ll remove llms.txt and related files from sitemaps by default. We’ll retain an option to include them manually and add a clear warning about full posts and duplicate content potential.

    Thank you both for flagging this.
