Add RFC5005 support to WordPress-generated RSS feeds

  • Jamey Sharp

    (@jameysharp)



    RFC5005, “Feed Paging and Archiving”, is a standard from 2007 for publishing the complete history of an RSS (or Atom) feed. I would like WordPress to make full-history feeds available by default. I’ll explain why and then give some initial suggestions on implementation details.

    Many people use WordPress to publish long-form serialized works like webcomics or serial stories, where reading the complete history of the work is necessary for understanding the newest posts. Readers of these works currently have to start by reading through the archive on the origin web site, and can only switch to their preferred RSS feed reading tools when they reach the end of what’s posted to date. Making the full history available in the RSS feed allows readers to consistently use the tools they prefer.

    This isn’t as important for blog or news sites, where usually only the new content is relevant and older content is discovered through search or direct links. But I believe making full history available to third-party search tools and such is useful in that context too.

    In the absence of a standard mechanism for feed history, people build all sorts of web crawlers to extract the information they want. (For example, a decade ago I built the web crawler that Comic Rocket uses to index about 10,000 sites.) Publishing full history using RFC5005 is better because it allows the publisher to decide things like whether to include full text or only a summary in the feed. Webcomic creators have long fought against having their comics scraped out of their sites because they need their ad revenue and want control over the presentation of their work. (So Comic Rocket is very conservative about what we index.) Supporting RFC5005 strikes a good balance between giving creators that control and letting readers use tools that meet their needs. Consuming RSS feeds is easier than writing web crawlers, so making this history available in a standard form will encourage people to respect publishers’ wishes.

    The easy part of implementing RFC5005 would be to support section 2, “Complete Feeds”. If WordPress generates an RSS feed that is not truncated (that is, there are fewer posts than the configured limit on syndication feed size), then it should add an <fh:complete/> tag and the corresponding xmlns:fh="http://purl.org/syndication/history/1.0" namespace declaration.

    That change is enough for feed readers to recognize that if an entry disappears from a small feed, it’s because that entry has been deleted, not because it scrolled off the end of the feed. This is useful for those feed readers that save old entries.
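
    To make the “Complete Feeds” rule concrete, here is a minimal sketch. It is written in Python rather than PHP only so that it runs outside WordPress; it is not plugin code. The build_feed function, its parameters, and the “Example feed” title are made up for illustration, and treating a feed that exactly fills the limit as untruncated is an assumption on my part.

```python
import xml.etree.ElementTree as ET

# Namespace given in the post above for the RFC5005 feed-history elements.
FH_NS = "http://purl.org/syndication/history/1.0"
ET.register_namespace("fh", FH_NS)


def build_feed(titles, total_posts, feed_limit):
    """Return RSS 2.0 text for `titles` (hypothetical helper).

    The feed is marked complete only when every post on the site fits
    within the configured feed size, i.e. nothing was cut off.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example feed"
    if total_posts <= feed_limit:
        # Serialized as <fh:complete />, with xmlns:fh declared on <rss>.
        ET.SubElement(channel, "{%s}complete" % FH_NS)
    for title in titles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
    return ET.tostring(rss, encoding="unicode")


print(build_feed(["First post", "Second post"], total_posts=2, feed_limit=10))
```

    A site with more posts than the limit would get the same output minus the fh:complete element and its namespace declaration.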

    When there are more posts than the configured limit on feed size, then section 4, “Archived Feeds”, comes into play instead. In this case, the primary RSS feed contains exactly the same contents it does today, with only the most recent N posts. However, it adds a <link rel="prev-archive"> tag to point to another feed containing some older posts. That in turn may point to a third feed, and so on. So feed readers that don’t care about RFC5005 work exactly as they do today, fetching only the most recent entries and ignoring the archive link.
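
    From the consumer’s side, following the archive chain might look roughly like the sketch below (Python, purely illustrative). I am assuming the link is written as an Atom link element inside the RSS channel; the URLs, the rss helper, and the SITE dictionary standing in for HTTP fetches are all invented for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def parse_feed(feed_xml):
    """Return (item titles, prev-archive URL or None) for one RSS document."""
    channel = ET.fromstring(feed_xml).find("channel")
    titles = [item.findtext("title") for item in channel.findall("item")]
    prev_url = None
    for link in channel.findall("{%s}link" % ATOM_NS):
        if link.get("rel") == "prev-archive":
            prev_url = link.get("href")
    return titles, prev_url


def full_history(start_url, fetch):
    """Follow prev-archive links from the main feed back to the oldest archive.

    Returns one list of titles per document, oldest document first.
    """
    pages = []
    seen = set()
    url = start_url
    while url is not None and url not in seen:  # guard against link loops
        seen.add(url)
        titles, url = parse_feed(fetch(url))
        pages.append(titles)
    pages.reverse()
    return pages


def rss(titles, prev=None):
    """Build a tiny RSS document for the demonstration (hypothetical helper)."""
    link = '<atom:link rel="prev-archive" href="%s"/>' % prev if prev else ""
    items = "".join("<item><title>%s</title></item>" % t for t in titles)
    return ('<rss version="2.0" xmlns:atom="%s"><channel>%s%s</channel></rss>'
            % (ATOM_NS, link, items))


# Stand-in for the network: maps made-up URLs to feed documents.
SITE = {
    "https://example.com/feed/": rss(["post 5"], "https://example.com/archive/2"),
    "https://example.com/archive/2": rss(["post 3", "post 4"], "https://example.com/archive/1"),
    "https://example.com/archive/1": rss(["post 1", "post 2"]),
}

print(full_history("https://example.com/feed/", SITE.__getitem__))
```

    A reader that ignores RFC5005 would simply stop after the first document, which is the behavior it has today.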

    There are some tricky details because the archived feed at a given URL is not supposed to change in any “meaningful” way. It’s specified that way so that feed consumers can cache archive pages forever and avoid sending even conditional re-validation requests to the origin server. This means archived feeds can’t just correspond to the monthly posts view, in case a post is edited, deleted, or inserted deep in the archives; but it’s a huge efficiency win in the common case where the archives don’t change. If the WordPress team is open to supporting this standard, I have put a lot of thought into ways feed publishers can implement this specification correctly and I’d love to have a more extensive conversation about it.

    I’m open to providing patches for this feature, given some guidance about what kind of implementation would be accepted and some advice on which part of the codebase it should go in. I haven’t written any significant PHP since the late ’90s but I can probably figure something out…

Viewing 5 replies - 1 through 5 (of 5 total)
  • Moderator Samuel Wood (Otto)

    (@otto42)

    WordPress.org Admin

    While WordPress does not currently implement this, it would be fairly trivial to do something like this with a plugin.

    You would use the rss2_head action to insert the relevant <link rel="prev-archive"> tags into the header section. Similarly, if you need to add to the namespaces, then the rss2_ns action can be used to do that.

    As for displaying a “complete” feed, well, that is unlikely to be in core for the simple reason that it doesn’t scale well. If you run a blog with a hundred posts, fine. If you have thousands, or millions, then that suddenly becomes a problem.

    Generally speaking, WordPress does not “stream” post content in the manner you might be thinking of. When it wants to build a page of multiple posts, then it fetches them from the database into memory, then creates the page as a whole by looping through them and writing the HTML for each post. More or less. In this respect, feeds are no different. The feed is really just another view of the posts. So pulling all the posts into memory wouldn’t work for large sites. You could rewrite it as a streaming system, pull one post, output it, repeat, but then you run into database query issues. WordPress’s database abstraction layer is 15 years old. It lacks that capability, basically. So, while a complete feed is possible, it’s not very easy to do properly.

    A link to previous posts, on the other hand, is very simple by comparison. WordPress’s query logic already supports “paging”. So, if you altered the feed rewrites to support pages, then those parameters could be easily passed into the main WP_Query and you would essentially get feed archives for free, more or less. You’d need to add a new rewrite rule and some minor logic. So… that would take no more than about 20-30 lines of code to add. More if you wanted to class it all up as a proper plugin and such, but the basic logic is simple. This would also be much more likely to be added to core, since it is a simple improvement with very little downside. You should probably make a core ticket for it.

    Jamey Sharp

    (@jameysharp)

    This is very helpful, thank you! Those hook names gave me a good place to start browsing the source code.

    I’d like to ask a couple more questions before opening a ticket.

    First I want to clarify one point: I meant the “complete feed” case would only apply if the total number of posts is already less than the number of posts per feed page. I’m not proposing to generate giant feed documents with thousands of entries in them.

    I think this means adding the <fh:complete/> tag iff $wp_query->found_posts <= get_query_var('posts_per_page'). Does that look right? It seems like directly accessing the $wp_query global is somewhat discouraged, but found_posts isn’t available any other way, right?

    Second: It’d be nice to re-use the existing paging logic in order to build “archived feeds”, but it’s a little tricky to get right, so I’d like to run some thoughts by you.

    As I mentioned before, if we have an archive feed URL like ?feed=feed&paged=3, the spec says the contents at that URL should not change in any “meaningful” way. So for one thing, we should add &order=ASC so that the offset is relative to the beginning of time, not the always-changing now. With that change, this is almost good enough.
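
    A toy illustration of why the sort order matters, in Python just so it can be run anywhere. The page helper and the post names are invented; this only models the offset arithmetic, not WP_Query itself.

```python
def page(posts_oldest_first, page_number, per_page, ascending):
    """Return one 1-indexed page, counting from the oldest or the newest end."""
    ordered = posts_oldest_first if ascending else posts_oldest_first[::-1]
    start = (page_number - 1) * per_page
    return ordered[start:start + per_page]


posts = ["p1", "p2", "p3", "p4", "p5"]  # p1 is the oldest post

desc_before = page(posts, 2, 2, ascending=False)
asc_before = page(posts, 2, 2, ascending=True)

posts.append("p6")  # someone publishes a new post

desc_after = page(posts, 2, 2, ascending=False)
asc_after = page(posts, 2, 2, ascending=True)

# Newest-first paging shifts every page when a post is added...
print(desc_before, "->", desc_after)
# ...while oldest-first paging leaves already-full pages untouched.
print(asc_before, "->", asc_after)
```

    Only the last, partially filled page changes under ascending order, and that is the part served by the main feed anyway.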

    Unfortunately it still breaks if someone edits, inserts, or deletes an old post. So we need something in the query string that we can change when a portion of the post history changes. (This is the same problem as cache-busting for CSS that’s been served with far-future expires headers.)

    One way is to store a change sequence number for every post. Every time the delete_post or save_post action fires, first increment a site-wide counter. Then, for every post that has the same or newer post_date, set a sequence number for that post equal to the new value of the counter. Finally, when generating a <link rel="prev-archive"> tag, look up the sequence number for the newest post in the target page, and add it to the query string.

    So the final query might look like ?feed=feed&order=ASC&paged=3&change_sequence=47, where the last number changes whenever anything in pages 1-3 changes.
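
    Here is a rough in-memory model of that bookkeeping, again in Python so it runs without WordPress. The Site and Post classes are hypothetical stand-ins for post meta and a site option, the method names merely echo the save_post and delete_post actions, and the sketch assumes a saved post keeps its original post_date.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: int
    post_date: str            # ISO date, so string comparison orders correctly
    change_sequence: int = 0  # would live in post meta in a real plugin


class Site:
    def __init__(self, per_page):
        self.per_page = per_page
        self.counter = 0      # site-wide change counter
        self.posts = {}

    def _touch_from(self, post_date):
        """Bump the counter and stamp every post dated `post_date` or later."""
        self.counter += 1
        for post in self.posts.values():
            if post.post_date >= post_date:
                post.change_sequence = self.counter

    def save_post(self, post_id, post_date):
        self.posts[post_id] = Post(post_id, post_date)
        self._touch_from(post_date)

    def delete_post(self, post_id):
        removed = self.posts.pop(post_id)
        self._touch_from(removed.post_date)

    def archive_url(self, page_number):
        """Query string for an archive page, keyed by its newest post's stamp.

        Raises IndexError if the requested page has no posts.
        """
        ordered = sorted(self.posts.values(),
                         key=lambda p: (p.post_date, p.post_id))
        start = (page_number - 1) * self.per_page
        newest = ordered[start:start + self.per_page][-1]
        return ("?feed=feed&order=ASC&paged=%d&change_sequence=%d"
                % (page_number, newest.change_sequence))


site = Site(per_page=2)
for n in range(1, 6):
    site.save_post(n, "2020-01-0%d" % n)

print(site.archive_url(1), site.archive_url(2))
site.save_post(3, "2020-01-03")  # edit a post that sits on page 2
print(site.archive_url(1), site.archive_url(2))
```

    Editing post 3 changes the URL for page 2 but leaves page 1’s URL alone, so a client’s cached copy of page 1 stays valid.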

    My questions are:

    1. there isn’t already a change sequence number for posts, is there?
    2. does it make sense to store this sequence number in a hidden custom field using update_post_meta, or is there a better way to store it?
    3. is there a better way to store the site-wide change count than update_site_option?

    Thanks again for your advice! I almost feel like I could write the patch I want now.

    Jamey Sharp

    (@jameysharp)

    As a follow-up: I’ve written a simple plugin that appears to work correctly so long as you don’t edit or delete old posts. Your suggestions were a huge help in making that possible for me, so thank you! If you have any questions about things I’ve said so far, perhaps this implementation and its documentation will be more clear…

    https://github.com/jameysharp/wp-fullhistory

    I would still like to see this in core as this is a standard that ought to be supported pretty much everywhere that RSS and Atom are, but I hoped having a working proof of concept might help. I guess opening a ticket on core is still the next thing I should do, huh?

    Moderator Samuel Wood (Otto)

    (@otto42)

    WordPress.org Admin

    There’s not much chance of implementing such a non-changing feed link for archives. The content does change, and WordPress does not retain old content if it’s removed. There is the revision system, but that’s not reliable, as many sites (mine included) turn it off for space and speed considerations.

    So, you’re not going to be able to create a case where changes cause your feed view to update. Not in any reliable way. WordPress doesn’t work that way; it stores a dynamic site with all views generated on the fly. Anything can change, anywhere on the site.

    Jamey Sharp

    (@jameysharp)

    I think I’ve done what I wanted, so I’m pretty sure what I want is possible…

    As of the new version I’ve just pushed here:

    https://github.com/jameysharp/wp-fullhistory

    If you save or delete a post, or change the ‘posts_per_rss’ option, and then reload /feed/, the “prev-archive” link will point to a different URL, so RFC5005-compliant clients are guaranteed to re-fetch the archived feeds that have changed.

    I’d love to get feedback on whether there’s anything worrisome in this implementation, and whether it might be plausible to go in core.
