
Backend: RFC 5005 feed history support (RSS backfill) #1109

Closed
jameysharp opened this issue Jun 17, 2018 · 14 comments
@jameysharp

One long-standing flaw in using RSS/Atom to read long-form works like webcomics and fanfiction is that feeds usually contain only a limited number of the most recent entries, so you have to catch up by reading the site directly and then switch tools to get notified about new posts. In my experience it's much nicer to use a single tool to keep track of how much of the story I've already read, no matter how far back in the history I leave off. A decade ago I built Comic Rocket as a proprietary tool to do exactly that, but now I'm hoping to advocate for a more standards-based approach.

It looks like NewsBlur saves old feed entries even after they disappear from the origin feed, but this doesn't reliably solve the problem, especially since creators often insert/edit/delete old pages and there's no way to detect that the cached feed entries are no longer valid.

RFC 5005, "Feed Paging and Archiving", addresses this problem, and I'd like to encourage people to adopt it. It was standardized in 2007, but seems to have languished in obscurity. I'm not aware of any publishers using it today aside from people I've personally encouraged, although it'd be fascinating to find out whether any feeds that newsblur.com has seen either use the http://purl.org/syndication/history/1.0 XML namespace, or contain <link rel="prev-archive">. There's something of a catch-22 here since publishers don't have much incentive to implement the spec if feed readers don't understand it, and vice versa.

That said, I'm working on various tools to generate full-history feeds by crawling arbitrary sites, as a transitional measure. So I'm hoping to find a project like NewsBlur that's willing to be an early adopter for the reader side of the spec.

The spec is nice in that conforming feeds are still usable by feed readers that don't understand the RFC 5005 metadata, but readers that do can save the complete history of all entries and efficiently discover changes to archived entries.

If you want to implement RFC 5005, I think the easiest first step is to check each feed for the <fh:complete/> tag specified in section 2, "Complete Feeds". If present, then you can delete all entries which you previously saved from that feed if they're no longer present in the current version of the feed. This is a very simple solution for feeds that don't have much history.
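The check itself is tiny. Here's a sketch in Python using only the standard library (the function name is just illustrative, not anything from NewsBlur's code base):

import xml.etree.ElementTree as ET

FH_COMPLETE = ".//{http://purl.org/syndication/history/1.0}complete"

def is_complete_feed(feed_xml):
    # RFC 5005 section 2: <fh:complete/> appears under atom:feed, or under
    # <channel> in RSS 2.0; searching the whole tree covers both cases.
    return ET.fromstring(feed_xml).find(FH_COMPLETE) is not None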

For feeds that would be excessively large if the publisher put the full history in one feed document, there's section 4, "Archived Feeds". To implement this, you'd check for <link rel="prev-archive"> in each feed, and concatenate the linked feed's entries, following further prev-archive links until there aren't any more. You can cache an archive feed from a given URL forever:

> The requirement that archive documents be stable allows clients to safely assume that if they have retrieved one in the past, it will not meaningfully change in the future. As a result, if an archive document's contents are changed, some clients may not become aware of the changes.

I expect some conforming publishers will change an archive feed's URL if they need to update archived entries (although this is arguably discouraged in the spec). So you'd want to ensure that you can detect that a previously-seen archive feed is no longer in the chain of prev-archive links, and delete any entries that don't appear in the rest of the feed. I imagine the easiest way to do that is to reconstruct the feed history from scratch on every update, but I can imagine other alternatives.
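To make that concrete, here's a rough sketch of the crawl using feedparser; archive_cache is a hypothetical dict-like store of previously fetched archive documents, and real code would also want error handling and a bound on chain length:

import feedparser

def crawl_history(feed_url, archive_cache):
    # archive_cache: hypothetical dict mapping archive URL -> parsed doc.
    # RFC 5005 lets archive documents be cached forever once fetched.
    entries = []
    chain = []  # archive URLs encountered in this crawl, newest first
    url = feed_url
    while url is not None and url not in chain:
        doc = archive_cache.get(url)
        if doc is None:
            doc = feedparser.parse(url)
            if url != feed_url:  # the subscription document itself changes
                archive_cache[url] = doc
        chain.append(url)
        entries.extend(doc.entries)
        # Follow rel="prev-archive" to the next-older document, if any.
        url = next((link["href"] for link in doc.feed.get("links", [])
                    if link.get("rel") == "prev-archive"), None)
    # Cached archives that dropped out of the chain were replaced by the
    # publisher; entries seen only in them need to be reconciled or deleted.
    stale = set(archive_cache) - set(chain)
    return entries, stale

This rebuilds the chain from the newest document backwards on every update (the reconstruct-from-scratch approach), but the cache keeps the re-walk cheap: only archive documents you haven't seen before are actually fetched.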

That, plus the duplicate-detection and UI recommendations in section 4.2, should be everything you need to know about this part of the standard, I think.
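For the duplicate detection, something like this is probably enough. It's my simplified reading of section 4.2 (when the same entry id shows up in more than one document, prefer the most recently updated copy), written against feedparser's field names:

import time

def merge_duplicates(entries):
    def updated(entry):
        parsed = entry.get("updated_parsed")  # feedparser's parsed timestamp
        return time.mktime(parsed) if parsed else 0.0
    best = {}
    for entry in entries:
        key = entry.get("id") or entry.get("link")
        if key not in best or updated(entry) > updated(best[key]):
            best[key] = entry
    return list(best.values())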

(I skipped section 3, "Paged Feeds", because I don't think it's relevant to a NewsBlur-style feed reader, but maybe there's a good use case for it that I haven't thought of.)

I might be able to put together a pull request for this, if it doesn't sound like wasted effort and if someone can advise me on how this might fit into the current code base. What do you think?

@samuelclay (Owner) commented Jul 12, 2018

Hey, I really love this idea, and I read the thread of your work on Jekyll two days ago (smart thinking to start that PR first), but I'm not sure NewsBlur is going to be a good fit, for one reason: the cost of archiving. NewsBlur today supports at most the 500 most recent stories per feed. It actively deletes stories over that threshold, not including shared and saved stories.

It's enormously expensive to host all that content, so I unfortunately have to trim it regularly. In fact, if I were to do the math, I think cleaning the archive consumes between 10% and 20% of my feed fetcher's time.

Now, if people run their own NewsBlur instance, there's an easy one-line change to boost that number to virtually unlimited, but the main hosted instance will have to keep that limit for performance and cost reasons. I wish I were as big as even a small chunk of Google and had the resources to archive the web, but NewsBlur pulls tens of millions of websites, and all those stories have to go somewhere.

Plus, I immediately found a few "rss-bombs" that constantly publish huge globs of randomized data. Filling out an archive isn't a high-priority request from users, so I don't have the financial incentive I'd need to support it. I wish it weren't so.

@dosiecki (Contributor)

A bit of a wild idea, but could there be a way to crowdfund the long-term storage of feed content? That is to say, any user could pay a few dollars to extend the storage for a given feed by an extra few thousand stories for a year or two. Popular feeds could attract multiple supporters and gain virtually unlimited storage. Users with a particular interest in an obscure feed could personally "archive" it by supporting it alone. I'd do this for certain feeds for sure!

@jameysharp (Author)

Thank you for this thoughtful response! I hadn't considered storage costs on your end; I will now keep that in mind as I talk with other folks and work on my own implementations.

I'd still like to encourage implementing at least RFC 5005 section 2, which would let you delete items as soon as they disappear from a conforming feed. That can only save you storage, right?
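Concretely, that cleanup is just a set difference on each fetch; in this sketch, store is a hypothetical persistence layer whose entry_ids returns a set:

def sync_complete_feed(store, feed_id, current_entry_ids):
    # For a feed carrying <fh:complete/>, anything saved earlier that is
    # absent from the current document no longer exists as far as the
    # publisher is concerned, so it can be deleted.
    for entry_id in store.entry_ids(feed_id) - set(current_entry_ids):
        store.delete_entry(feed_id, entry_id)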

I do wonder if you could support section 4 by lazily loading feed pages from the origin server as the user browses through the history, treating your storage as purely a cache for such feeds. That might even allow you to be more aggressive about discarding items from your cache? I feel a little overwhelmed just imagining implementing that, but I thought I'd throw the idea out there anyway.

In case you revisit this in the future, I'll mention that in addition to jekyll/jekyll-feed#236 which you already saw, I've also built https://fh.minilop.net/ and https://github.com/jameysharp/wp-fullhistory as two other implementations of sections 2 and 4 of RFC5005.

@mockdeep

This would be amazing! @samuelclay not sure how much it would cost for something like that per user, but I'd be willing to pay for a higher cost tier if this was available. Maybe some of the cost could be limited by not making it the default behavior, but allowing users to "download feed backlog" somewhere for individual feeds.

@samuelclay (Owner)

Pinging back on this thread by way of jekyll/jekyll-feed#236, as I'm about to launch this feature. So, good news: this is now high priority and well on its way to the public.

@samuelclay changed the title from "Wishlist: RFC 5005 feed history support" to "Backend: RFC 5005 feed history support (RSS back fill)" on Apr 18, 2022
@jameysharp (Author)

Cool! Anything I can do to help?

You might like to take a look at this prototype feed reader I built in 2020 (https://github.com/jameysharp/crawl-rss), which demonstrates an algorithm and database schema that made sense to me, and includes a bunch of unit tests covering different kinds of edits publishers might make to their feed history. I licensed it under the AGPLv3, but if anyone wants to reuse any of the tests, just go for it.

In the jekyll-feed issue you mentioned "joining WordPress in automatically enabling feed paging"; do you know something I don't? To the best of my knowledge, the only discussion that's ever happened about this on the WordPress side is the discussion thread and issue that I opened; the latter never got any response at all.

That said, my wp-fullhistory plugin still seems to work; it's running at https://news.comic-rocket.com for example. In addition, my crawl-rss prototype feed reader demonstrates using a WordPress-specific stateless proxy I wrote (https://github.com/jameysharp/wp-5005-proxy) that synthesizes a full-history feed for any existing WordPress install, without needing cooperation from the publisher's side.

While RFC5005 support remains sparse among publishers right now, I think delegating to specialized HTTP proxies like wp-5005-proxy could be a good way to adapt many sites which don't already support RFC5005. It makes for a pretty simple "plugin API", in my opinion. Perhaps you'd like to do something similar as well?

@samuelclay (Owner)

> In the jekyll-feed issue you mentioned "joining WordPress in automatically enabling feed paging"; do you know something I don't?

I'm still doing development work, so I don't have real numbers yet (although this query will inform which numbers I surface), but in testing a subset of my own feeds, I went from NewsBlur's limit of 26,225 stories to 49,263. And that's with a page limit of 100, which I will probably boost to 500.

This works by adding ?page=N and ?paged=N to the feed URL and seeing whether the stories come out differently across pages 1..3; if they do, I keep going until no new stories are seen. That's worked for a ton of feeds, so I assumed it was all WordPress, but it's possible it's not. It might just be the result of a few high-volume feeds, in which case the numbers I need to pull are how many feeds go beyond page 4 (in other words, support paging) and how many stories are found per archive feed.
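Roughly, the probe looks like this (a simplified reconstruction of what I described, not the actual NewsBlur code; the id fallback and stopping rule are simplified):

import feedparser

def probe_paged_feed(feed_url, param="page", max_pages=100):
    # WordPress-style pagination heuristic: request ?page=N (or ?paged=N)
    # and keep going as long as each page yields stories we haven't seen.
    seen = set()
    sep = "&" if "?" in feed_url else "?"
    for page in range(1, max_pages + 1):
        doc = feedparser.parse("%s%s%s=%d" % (feed_url, sep, param, page))
        ids = {e.get("id") or e.get("link") for e in doc.entries}
        if not ids - seen:
            break  # nothing new: paging unsupported, or history exhausted
        seen |= ids
    return seen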

@samuelclay (Owner)

Funny enough, after looking through RFC 5005 I realize I'm not following it at all, so I'll try to implement that behavior as well today.

@samuelclay samuelclay reopened this Apr 19, 2022
@jameysharp (Author) commented Apr 19, 2022 via email

samuelclay added a commit that referenced this issue Apr 19, 2022
@samuelclay (Owner)

> Then the question is, once you've fetched all the history, how do you detect further changes in the history?

I periodically force a re-fetch of the entire history. So changes eventually make their way in but not immediately.

> Although it may not be relevant for your use case, the RFC also enables feed readers to treat their local copy of the feed purely as a cache.

I noticed this, but you're right: I'm not deleting stories that are removed from a publisher's archive. Publishers will sometimes email me to ask, and I'm always happy to remove stories that way. But if a user saved or shared a story, that saved or shared copy continues to exist.

> I hope this has helped clarify why I think this standard is important if you want feed history.

Agreed that this would be the ideal world, but the reality, I think, is that the page and paged parameters won out, and it's up to feed readers to do the dirty work of distinguishing changes and staying up to date, as was the case before PubSubHubbub.

@samuelclay changed the title from "Backend: RFC 5005 feed history support (RSS back fill)" to "Backend: RFC 5005 feed history support (RSS backfill)" on Apr 20, 2022
@jameysharp (Author) commented Apr 21, 2022 via email

@samuelclay (Owner)

@jameysharp Ok, I've just about implemented it, but there's an issue coming from the test server you put up. I start with the first URL, and each successive URL is that feed's prev-archive link. Notice the protocols and ports: redirects aren't working as I would expect.

>>> import feedparser
>>> from pprint import pprint
>>> pprint(feedparser.parse("https://fh.minilop.net/7/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d").feed.links)
[{'href': 'http://fh.minilop.net/e/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'alternate',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/f/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'current',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/8/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'next-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/6/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'prev-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/7/America%2BNew_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'self',
  'type': 'application/rss+xml'}]
>>> pprint(feedparser.parse("http://fh.minilop.net/6/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d").feed.links)
[{'href': 'http://fh.minilop.net:443/e/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'alternate',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/f/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'current',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/7/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'next-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/5/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'prev-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/6/America%2BNew_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'self',
  'type': 'application/rss+xml'}]
>>> pprint(feedparser.parse("http://fh.minilop.net:443/5/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d"))
{'bozo': 1,
 'bozo_exception': SAXParseException('mismatched tag'),
 'encoding': 'us-ascii',
 'entries': [],
 'feed': {'summary': '<center><h1>400 Bad Request</h1></center>\n'
                     '<center>The plain HTTP request was sent to HTTPS '
                     'port</center>\n'
                     '<hr /><center>nginx</center>'},
 'headers': {'connection': 'close',
             'content-length': '264',
             'content-type': 'text/html',
             'date': 'Fri, 29 Apr 2022 15:28:14 GMT',
             'server': 'nginx',
             'strict-transport-security': 'max-age=15724800; '
                                          'includeSubdomains'},
 'href': 'http://fh.minilop.net:443/5/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
 'namespaces': {},
 'status': 400,
 'version': ''}

samuelclay added a commit that referenced this issue Apr 29, 2022
…mands and adding checks for archive subscribers.
@jameysharp (Author)

Ugh, my personal server is misconfigured, I guess. Try these:

Also, here are a couple of examples of feeds which aren't paginated but which include the fh:complete tag from RFC5005 section 2, in case you want to do anything with that:

@samuelclay (Owner)

Looking good!

[screenshot: Screen Shot 2022-04-29 at 3 22 50 PM]
