Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ID generator v0.2 on builder #223

Merged
merged 1 commit into from
May 5, 2024
Merged

Conversation

markpritchard
Copy link
Contributor

To avoid leaking internal logic, the v0.2 ID generator is now a method on the parser builder.

To avoid leaking internal logic, the v0.2 ID generator is now a method
on the parser builder.
@markpritchard markpritchard merged commit 343ce24 into master May 5, 2024
5 checks passed
@markpritchard markpritchard deleted the legacy-id-refactor branch May 5, 2024 00:56
}
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is a good start and in my test dataset it correctly classifies aproblemsquared.feeds benjojo.feeds czwórka-kulturka.feeds delan.feeds epilys.feeds hannahlyse-this.feeds isis.feeds julia_evans.feeds kate_murphy.feeds kdist.feeds kdist.xml.feeds ken_schriff.feeds Kia.feeds lateral.feeds Pirate.feeds poptart.feeds rabbit-hole.feeds rustlang-blog.feeds RUSTSEC.feeds stapelberg-debian.feeds stapelberg-distri.feeds The6P4C.feeds ThePhD.feeds whitequark.feeds наб.feeds

But this still incorrectly classifies 2-girls-1-podcast.ids cpp-grey.ids eater.ids GregDavill.ids magan-he-danai.ids marco.ids mia-kreanikacji.ids rae-the-doe.ids tamen-de-gushi.ids tef.ids

These tests come from https://github.com/nabijaczleweli/feembox/tree/trunk/test-data/feed.ids as driven by https://github.com/nabijaczleweli/feembox/blob/07a90ffe32a69822d9d9681c911b5a961f1bf57e/tests/util/mod.rs#L47 (but s/parse_feed/FeedBuilder::new().id_generator_v0_2().build().parse/)

At least in some cases (

generate_id_from_link_and_title="https://www.dailydot.com/tags/2-girls-1-podcast" Some("2 Girls 1 Podcast")
generate_id_from_link_and_title="http://morwenn.github.io/cpp-gray" Some("cpp-gray")
generate_id_from_link_and_title="https://blog.eater.me" Some("eater")
generate_id_from_link_and_title="https://gregdavill.com" Some("Greg Davill's Projects")

) this is fixed by doing

    let mut hr = link.href.clone();
    hr.push('/');
    hasher.write(hr.as_bytes());

in generate_id_from_link_and_title().
This obviously breaks everything else, but makes it obvious that it wants more complicated logic that matches this condition in 0.2: https://github.com/feed-rs/feed-rs/pull/189/files#diff-902de347eed4eef0831eb024307ec9eb55dab63e247d0e8f7ba2f016848128baR638

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, naturally it's because this implementation re-parses Link::href. So an input of https://blog.benjojo.co.uk becomes href=https://blog.benjojo.co.uk/ as part of links given to id_generator, so then re-parsing it in id_generator_v0_2() sees "https://blog.benjojo.co.uk/" as both the href and uri result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not pretty, but we need to inject the bool into every Link::new invocation in the parser, so this is the simplest one:

diff --git a/feed-rs/src/model.rs b/feed-rs/src/model.rs
index 3756d83..b337896 100644
--- a/feed-rs/src/model.rs
+++ b/feed-rs/src/model.rs
@@ -1,4 +1,5 @@
 use std::time::Duration;
+use std::cell::Cell;
 
 use chrono::{DateTime, Utc};
 use mime::Mime;
@@ -628,11 +629,20 @@ pub struct Link {
     /// The length of the resource, in bytes.
     pub length: Option<u64>,
 }
+thread_local! {
+    pub(crate) static LINK_V0_2: Cell<bool> = Cell::new(false);
+}
 
 impl Link {
     pub(crate) fn new<S: AsRef<str>>(href: S, base: Option<&Url>) -> Link {
         let href = match util::parse_uri(href.as_ref(), base) {
-            Some(uri) => uri.to_string(),
+            Some(uri) => {
+                let mut ret: String = uri.into();
+                if LINK_V0_2.get() && ret.ends_with('/') && !href.as_ref().ends_with('/') {
+                    ret.pop();
+                }
+                ret
+            },
             None => href.as_ref().to_string(),
         }
         .trim()
diff --git a/feed-rs/src/parser/mod.rs b/feed-rs/src/parser/mod.rs
index 26200d7..8fd2ca0 100644
--- a/feed-rs/src/parser/mod.rs
+++ b/feed-rs/src/parser/mod.rs
@@ -11,6 +11,7 @@ use crate::model;
 use crate::parser::util::{IdGenerator, TimestampParser};
 use crate::xml;
 use crate::xml::NS;
+use crate::model::LINK_V0_2;
 
 mod atom;
 mod json;
@@ -106,6 +107,7 @@ pub struct Parser {
     base_uri: Option<String>,
     id_generator: Box<IdGenerator>,
     timestamp_parser: Box<TimestampParser>,
+    link_v0_2: bool,
 }
 
 impl Parser {
@@ -143,6 +145,7 @@ impl Parser {
         // Buffer the reader for performance (e.g. when streaming from a network) and so we can peek to determine the type of content
         let mut input = BufReader::new(source);
 
+        LINK_V0_2.set(self.link_v0_2);
         // Determine whether this is XML or JSON and call the appropriate parser
         input.fill_buf()?;
         let first_char = input.buffer().iter().find(|b| **b == b'<' || **b == b'{').map(|b| *b as char);
@@ -153,6 +156,7 @@ impl Parser {
 
             _ => Err(ParseFeedError::ParseError(ParseErrorKind::NoFeedRoot)),
         };
+        LINK_V0_2.set(false);
 
         // Post processing as required
         if let Ok(mut feed) = result {
@@ -226,6 +230,7 @@ pub struct Builder {
     base_uri: Option<String>,
     id_generator: Box<IdGenerator>,
     timestamp_parser: Box<TimestampParser>,
+    link_v0_2: bool,
 }
 
 impl Builder {
@@ -246,6 +251,7 @@ impl Builder {
             base_uri: self.base_uri,
             id_generator: self.id_generator,
             timestamp_parser: self.timestamp_parser,
+            link_v0_2: self.link_v0_2,
         }
     }
 
@@ -259,16 +265,10 @@ impl Builder {
     }
 
     /// Registers an ID generator compatible with v0.2 of feed-rs
-    pub fn id_generator_v0_2(self) -> Self {
+    pub fn id_generator_v0_2(mut self) -> Self {
+        self.link_v0_2 = true;
         self.id_generator(|links, title, _uri| {
-            // If we have a link without relative components, use that
             if let Some(link) = links.iter().find(|l| l.rel.is_none()) {
-                // Trim the trailing slash if it exists
-                let mut link = model::Link::new(link.href.clone(), None);
-                if link.href.ends_with('/') {
-                    link.href.pop();
-                }
-
                 generate_id_from_link_and_title(&link, title)
             } else {
                 util::uuid_gen()
@@ -293,6 +293,7 @@ impl Default for Builder {
             base_uri: None,
             id_generator: Box::new(generate_id),
             timestamp_parser: Box::new(util::parse_timestamp_lenient),
+            link_v0_2: false,
         }
     }
 }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants