-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ID generator v0.2 on builder #223
Conversation
To avoid leaking internal logic, the v0.2 ID generator is now a method on the parser builder.
} | ||
}) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is a good start and in my test dataset it correctly classifies aproblemsquared.feeds
benjojo.feeds
czwórka-kulturka.feeds
delan.feeds
epilys.feeds
hannahlyse-this.feeds
isis.feeds
julia_evans.feeds
kate_murphy.feeds
kdist.feeds
kdist.xml.feeds
ken_schriff.feeds
Kia.feeds
lateral.feeds
Pirate.feeds
poptart.feeds
rabbit-hole.feeds
rustlang-blog.feeds
RUSTSEC.feeds
stapelberg-debian.feeds
stapelberg-distri.feeds
The6P4C.feeds
ThePhD.feeds
whitequark.feeds
наб.feeds
But this still incorrectly classifies 2-girls-1-podcast.ids
cpp-grey.ids
eater.ids
GregDavill.ids
magan-he-danai.ids
marco.ids
mia-kreanikacji.ids
rae-the-doe.ids
tamen-de-gushi.ids
tef.ids
These tests come from https://github.com/nabijaczleweli/feembox/tree/trunk/test-data/feed.ids as driven by https://github.com/nabijaczleweli/feembox/blob/07a90ffe32a69822d9d9681c911b5a961f1bf57e/tests/util/mod.rs#L47 (but s/parse_feed/FeedBuilder::new().id_generator_v0_2().build().parse/
)
At least in some cases (
generate_id_from_link_and_title="https://www.dailydot.com/tags/2-girls-1-podcast" Some("2 Girls 1 Podcast")
generate_id_from_link_and_title="http://morwenn.github.io/cpp-gray" Some("cpp-gray")
generate_id_from_link_and_title="https://blog.eater.me" Some("eater")
generate_id_from_link_and_title="https://gregdavill.com" Some("Greg Davill's Projects")
) this is fixed by doing
let mut hr = link.href.clone();
hr.push('/');
hasher.write(hr.as_bytes());
in generate_id_from_link_and_title().
This obviously breaks everything else, but makes it obvious that it wants more complicated logic that matches this condition in 0.2: https://github.com/feed-rs/feed-rs/pull/189/files#diff-902de347eed4eef0831eb024307ec9eb55dab63e247d0e8f7ba2f016848128baR638
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, naturally it's because this implementation re-parses Link::href. So an input of https://blog.benjojo.co.uk
becomes href=https://blog.benjojo.co.uk/
as part of links
given to id_generator
, so then re-parsing it in id_generator_v0_2() sees "https://blog.benjojo.co.uk/"
as both the href and uri result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not pretty, but we need to inject the bool into every Link::new invocation in the parser, so this is the simplest one:
diff --git a/feed-rs/src/model.rs b/feed-rs/src/model.rs
index 3756d83..b337896 100644
--- a/feed-rs/src/model.rs
+++ b/feed-rs/src/model.rs
@@ -1,4 +1,5 @@
use std::time::Duration;
+use std::cell::Cell;
use chrono::{DateTime, Utc};
use mime::Mime;
@@ -628,11 +629,20 @@ pub struct Link {
/// The length of the resource, in bytes.
pub length: Option<u64>,
}
+thread_local! {
+ pub(crate) static LINK_V0_2: Cell<bool> = Cell::new(false);
+}
impl Link {
pub(crate) fn new<S: AsRef<str>>(href: S, base: Option<&Url>) -> Link {
let href = match util::parse_uri(href.as_ref(), base) {
- Some(uri) => uri.to_string(),
+ Some(uri) => {
+ let mut ret: String = uri.into();
+ if LINK_V0_2.get() && ret.ends_with('/') && !href.as_ref().ends_with('/') {
+ ret.pop();
+ }
+ ret
+ },
None => href.as_ref().to_string(),
}
.trim()
diff --git a/feed-rs/src/parser/mod.rs b/feed-rs/src/parser/mod.rs
index 26200d7..8fd2ca0 100644
--- a/feed-rs/src/parser/mod.rs
+++ b/feed-rs/src/parser/mod.rs
@@ -11,6 +11,7 @@ use crate::model;
use crate::parser::util::{IdGenerator, TimestampParser};
use crate::xml;
use crate::xml::NS;
+use crate::model::LINK_V0_2;
mod atom;
mod json;
@@ -106,6 +107,7 @@ pub struct Parser {
base_uri: Option<String>,
id_generator: Box<IdGenerator>,
timestamp_parser: Box<TimestampParser>,
+ link_v0_2: bool,
}
impl Parser {
@@ -143,6 +145,7 @@ impl Parser {
// Buffer the reader for performance (e.g. when streaming from a network) and so we can peek to determine the type of content
let mut input = BufReader::new(source);
+ LINK_V0_2.set(self.link_v0_2);
// Determine whether this is XML or JSON and call the appropriate parser
input.fill_buf()?;
let first_char = input.buffer().iter().find(|b| **b == b'<' || **b == b'{').map(|b| *b as char);
@@ -153,6 +156,7 @@ impl Parser {
_ => Err(ParseFeedError::ParseError(ParseErrorKind::NoFeedRoot)),
};
+ LINK_V0_2.set(false);
// Post processing as required
if let Ok(mut feed) = result {
@@ -226,6 +230,7 @@ pub struct Builder {
base_uri: Option<String>,
id_generator: Box<IdGenerator>,
timestamp_parser: Box<TimestampParser>,
+ link_v0_2: bool,
}
impl Builder {
@@ -246,6 +251,7 @@ impl Builder {
base_uri: self.base_uri,
id_generator: self.id_generator,
timestamp_parser: self.timestamp_parser,
+ link_v0_2: self.link_v0_2,
}
}
@@ -259,16 +265,10 @@ impl Builder {
}
/// Registers an ID generator compatible with v0.2 of feed-rs
- pub fn id_generator_v0_2(self) -> Self {
+ pub fn id_generator_v0_2(mut self) -> Self {
+ self.link_v0_2 = true;
self.id_generator(|links, title, _uri| {
- // If we have a link without relative components, use that
if let Some(link) = links.iter().find(|l| l.rel.is_none()) {
- // Trim the trailing slash if it exists
- let mut link = model::Link::new(link.href.clone(), None);
- if link.href.ends_with('/') {
- link.href.pop();
- }
-
generate_id_from_link_and_title(&link, title)
} else {
util::uuid_gen()
@@ -293,6 +293,7 @@ impl Default for Builder {
base_uri: None,
id_generator: Box::new(generate_id),
timestamp_parser: Box::new(util::parse_timestamp_lenient),
+ link_v0_2: false,
}
}
}
To avoid leaking internal logic, the v0.2 ID generator is now a method on the parser builder.