Skip to content
This repository has been archived by the owner on May 30, 2022. It is now read-only.

Fix attribute parsing #4

Merged
merged 8 commits into from May 8, 2022
Merged

Fix attribute parsing #4

merged 8 commits into from May 8, 2022

Conversation

Mingun
Copy link
Owner

@Mingun Mingun commented May 3, 2022

This PR adds tests for attribute parsing and fixes bugs found. It introduces new AttrError type that now returned by Attributes::next() and contains all possible errors when attributes is parsed.

Also it introduces a new type Attr which is by design is borrowed-only. In version 0.24 I plan to change Attributes iterator and return that new type.

@dralley, it would be great to run performance tests and see the results and maybe make optimizations like that you've done in #3

@Mingun Mingun mentioned this pull request May 3, 2022
@dralley
Copy link
Contributor

dralley commented May 3, 2022

Sure, I'll try to review in the next few days (when time allows)

@dralley
Copy link
Contributor

dralley commented May 3, 2022

I assume this is an exhaustive list of (tests demonstrating) the various issues with the previous parsing algorithm? Are there any non-covered cases that you're aware of (ignoring the attribute value normalization work that I'll eventually get back to)

    events::attributes::html::duplicated::with_check::double_quoted
    events::attributes::html::duplicated::with_check::key_only
    events::attributes::html::duplicated::with_check::single_quoted
    events::attributes::html::duplicated::with_check::unquoted
    events::attributes::html::single::missed_value
    events::attributes::html::sparsed::double_quoted
    events::attributes::html::sparsed::key_contains_invalid
    events::attributes::html::sparsed::key_only
    events::attributes::html::sparsed::key_start_invalid
    events::attributes::html::sparsed::missed_value
    events::attributes::html::sparsed::single_quoted
    events::attributes::html::sparsed::unquoted
    events::attributes::xml::duplicated::with_check::double_quoted
    events::attributes::xml::duplicated::with_check::key_only
    events::attributes::xml::duplicated::with_check::single_quoted
    events::attributes::xml::duplicated::with_check::unquoted
    events::attributes::xml::duplicated::without_check::key_only
    events::attributes::xml::duplicated::without_check::unquoted
    events::attributes::xml::first::key_only
    events::attributes::xml::first::missed_value
    events::attributes::xml::first::unquoted
    events::attributes::xml::single::key_only
    events::attributes::xml::single::missed_value
    events::attributes::xml::sparsed::missed_value

@Mingun
Copy link
Owner Author

Mingun commented May 4, 2022

I assume this is an exhaustive list of (tests demonstrating) the various issues with the previous parsing algorithm?

Yes, it is. If you checkout this commit you will see what exactly errors present. Particularly, in some cases key contains spaces, in many cases after returning Err next next() call returns None and in some returns next attribute.

The attributes parser itself is very rough and do not follow specification. It doesn't check for illegal symbols in a key and unquoted value, entities does not expanded (but that is fine at that level -- you anyway needs an entity manager to get a list of defined entities). For now I left that as "by design" -- correctness of each attribute should be checked after getting it. But I have plans to try to replace manually written parser with one generated by grammar using rust-peg with more robust parsing.

If you want to work on attribute normalization, some my minds: first, you need to introduce an entity manager. That manager should collect entities defined in a document and allow user to define external entities. Collecting entities requires, in turn, DTD parser. There already exist crates for that, I did not look deeply. May be we can use some of them.

Then normalization would require reference to the entity manager. No need to store it inside an attribute, it is better to create a dedicated type for a well-formed attribute that would created by normalization. In fact, that PR already did that differentiation: Attr by design contains borrowed unprocessed data to the input, and created only by the parser, but Attribute could be created by the user and have own data.

Introducing entity manager and doing attribute normalization could be done in different PRs, because entity manager should be used everywhere

@Mingun Mingun force-pushed the fix-attributes branch 2 times, most recently from 7d4131b to e090006 Compare May 5, 2022 17:09
Copy link
Owner Author

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, new algorithm around 5 us (~10%) slower than previous, which, I believe, because of recovery checks. Not a significance difference

None => return attr!(self.position..len),
pub fn next(&mut self, slice: &[u8]) -> Option<AttrResult> {
let mut iter = match self.recover(slice) {
Some(offset) => (offset..).zip(slice[offset..].iter()),
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this instead of

slice.iter().enumerate().skip(offset)

really improves performance around by 10 us (~15%), according to Criterion, so applied it

}
};

match iter.find(|(_, &b)| b == quote) {
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried to replace this by

match memchr::memchr(quote, &slice[start_value..]) {
  ...
}

but that does not give a significance difference

src/events/attributes.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
/// That error can be raised only if iterator in XML mode.
UnquotedValue(usize),
/// Attribute value not finished with a matching quote, position relative to
/// start of owning tag and a quote is provided. That position is always a last
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this to mean that "the position will be after a quote character" but to say that the position is relative to the "start of the owning tag and a quote" is a little bit confusing. It sounds like it could mean either of

A) the position will be on a quote character, which isn't true
B) the position is relative to both, which isn't possible

I think "and a quote" could be removed and it would still make sense (given the examples).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to rephrase it and I'll add it to the other your suggestions.

src/events/attributes.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
@dralley
Copy link
Contributor

dralley commented May 5, 2022

@Mingun I'm going through commit-by-commit so this isn't a final review yet. I'll continue in a few hours.

The code looks great so far, I only have a few minor grammatical / wording nitpicks.

@Mingun
Copy link
Owner Author

Mingun commented May 5, 2022

Ok. Yeah, grammar should be checked because I'm not a native speaker :) I tried, but native speaker always can write it better

@Mingun
Copy link
Owner Author

Mingun commented May 6, 2022

@dralley, tell me when you finish with review. I'll plan to make a 0.23 release after merging this.

Copy link
Contributor

@dralley dralley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mingun Here's another round of review, I have some questions and some code suggestions

@@ -41,7 +42,7 @@ where
/// do not store reference to `Attributes` itself but instead create
/// a new object on each advance of `Attributes` iterator, so we need
/// to restore last position before advance.
position: usize,
iter: IterState,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some other questions about this (below) but the docstring needs to be updated.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow forgot to save changes in the editor, so comment is changed outside of this PR in f9453b9

/// if `with_checks`, contains the ranges corresponding to the
/// attribute names already parsed in this `Element`
consumed: Vec<Range<usize>>,
/// Iterator state, independent from the actual source of bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind the change?

It leads to some weird interfaces below - e.g. all the methods of IterState take an external slice, and the caller could theoretically call iter.next(...) or other methods with two entirely different slices. The slice has to get passed around more often, the methods here are now delegating directly to self.state, etc.

Intuitively leaving everything in the Attributes struct as it was makes more sense to me, but I might be missing some details.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required for another iterator implementation to share the code. The problem is described in tafia/quick-xml#337 (comment): we have a borrowing iterator, but doesn't have an owned iterator. For example, for MapAccess we need an owned iterator and in this PR MapAccess de facto implements it. In the future we should expose that iterator as public API

In any case IterState is supposed to stay an internal detail and will never be exposed in public API. All internal usages will guaranties that they will pass correct slices

}
Some((i, _)) => err!(AttrError::UnquotedValue(i)),
None => return attr!(start_key..end_key),
match self.state.next(self.bytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels awkward because it's not the signature you'd expect from a next() method (even though this isn't trying to implement Iterator::next(), it's still the same thing conceptually). Usually iterators know what they're iterating over.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required for lifetime correctness. Owned iterator will store a Cow<'input, [u8]>, borrowing iterator stores a &'input [u8]. If you will try to move slice argument of next into IterState and store in a Cow<'input, [u8]>, even when borrowing iterator always will pass a Cow::Borrowed, borrow checker will not allow you to return slices with the 'input lifetime, because Cow potentially can be owned.

You also cannot store a &'input [u8] in the IterState, because owned iterator is supposed to be

struct OwnedAttributes<'input> {
  slice: Cow<'input, [u8]>,
  state: IterState,
}

and you cannot store in that struct reference to a field of this struct (state would be referenced to a slice)

/// [events]: crate::events::Event::Start
Unquoted(T, T),
/// Attribute without value. Attribute key provided. This is HTML-style attribute,
/// it can be returned only in HTML-mode parsing only. In XML mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"only in HTML-mode parsing only" => I think you meant to remove one of the "only"s

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

impl<T> Attr<T> {
/// Maps an `Attr<T>` to `Attr<U>` by applying a function to a contained key and value.
#[inline]
pub fn map<U, F>(self, mut f: F) -> Attr<U>
Copy link
Contributor

@dralley dralley May 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Regarding the whole implementation of Attr, not this specific method)

I see where you're coming from with all of this, but there are some downsides to doing it this way.

  • it would add a lot more branches to the generated code, because every access has to go through match
  • also general code bloat when it gets inlined everywhere that a dev uses .key(), .value(), etc.
  • it feels like a detail few users would care too much about, even ones parsing HTML

As far as I can tell, HTML doesn't have any attributes that are required to not have a value - attributes with no value, and attributes with an empty-string value are supposed to be treated identically.

A number of attributes are boolean attributes. The presence of a boolean attribute on an element represents the true value, and the absence of the attribute represents the false value.

If the attribute is present, its value must either be the empty string or a value that is an ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#boolean-attributes

Just the attribute name. The value is implicitly the empty string.

https://html.spec.whatwg.org/multipage/syntax.html#syntax-attr-empty

So given that the spec says that all of these syntaxes are equivalent ways to express the attribute, I'm not sure that all of this is worth it, compared to just having a struct with key + value fields. I wouldn't have a problem with a third field to denote unquoted, empty, single-quote, double-quote though. It's just not important enough to treat like Option or Result and force dealing with specific variants.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That benchmarks shows only small 5 us regression (from ~75 to ~80 with checks, from ~44 to ~49 without checks, from ~155 to ~159 for try_get_attribute) in parsing which, I think, not a big pay for an additional info, that could be useful:

fast-xml/benches/bench.rs

Lines 256 to 333 in 6a3fd90

/// Benchmarks parsing attributes from events
fn attributes(c: &mut Criterion) {
let mut group = c.benchmark_group("attributes");
group.bench_function("with_checks = true", |b| {
b.iter(|| {
let mut r = Reader::from_reader(PLAYERS);
r.check_end_names(false).check_comments(false);
let mut count = criterion::black_box(0);
let mut buf = Vec::new();
loop {
match r.read_event(&mut buf) {
Ok(Event::Empty(e)) => {
for attr in e.attributes() {
let _attr = attr.unwrap();
count += 1
}
}
Ok(Event::Eof) => break,
_ => (),
}
buf.clear();
}
assert_eq!(count, 1041);
})
});
group.bench_function("with_checks = false", |b| {
b.iter(|| {
let mut r = Reader::from_reader(PLAYERS);
r.check_end_names(false).check_comments(false);
let mut count = criterion::black_box(0);
let mut buf = Vec::new();
loop {
match r.read_event(&mut buf) {
Ok(Event::Empty(e)) => {
for attr in e.attributes().with_checks(false) {
let _attr = attr.unwrap();
count += 1
}
}
Ok(Event::Eof) => break,
_ => (),
}
buf.clear();
}
assert_eq!(count, 1041);
})
});
group.bench_function("try_get_attribute", |b| {
b.iter(|| {
let mut r = Reader::from_reader(PLAYERS);
r.check_end_names(false).check_comments(false);
let mut count = criterion::black_box(0);
let mut buf = Vec::new();
loop {
match r.read_event(&mut buf) {
Ok(Event::Empty(e)) if e.name() == b"player" => {
for name in ["num", "status", "avg"] {
if let Some(_attr) = e.try_get_attribute(name).unwrap() {
count += 1
}
}
assert!(e
.try_get_attribute("attribute-that-doesn't-exist")
.unwrap()
.is_none());
}
Ok(Event::Eof) => break,
_ => (),
}
buf.clear();
}
assert_eq!(count, 150);
})
});
group.finish();
}

Besides, I'll also have a next PR with Span support and I want to expose the information from where that or another piece of data come. That is dependent of the attribute shape and useful for reporting errors.

bq. I wouldn't have a problem with a third field to denote unquoted, empty, single-quote, double-quote though. It's just not important enough to treat like Option or Result and force dealing with specific variants.

That wouldn't be idiomatic Rust. Usually you will use .key() and .value() methods, so you hardly likely will deal with the concrete variant. Moreover, I've introduced a new Attr type instead of modifying Attribute to allow users to not deal with attribute shape when it do not matter. There will only one conversion from Attr to Attribute that just moves shape erasing from internals of iterator to a user-defined place. That should not bloat code well and likely will be optimized by the compiler if you did that just after getting the attribute from the iterator.

Copy link
Contributor

@dralley dralley May 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wouldn't be idiomatic Rust.

There are cases where it actually is idiomatic, or at least a common pattern. std::io::ErrorKind is used because all variants have the same signature (like is the case here). nom::error::ErrorKind is the same.

The problem with going purely off the benchmarks here is that depending on the consistency of the input, the branch predictor is going to get it right most of the time (and if it's a small enough input, the the branch predictor is going to learn it over the course of many iterations even if it's not consistent), so it's not going to look that expensive.

But it's very dependent on the input, and how much the extra bloat hurts depends a lot on what code the user is writing and how much cache their CPU has. It's just a lot of extra stuff to put in such a hot path.

Maybe it could look like this?

struct Attr {
    key: ...,
    value: ...,
    shape: AttrShape,
}

enum AttrShape {
    Unqoted(..span stuff..),
    DoubleQuoted(..span stuff..),
    SingleQuoted(..span stuff..),
    Empty(..span stuff..),
}

And then the user could avoid the cost completely if they don't care about it, and the Span work still has an obvious place to go.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, that what you suggesting, is implemented now:

  • your AttrShape is Attr<Range<usize>> -- that is the type that returned by IterState::next()
  • your Attr is the future Attribute

Because we in any case need to get spans for key and value, and because value can be missed, and because I want to keep only one parser implementation and not copy it for different types, the current IterState design seems to me the almost only possible (the other way is return (Range<usize>, Option<Range<usize>>) instead but I don't think that it gives any benefits while reducing code clarity).

I see that your point in possible code bloat and/or inconveniences to use match constantly, but that is impossible to trigger from the user code now, because the only way to create an Attr instance in the user code is to create it by yourself. IterState is a private type. The only place where conversion from Attr to Attribute is performed now is inside Attributes::next():

impl<'a> Iterator for Attributes<'a> {
type Item = Result<Attribute<'a>, AttrError>;
#[inline]
fn next(&mut self) -> Option<Self::Item> {
match self.state.next(self.bytes) {
None => None,
Some(Ok(a)) => Some(Ok(a.map(|range| &self.bytes[range]).into())),
Some(Err(e)) => Some(Err(e)),
}
}
}

I still believe that compiler is able to optimize all right there. Maybe that place seems too generic, because I hide conversion under map() + Into<Attribute>, but it is fairly straight-ford

Attr::Unquoted(key, value) => (key, Some(value)),
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all this code would be deleted up to my previous comment. key() and value() methods would stay but it would just be a simple accessor.

@dralley
Copy link
Contributor

dralley commented May 8, 2022

@Mingun Barring the two docstrings that need to be fixed, let's call this approved - I'd like to experiment for myself to either better understand the problems you've discussed and maybe gather some concrete data about the costs or lack thereof of the extra branching - but since that will take some time I won't block the PR on it. The tests pass at least.

Mingun and others added 8 commits May 8, 2022 14:02
Also fix incorrect references to `Attributes` instead of `Attribute`
.find(|&(_, &b)| b == b'=' || is_whitespace(b))

cannot return Some((_, b'"')) or Some((_, b'\'')) because `"` and `'` are not whitespaces
Because iterator over attributes is recovering, user can want to inspect
possible errors and filtering out irrelevant errors should help him

Co-authored-by: Daniel Alley <dalley@redhat.com>
Because now error type is PartialEq, we can use `assert_eq!` directly,
which is especially nice when used with pretty_assertions crate
failures (24):
    events::attributes::html::duplicated::with_check::double_quoted
    events::attributes::html::duplicated::with_check::key_only
    events::attributes::html::duplicated::with_check::single_quoted
    events::attributes::html::duplicated::with_check::unquoted
    events::attributes::html::single::missed_value
    events::attributes::html::sparsed::double_quoted
    events::attributes::html::sparsed::key_contains_invalid
    events::attributes::html::sparsed::key_only
    events::attributes::html::sparsed::key_start_invalid
    events::attributes::html::sparsed::missed_value
    events::attributes::html::sparsed::single_quoted
    events::attributes::html::sparsed::unquoted
    events::attributes::xml::duplicated::with_check::double_quoted
    events::attributes::xml::duplicated::with_check::key_only
    events::attributes::xml::duplicated::with_check::single_quoted
    events::attributes::xml::duplicated::with_check::unquoted
    events::attributes::xml::duplicated::without_check::key_only
    events::attributes::xml::duplicated::without_check::unquoted
    events::attributes::xml::first::key_only
    events::attributes::xml::first::missed_value
    events::attributes::xml::first::unquoted
    events::attributes::xml::single::key_only
    events::attributes::xml::single::missed_value
    events::attributes::xml::sparsed::missed_value
Introduce new `Attr` type that stores not only an attribute content, but also its shape
@Mingun Mingun merged commit 30e0c23 into master May 8, 2022
@Mingun
Copy link
Owner Author

Mingun commented May 8, 2022

@dralley, thanks for your work on reviewing this PR!

@Mingun Mingun deleted the fix-attributes branch May 8, 2022 09:28
@Mingun Mingun added the bug Something isn't working label May 9, 2022
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this pull request May 30, 2022
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [quick-xml](https://github.com/tafia/quick-xml) | dependencies | patch | `0.23.0-alpha3` -> `0.23.0` |
| [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.22.0` -> `0.23.0` |

---

### Release Notes

<details>
<summary>tafia/quick-xml</summary>

### [`v0.23.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#&#8203;0230----2022-05-08)

-   feat: add support for `i128` / `u128` in attributes or text/CDATA content
-   test: add tests for malformed inputs for serde deserializer
-   fix: allow to deserialize `unit`s from any data in attribute values and text nodes
-   refactor: unify errors when EOF encountered during serde deserialization
-   test: ensure that after deserializing all XML was consumed
-   feat: add `Deserializer::from_str`, `Deserializer::from_slice` and `Deserializer::from_reader`
-   refactor: deprecate `from_bytes` and `Deserializer::from_borrowing_reader` because
    they are fully equivalent to `from_slice` and `Deserializer::new`
-   refactor: reduce number of unnecessary copies when deserialize numbers/booleans/identifiers
    from the attribute and element names and attribute values
-   fix: allow to deserialize `unit`s from text and CDATA content.
    `DeError::InvalidUnit` variant is removed, because after fix it is no longer used
-   fix: `ElementWriter`, introduced in [#&#8203;274](tafia/quick-xml#274)
    (0.23.0-alpha2) now available to end users
-   fix: allow lowercase `<!doctype >` definition (used in HTML 5) when parse document from `&[u8]`
-   test: add tests for consistence behavior of buffered and borrowed readers
-   fix: produce consistent error positions in buffered and borrowed readers
-   feat: `Error::UnexpectedBang` now provide the byte found
-   refactor: unify code for buffered and borrowed readers
-   fix: fix internal panic message when parse malformed XML
    ([#&#8203;344](tafia/quick-xml#344))
-   test: add tests for trivial documents (empty / only comment / `<root>...</root>` -- one tag with content)
-   fix: CDATA was not handled in many cases where it should
-   fix: do not unescape CDATA content because it never escaped by design.
    CDATA event data now represented by its own `BytesCData` type
    ([quick-xml#&#8203;311](tafia/quick-xml#311))
-   feat: add `Reader::get_ref()` and `Reader::get_mut()`, rename
    `Reader::into_underlying_reader()` to `Reader::into_inner()`
-   refactor: now `Attributes::next()` returns a new type `AttrError` when attribute parsing failed
    ([#&#8203;4](Mingun/fast-xml#4))
-   test: properly test all paths of attributes parsing ([#&#8203;4](Mingun/fast-xml#4))
-   feat: attribute iterator now implements `FusedIterator` ([#&#8203;4](Mingun/fast-xml#4))
-   fix: fixed many errors in attribute parsing using iterator, returned from `attributes()`
    or `html_attributes()` ([#&#8203;4](Mingun/fast-xml#4))

</details>

---

### Configuration

📅 **Schedule**: At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, click this checkbox.

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1377
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants