Proper normalize attribute value normalization #379

dralley · 2022-04-03T01:16:16Z

closes #371

dralley · 2022-04-03T15:52:47Z

Suggestions:

Move this functionality directly to Attribute
Provide a fast path to get the raw value

TODO:

Character reference & entity reference substitution with associated error handling
- Figure out what the API needs to look like

codecov-commenter · 2022-06-20T19:03:15Z

Codecov Report

Merging #379 (538e5cd) into master (e701c4d) will increase coverage by 0.10%.
The diff coverage is 86.95%.

❗ Current head 538e5cd differs from pull request most recent head ac7b67b. Consider uploading reports for the commit ac7b67b to get more accurate results

@@            Coverage Diff             @@
##           master     #379      +/-   ##
==========================================
+ Coverage   61.37%   61.48%   +0.10%     
==========================================
  Files          20       20              
  Lines       10157    10229      +72     
==========================================
+ Hits         6234     6289      +55     
- Misses       3923     3940      +17

Flag	Coverage Δ
unittests	`61.48% <86.95%> (+0.10%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
src/errors.rs	`9.52% <ø> (-2.85%)`	⬇️
src/escapei.rs	`13.90% <0.00%> (ø)`
src/reader.rs	`88.36% <ø> (+0.94%)`	⬆️
src/events/attributes.rs	`94.12% <88.88%> (+3.45%)`	⬆️
src/lib.rs	`21.09% <0.00%> (-4.92%)`	⬇️
src/de/escape.rs	`65.15% <0.00%> (-1.28%)`	⬇️
src/de/seq.rs	`91.83% <0.00%> (-0.76%)`	⬇️
src/se/mod.rs	`93.81% <0.00%> (-0.01%)`	⬇️
src/writer.rs	`90.36% <0.00%> (+0.02%)`	⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e701c4d...ac7b67b.

dralley · 2022-06-22T17:53:39Z

I plan to continue working on this over the next week or two

dralley · 2022-06-23T02:16:42Z

@Mingun Questions:

Currently we have the functions unescaped_value and unescaped_value_with_custom_entities, plus their decode equivalents, which do the unescaping part but don't implement the rest of the XML attribute-value normalization spec. I'm not sure I see any reason for those to continue to exist as far as XML is concerned, but for HTML they make some sense, as HTML only seems to do unescaping without any other normalization of the value.

1. Does that sound accurate / align with your knowledge?
2. Do you think it would make more sense to stick with functional names, like normalized_value / unescaped_value, and rely on the documentation to tell users what they ought to be using, or to switch to more descriptive names such as html_value / xml_value?
3. The behavior of Attributes depends on whether or not .html is set. Should we look at doing something similar here, or would that not be worth the additional complication?
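For reference, the gap between plain unescaping and XML attribute-value normalization can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the names `normalize_attr_value` and `resolve_reference` are hypothetical, only character references and the five predefined entities are handled, and the extra trimming and space-collapsing step that the XML spec applies to non-CDATA attribute types is left out.

```rust
/// Resolve the text between `&` and `;` to a single character.
/// Only character references and the five predefined entities are handled;
/// this is not a validating parser.
fn resolve_reference(name: &str) -> Result<char, String> {
    let code = if let Some(hex) = name.strip_prefix("#x") {
        u32::from_str_radix(hex, 16).map_err(|e| e.to_string())?
    } else if let Some(dec) = name.strip_prefix('#') {
        dec.parse::<u32>().map_err(|e| e.to_string())?
    } else {
        return match name {
            "lt" => Ok('<'),
            "gt" => Ok('>'),
            "amp" => Ok('&'),
            "apos" => Ok('\''),
            "quot" => Ok('"'),
            other => Err(format!("unknown entity `{}`", other)),
        };
    };
    char::from_u32(code).ok_or_else(|| format!("invalid code point {}", code))
}

/// Hypothetical helper: normalize a raw attribute value (CDATA-type rules only).
fn normalize_attr_value(raw: &str) -> Result<String, String> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.char_indices().peekable();
    while let Some((i, ch)) = chars.next() {
        match ch {
            '\r' => {
                // A literal "\r\n" pair counts as one line end, so one space.
                if let Some(&(_, '\n')) = chars.peek() {
                    chars.next();
                }
                out.push(' ');
            }
            // Any other literal whitespace character becomes a space.
            ' ' | '\t' | '\n' => out.push(' '),
            '&' => {
                let end = raw[i..]
                    .find(';')
                    .map(|n| i + n)
                    .ok_or_else(|| format!("unterminated reference at byte {}", i))?;
                // A referenced character is appended as-is, even if it is
                // whitespace (e.g. `&#xA;` stays a newline).
                out.push(resolve_reference(&raw[i + 1..end])?);
                // Skip the characters that made up the reference, including `;`.
                while let Some(&(j, _)) = chars.peek() {
                    if j > end {
                        break;
                    }
                    chars.next();
                }
            }
            other => out.push(other),
        }
    }
    Ok(out)
}
```

Note the asymmetry: a literal tab or newline turns into a space, while the same character written as a character reference survives unchanged. Plain unescaping alone does not produce that distinction.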

Mingun · 2022-06-23T18:28:36Z

1. I haven't studied how HTML attributes behave, so I'll rely on your understanding of the situation. If you include references to the relevant resources in the documentation, I'll be able to learn something about it.
2. I think that functional names are better.
3. We should probably invert things and introduce different types for XML / HTML attributes, then implement only the relevant methods on each. This will also solve some unpleasant things in the current API: we can change the htmlity of the attributes in the middle of iteration, which could probably lead to subtle bugs. I would like to avoid such dangerous usage.

dralley · 2022-06-25T03:10:01Z

I'm basically going off the lack of any discussion of attribute value normalization in the HTML living spec, and this discussion on Stack Overflow:

https://html.spec.whatwg.org/multipage/dom.html#attributes
https://html.spec.whatwg.org/multipage/syntax.html#attributes-2
https://stackoverflow.com/questions/63906320/html5-attribute-value-normalization

> I think that functional names are better

I agree

> We should probably invert things and introduce different types for XML / HTML attributes, then implement only the relevant methods on each. This will also solve some unpleasant things in the current API: we can change the htmlity of the attributes in the middle of iteration, which could probably lead to subtle bugs. I would like to avoid such dangerous usage.

Can they? It looks like all these fields are private. But I feel like this is still the best option. There are already attributes() and html_attributes(), so it's only a matter of changing the types.

The unfortunate thing will be that the code for the two is almost the same, similar enough that it will be really annoying to duplicate.

dralley · 2022-07-04T17:20:01Z

src/events/attributes.rs

+                        let codepoint = escapei::parse_number(entity, idx..end)?;
+                        escapei::push_utf8(&mut normalized, codepoint);
+                    } else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) {
+                        // TODO: recursively apply entity substitution


@Mingun Does the normal unescape() function need to do this as well?

Mingun · 2022-07-08T06:02:34Z

src/events/attributes.rs

+    /// This will allocate unless the raw attribute value does not require normalization.
+    ///
+    /// See also [`normalized_value_with_custom_entities()`](#method.normalized_value_with_custom_entities)
+    pub fn normalized_value(&'a self) -> Result<Cow<'a, [u8]>, EscapeError> {


I think it would be valuable to add examples here. You could probably convert your new tests to doc examples, and that would be enough.

Actually, we should add a decoder parameter to this (and similar) functions and decode first, before normalization.

Maybe also try another approach: introduce a new type for attributes that are always UTF-8 encoded, and add an Attribute::decode(&self) -> Result<Utf8Attribute> function.

Instead of a completely new type, we could try a const bool generic parameter (stable since 1.51; the current MSRV is 1.41.1, from memchr).
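A minimal sketch of how the const bool idea could look, assuming hypothetical names (`Attr`, `DECODED`, `new`, `decode`, `as_str`) that are not this crate's actual API. For brevity, "decoding" here is only a UTF-8 validity check rather than a real encoding conversion.

```rust
use std::borrow::Cow;

/// Hypothetical attribute type. `DECODED = false` means the value holds raw
/// bytes in the document encoding; `DECODED = true` means it is known UTF-8.
struct Attr<'a, const DECODED: bool> {
    key: &'a [u8],
    value: Cow<'a, [u8]>,
}

impl<'a> Attr<'a, false> {
    /// Wrap raw, not yet decoded bytes.
    fn new(key: &'a [u8], value: &'a [u8]) -> Self {
        Attr { key, value: Cow::Borrowed(value) }
    }

    /// Validate the bytes as UTF-8 and move to the "decoded" state.
    /// A real implementation would transcode with the reader's decoder.
    fn decode(&self) -> Result<Attr<'a, true>, std::str::Utf8Error> {
        std::str::from_utf8(&self.value)?;
        Ok(Attr { key: self.key, value: self.value.clone() })
    }
}

impl<'a> Attr<'a, true> {
    /// Only available once the value has been decoded.
    fn as_str(&self) -> &str {
        // Re-validates for simplicity; the bytes were already checked in `decode`.
        std::str::from_utf8(&self.value).expect("validated by decode()")
    }
}
```

With this shape, string-based methods such as normalization could be implemented only on `Attr<'a, true>`, so calling them on undecoded bytes would fail to compile rather than fail at runtime.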

Mingun · 2022-07-08T06:09:35Z

src/events/attributes.rs

+                        let codepoint = escapei::parse_number(entity, idx..end)?;
+                        escapei::push_utf8(&mut normalized, codepoint);
+                    } else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) {
+                        // TODO: recursively apply entity substitution


Yes, I think so

Mingun

> Can they? It looks like all these fields are private.

No, they can't. I confused this with the ability to disable with_checks in the middle of an iteration.

> But I feel like this is still the best option. There are already attributes() and html_attributes(), so it's only a matter of changing the types.

Ok, then let's do that.

> The unfortunate thing will be that the code for the two is almost the same, similar enough that it will be really annoying to duplicate.

If you mean make_normalized_value, you can convert it to a free function and use it from both API methods.
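One way the shared free function could be arranged is sketched below. The signature of `make_normalized_value`, the `normalize_whitespace` flag, and the `XmlAttribute` / `HtmlAttribute` types are assumptions made for illustration; only the five predefined entities are handled, and the HTML side reflects the understanding stated earlier in this thread that HTML only unescapes.

```rust
use std::borrow::Cow;

/// Hypothetical shared helper used by both attribute types.
/// Borrows the input when nothing needs to change, allocates otherwise.
fn make_normalized_value(raw: &str, normalize_whitespace: bool) -> Cow<'_, str> {
    let has_ws = normalize_whitespace
        && raw.contains(|c: char| matches!(c, '\t' | '\n' | '\r'));
    if !raw.contains('&') && !has_ws {
        // Fast path: no entities and no whitespace to rewrite.
        return Cow::Borrowed(raw);
    }
    let mut out = raw.to_owned();
    if normalize_whitespace {
        // Literal whitespace becomes a space; "\r\n" counts as one line end.
        out = out
            .replace("\r\n", " ")
            .replace(|c: char| matches!(c, '\t' | '\n' | '\r'), " ");
    }
    // `&amp;` is replaced last so that `&amp;lt;` yields the text `&lt;`.
    out = out
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&");
    Cow::Owned(out)
}

/// Hypothetical XML attribute: unescaping plus whitespace normalization.
struct XmlAttribute<'a> {
    raw: &'a str,
}

impl<'a> XmlAttribute<'a> {
    fn normalized_value(&self) -> Cow<'a, str> {
        make_normalized_value(self.raw, true)
    }
}

/// Hypothetical HTML attribute: unescaping only.
struct HtmlAttribute<'a> {
    raw: &'a str,
}

impl<'a> HtmlAttribute<'a> {
    fn unescaped_value(&self) -> Cow<'a, str> {
        make_normalized_value(self.raw, false)
    }
}
```

Each type exposes only the method that is meaningful for it, while the logic lives in one place, which avoids the duplication concern raised above.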

Mingun

I've fixed the algorithm and other noted things in my branch, but I do not yet consider it as finished.

Mingun · 2023-01-29T11:34:31Z

benches/microbenches.rs

+
+    group.bench_function("noop_long", |b| {
+        b.iter(|| {
+            criterion::black_box(unescape("just a bit of text without any entities")).unwrap();


Maybe make the long case really long? At least 1 KB.

I would hope that 1 KB attribute values aren't a common thing. But you're right that it should be a bit longer.

dralley · 2023-01-30T06:43:57Z

I did a cursory review of the changes and it looks fine. Would you prefer the commits squashed or kept separate?

I'll address everything else tomorrow.

Mingun · 2023-01-30T14:34:15Z

I prefer to keep them separate. It is somehow psychologically uncomfortable to review when one commit changes more than ~200 lines in each (or just several) files, even if half of them are new tests. ¯_(ツ)_/¯

I think that at least separating the normalization method into its own commit would be a good idea. It is a pretty isolated thing which is nevertheless big enough. That commit needs the following updates:

- Add tests for non-ASCII input
- Introduce a new error kind and return it when depth becomes 0, which means that we have reached the recursion limit
- (could be postponed) I would like the limit to be configurable
- (could be postponed) Add a way to explicitly detect recursion (i.e. track the resolved entities and report which entity was defined recursively)
- (could be postponed) Add the ability to pass metainfo about entities around. That way we can report in the error where the erroneous entity is defined, if the resolver function provides that information
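The depth-based limit from the list above could be sketched as follows. This is an illustration only: `expand`, `ExpandError`, and the `HashMap`-based entity table are hypothetical and do not mirror the PR's actual functions. Here `depth` counts how many levels of text may still be scanned, so reaching 0 produces the dedicated error.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ExpandError {
    /// `depth` reached 0 while expanding nested entities.
    RecursionLimit,
    UnknownEntity(String),
    Unterminated,
}

/// Recursively substitute `&name;` references using `entities`.
fn expand(
    text: &str,
    entities: &HashMap<&str, &str>,
    depth: usize,
) -> Result<String, ExpandError> {
    if depth == 0 {
        return Err(ExpandError::RecursionLimit);
    }
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find('&') {
        // Copy everything before the reference verbatim.
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        let end = after.find(';').ok_or(ExpandError::Unterminated)?;
        let name = &after[..end];
        let replacement = entities
            .get(name)
            .ok_or_else(|| ExpandError::UnknownEntity(name.to_string()))?;
        // The replacement text may contain further references,
        // so it is expanded with one less level of budget.
        out.push_str(&expand(replacement, entities, depth - 1)?);
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

With only a depth counter, a self-referencing entity is reported as a recursion-limit error after the budget is used up; telling the two cases apart needs extra bookkeeping of which entities are currently being expanded.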

dralley · 2023-11-12T16:21:22Z

> Introduce a new error kind and return it when depth becomes 0, which means that we have reached the recursion limit
>
> (could be postponed) I would like the limit to be configurable

How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).

Should we consider keeping the state in Reader and doing something along the lines of

reader.normalize_attribute_value(attr), or
attr.normalize_value_with(resolve_entity, reader) or attr.normalize_value_with(reader) (moving the resolve_entity into Reader entirely)?

Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached. Should that be two separate errors in your opinion, or one error?

dralley · 2023-11-15T05:40:33Z

@Mingun ^

Mingun

> How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).

The most natural way would be to add a new option to reader::Config. That means we need to somehow propagate it to the actual method. Some methods already take a Reader, so it will be simple for them. Maybe we should start with only those methods and add shortcuts only when they are explicitly requested.

Actually, I have already thought about storing the Decoder in the attribute itself. However, because Attribute is currently a struct with public fields, that would be a breaking change, and I don't much like the idea of making the new decoder field public, because it is an implementation detail. Maybe in the end we will store already decoded data.

> Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached.

Yes, of course we should return an error as soon as we find a loop or the recursion limit is exceeded.

> Should that be two separate errors in your opinion, or one error?

Two different ones. libxml2 also has two different errors, as you can see from your link: one is "Detected an entity reference loop", the other is "Maximum entity nesting depth exceeded".
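A sketch of how the two conditions could be reported separately, assuming hypothetical names (`resolve`, `EntityError`, and an `open` stack holding the entities currently being expanded). This is not libxml2's algorithm or this crate's API, only one possible shape.

```rust
use std::collections::HashMap;

/// Hypothetical error type with distinct variants for the two cases.
#[derive(Debug, PartialEq)]
enum EntityError {
    /// The named entity refers to itself, directly or indirectly.
    Loop(String),
    /// Nesting is legitimate but deeper than the configured maximum.
    DepthExceeded,
    Unknown(String),
    Unterminated,
}

/// Expand `&name;` references, tracking the chain of open entities in `open`.
fn resolve(
    text: &str,
    entities: &HashMap<&str, &str>,
    open: &mut Vec<String>,
    max_depth: usize,
) -> Result<String, EntityError> {
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find('&') {
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        let end = after.find(';').ok_or(EntityError::Unterminated)?;
        let name = &after[..end];
        // A name already on the stack means a loop; report it immediately.
        if open.iter().any(|n| n == name) {
            return Err(EntityError::Loop(name.to_string()));
        }
        if open.len() >= max_depth {
            return Err(EntityError::DepthExceeded);
        }
        let value = entities
            .get(name)
            .ok_or_else(|| EntityError::Unknown(name.to_string()))?;
        open.push(name.to_string());
        let expanded = resolve(value, entities, open, max_depth);
        // Pop before propagating any error so the stack stays balanced.
        open.pop();
        out.push_str(&expanded?);
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

Because the loop check runs before the depth check, a cyclic definition is reported as a loop even when `max_depth` is small, and the name carried in `Loop` identifies which entity was defined recursively.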

dralley force-pushed the attr-val-normalization branch 6 times, most recently from e45064f to 401bb77 Compare April 3, 2022 15:30

dralley force-pushed the attr-val-normalization branch from 401bb77 to 9307786 Compare April 3, 2022 18:01

dralley force-pushed the attr-val-normalization branch 5 times, most recently from 538e5cd to ac7b67b Compare June 20, 2022 18:55

dralley force-pushed the attr-val-normalization branch 2 times, most recently from f206a71 to 1a138d6 Compare June 23, 2022 01:52

dralley force-pushed the attr-val-normalization branch 4 times, most recently from 08c0eea to 00a37a0 Compare July 4, 2022 17:13

Mingun reviewed Jul 8, 2022

Mingun mentioned this pull request Jul 9, 2022

Make attribute creation more uniform #413

Closed

dralley mentioned this pull request Jul 10, 2022

Closure-based unescaping with custom entities #415

Merged

dralley force-pushed the attr-val-normalization branch from 4ef9587 to b487e76 Compare January 28, 2023 19:47

dralley requested a review from Mingun January 28, 2023 19:47

dralley force-pushed the attr-val-normalization branch 3 times, most recently from f330695 to c0f0577 Compare January 28, 2023 20:15

dralley changed the title ~~Properly normalize attribute values~~ (Mostly) properly normalize attribute values Jan 28, 2023

dralley force-pushed the attr-val-normalization branch from c0f0577 to 34c2f72 Compare January 28, 2023 20:38

Mingun requested changes Jan 29, 2023

dralley changed the title ~~(Mostly) properly normalize attribute values~~ Proper normalize attribute value normalization Jan 30, 2023

dralley marked this pull request as draft January 30, 2023 06:39

dralley force-pushed the attr-val-normalization branch from 34c2f72 to 3874791 Compare January 30, 2023 06:40

Mingun mentioned this pull request Jan 30, 2023

Release 0.28.0 #549

Closed

13 tasks

dralley force-pushed the attr-val-normalization branch 2 times, most recently from 7f55cd8 to add31b6 Compare January 31, 2023 04:46

dralley force-pushed the attr-val-normalization branch from add31b6 to a72a441 Compare March 13, 2023 00:57

dralley force-pushed the attr-val-normalization branch 2 times, most recently from f317d76 to 69a1934 Compare June 19, 2023 22:48

dralley force-pushed the attr-val-normalization branch from 69a1934 to ff42db2 Compare July 10, 2023 19:40

dralley force-pushed the attr-val-normalization branch from ff42db2 to deed851 Compare August 11, 2023 01:23

dralley mentioned this pull request Oct 7, 2023

are there some stuff here that need some help? dralley/rpmrepo_metadata#2

Open

francisdb mentioned this pull request Oct 20, 2023

xml serde roundtrip loses CR/LF encoding #670

Open

dralley force-pushed the attr-val-normalization branch from deed851 to def940d Compare October 23, 2023 03:21

dralley commented Oct 23, 2023

Add functions for attribute value normalization

5817baf

closes tafia#371

dralley force-pushed the attr-val-normalization branch from def940d to 5817baf Compare November 12, 2023 05:52

Mingun reviewed Nov 15, 2023
