Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper normalize attribute value normalization #379

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dralley
Copy link
Collaborator

@dralley dralley commented Apr 3, 2022

closes #371

@dralley dralley force-pushed the attr-val-normalization branch 6 times, most recently from e45064f to 401bb77 Compare April 3, 2022 15:30
@dralley
Copy link
Collaborator Author

dralley commented Apr 3, 2022

Suggestions :

  • Move this functionality directly to Attribute
  • Provide a fast path to get the raw value

TODO:

  • Character reference & entity reference substitution with associated error handling
    • Figure out what the API needs to look like

@dralley dralley force-pushed the attr-val-normalization branch 5 times, most recently from 538e5cd to ac7b67b Compare June 20, 2022 18:55
@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2022

Codecov Report

Merging #379 (538e5cd) into master (e701c4d) will increase coverage by 0.10%.
The diff coverage is 86.95%.

❗ Current head 538e5cd differs from pull request most recent head ac7b67b. Consider uploading reports for the commit ac7b67b to get more accurate results

@@            Coverage Diff             @@
##           master     #379      +/-   ##
==========================================
+ Coverage   61.37%   61.48%   +0.10%     
==========================================
  Files          20       20              
  Lines       10157    10229      +72     
==========================================
+ Hits         6234     6289      +55     
- Misses       3923     3940      +17     
Flag Coverage Δ
unittests 61.48% <86.95%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/errors.rs 9.52% <ø> (-2.85%) ⬇️
src/escapei.rs 13.90% <0.00%> (ø)
src/reader.rs 88.36% <ø> (+0.94%) ⬆️
src/events/attributes.rs 94.12% <88.88%> (+3.45%) ⬆️
src/lib.rs 21.09% <0.00%> (-4.92%) ⬇️
src/de/escape.rs 65.15% <0.00%> (-1.28%) ⬇️
src/de/seq.rs 91.83% <0.00%> (-0.76%) ⬇️
src/se/mod.rs 93.81% <0.00%> (-0.01%) ⬇️
src/writer.rs 90.36% <0.00%> (+0.02%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e701c4d...ac7b67b. Read the comment docs.

@dralley
Copy link
Collaborator Author

dralley commented Jun 22, 2022

I plan to continue working on this over the next week or two

@dralley dralley force-pushed the attr-val-normalization branch 2 times, most recently from f206a71 to 1a138d6 Compare June 23, 2022 01:52
@dralley
Copy link
Collaborator Author

dralley commented Jun 23, 2022

@Mingun Questions:

Currently we have the functions unescaped_value, unescaped_value_with_custom_entities and their decode equivalents, that do the escaping part but don't implement the rest of the XML attribute-value-normalization spec. I'm not sure I see any reason for those to continue to exist as far as XML is concerned, but for HTML it makes some sense, as HTML only seems to do unescaping without any other normalization of the value.

  1. Does that sound accurate / align with your knowledge
  2. Do you think it would make more sense to stick with functional names, like normalized_value / unescaped_value and rely on the documentation to tell users what they ought to be using, or switch more descriptive names such as html_value / xml_value
  3. The behavior of Attributes depends on whether or not .html is set, should we look at doing something similar here, or would that not be worth the additional complication

@Mingun
Copy link
Collaborator

Mingun commented Jun 23, 2022

  1. I haven't studied a situation about HTML attributes, therefore, I rely on your understanding of the situation. Then, if you include references to relevant resources in the documentation, I'll be able to learn something about that
  2. I think that functional names are better
  3. Probably we should inverse things and introduce different types for XML / HTML attributes, then implement only relevant methods on each. This will also solve some unpleasant things in the current API -- we can change htmlity of the attributes in the middle of iteration that probably could to lead to sophisticated bugs. I would like to avoid such dangerous usage.

@dralley
Copy link
Collaborator Author

dralley commented Jun 25, 2022

I'm basically going off of the lack of any kind of discussion of attribute value normalization in the HTML living spec, and this discussion on stackoverflow

https://html.spec.whatwg.org/multipage/dom.html#attributes
https://html.spec.whatwg.org/multipage/syntax.html#attributes-2
https://stackoverflow.com/questions/63906320/html5-attribute-value-normalization

I think that functional names are better

I agree

Probably we should inverse things and introduce different types for XML / HTML attributes, then implement only relevant methods on each. This will also solve some unpleasant things in the current API -- we can change htmlity of the attributes in the middle of iteration that probably could to lead to sophisticated bugs. I would like to avoid such dangerous usage.

Can they? It looks like all these fields are private. But, I feel like this is still the best option. There is already attributes() and html_attributes(), it's only a matter of changing the types.

The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate.

@dralley dralley force-pushed the attr-val-normalization branch 4 times, most recently from 08c0eea to 00a37a0 Compare July 4, 2022 17:13
let codepoint = escapei::parse_number(entity, idx..end)?;
escapei::push_utf8(&mut normalized, codepoint);
} else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) {
// TODO: recursively apply entity substitution
Copy link
Collaborator Author

@dralley dralley Jul 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mingun Does the normal unescape() function need to do this as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so

benches/microbenches.rs Outdated Show resolved Hide resolved
/// This will allocate unless the raw attribute value does not require normalization.
///
/// See also [`normalized_value_with_custom_entities()`](#method.normalized_value_with_custom_entities)
pub fn normalized_value(&'a self) -> Result<Cow<'a, [u8]>, EscapeError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, it will be valuable to add examples here. Actually, probably you can convert your new tests to doc examples and that will be enough

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we should add a decoder parameter to that (and similar) functions and decode first before normalization.

Maybe also try another approach: introduce a new type of always UTF-8 encoded attributes and add a Attribute::decode(&self) -> Result<Utf8Attribute> function.

Instead of completely new type we could try to use const bool generic parameter (stable since 1.51, current MSRV is 1.41.1, from memchr)

src/events/attributes.rs Outdated Show resolved Hide resolved
let codepoint = escapei::parse_number(entity, idx..end)?;
escapei::push_utf8(&mut normalized, codepoint);
} else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) {
// TODO: recursively apply entity substitution
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can they? It looks like all these fields are private.

No, they can't. I confused with the ability to disable with_checks in the middle of an iteration.

But, I feel like this is still the best option. There is already attributes() and html_attributes(), it's only a matter of changing the types.

Ok, then let's do that.

The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate.

If you talk about make_normalized_value, you can convert it to a free function and use it in both API methods.

@dralley dralley requested a review from Mingun January 28, 2023 19:47
@dralley dralley force-pushed the attr-val-normalization branch 3 times, most recently from f330695 to c0f0577 Compare January 28, 2023 20:15
@dralley dralley changed the title Properly normalize attribute values (Mostly) properly normalize attribute values Jan 28, 2023
Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed the algorithm and other noted things in my branch, but I do not yet consider it as finished.


group.bench_function("noop_long", |b| {
b.iter(|| {
criterion::black_box(unescape("just a bit of text without any entities")).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe made long really long? A 1KB at least

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that 1kb attribute values isn't a common thing. But you're right that it should be a bit longer.

Changelog.md Outdated Show resolved Hide resolved
benches/microbenches.rs Outdated Show resolved Hide resolved
src/escapei.rs Outdated Show resolved Hide resolved
src/escapei.rs Outdated Show resolved Hide resolved
src/events/attributes.rs Outdated Show resolved Hide resolved
@dralley dralley changed the title (Mostly) properly normalize attribute values Proper normalize attribute value normalization Jan 30, 2023
@dralley dralley marked this pull request as draft January 30, 2023 06:39
@dralley
Copy link
Collaborator Author

dralley commented Jan 30, 2023

I did a cursory review of the changes and it looks fine. Would you prefer the commits squashed or kept separate?

I'll address everything else tomorrow.

@Mingun Mingun mentioned this pull request Jan 30, 2023
13 tasks
@Mingun
Copy link
Collaborator

Mingun commented Jan 30, 2023

I prefer to kept separate. It is somehow psychologically uncomfortable for review when one commit changes more than ~200 lines in each (or just several) files, even if half of them -- new tests. ¯_(ツ)_/¯

I think, that at least separating normalization method to it's own commit would be a good idea. This is pretty isolated thing which, however, is big enough. That commit needed to be updated:

  • Add tests for non-ASCII input
  • Introduce new error kind and return it when depth become 0. That means that we reach recursion limit
  • (could be postponed) I would like to have limit configurable
  • (could be postponed) Add a way to explicitly detect recursion (i.e. track the resolved entities and report which entity was defined recursively)
  • (could be postponed) Add an ability to pass metainfo around entities. That way we can provide a way to report in error where the erroneous entity is defined, if resolver function provide that information

src/events/attributes.rs Outdated Show resolved Hide resolved
@dralley
Copy link
Collaborator Author

dralley commented Nov 12, 2023

Introduce new error kind and return it when depth become 0. That means that we reach recursion limit

(could be postponed) I would like to have limit configurable

How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).

Should we consider keeping the state in Reader and doing something along the lines of

  • reader.normalize_attribute_value(attr), or
  • attr.normalize_value_with(resolve_entity, reader) or attr.normalize_value_with(reader) (moving the resolve_entity into Reader entirely)?

Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached. Should that be two separate errors in your opinion, or one error?

@dralley
Copy link
Collaborator Author

dralley commented Nov 15, 2023

@Mingun ^

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).

Most naturally would be have a new option in reader::Config. That mean, we need somehow propagate it to the actual method. Some methods already takes Reader, so it will simple for them. Maybe we just need to start from only those methods and add shortcuts only when them explicitly will requested.

Actually, I already though about storing Decoder in the attribute itself (but because currently Attribute is a struct with public fields it will be a breaking change and I not very like the idea of making that new decoder field public, because it is implementation detail. Maybe in the end we will store already decoded data)

Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached.

Yes, of course we should return error as soon as we found loop or if recursion limit was exceeded.

Should that be two separate errors in your opinion, or one error?

Two different. libxml2 also have two different errors, as you could notice from your link: one is "Detected an entity reference loop", other is "Maximum entity nesting depth exceeded"

src/escapei.rs Show resolved Hide resolved
src/escapei.rs Show resolved Hide resolved
src/escapei.rs Show resolved Hide resolved
src/escapei.rs Show resolved Hide resolved
Changelog.md Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Attribute values aren't normalized properly
3 participants