Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework BOM handling and encoding API #399

Merged
merged 15 commits into from Jun 20, 2022
Merged

Rework BOM handling and encoding API #399

merged 15 commits into from Jun 20, 2022

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Jun 19, 2022

This PR solves several highly-coupled problems, that blocks changing for safe namespace handling (#59). The main reason in that various decode methods accepts a reference to the Reader, but if we move namespace buffer inside the Reader, we could face with borrowing problems (actually, namespace buffer was decoupled from Reader due to this reason in #69).

First, I've fixed #191 by introducing a new StartText event which generated when reader encounters "text" before the XML markup. In well-formed XMLs that text could represent a BOM.

Secondly, I've removed all functions that handle BOM in inappropriate time. For example, there a no sense to handle BOM when you decode attribute values, because you will never get BOM there. For that reason I've moved all BOM handling code to the BytesStartText struct.

Because after that change you can handle BOM only as the first event from the reader, that allowed me to move the automatic BOM-based encoding detection to the reader instead of triggering it by calling BytesText::decode_without_bom (where the name is also incorrect, since the function removes BOM from text and decodes the rest content, so correct one is decode_with_bom_removal).

As encoding API already changed very much, I've also solved the #180 by unify function signature with and without encoding feature enabled.

@Mingun Mingun added bug encoding Issues related to support of various encodings of the XML documents labels Jun 19, 2022
@Mingun Mingun requested a review from dralley June 19, 2022 19:22
@codecov-commenter
Copy link

codecov-commenter commented Jun 19, 2022

Codecov Report

Merging #399 (d49a4b8) into master (3b37c0e) will decrease coverage by 0.42%.
The diff coverage is 75.20%.

@@            Coverage Diff             @@
##           master     #399      +/-   ##
==========================================
- Coverage   61.72%   61.29%   -0.43%     
==========================================
  Files          20       20              
  Lines       10233    10148      -85     
==========================================
- Hits         6316     6220      -96     
- Misses       3917     3928      +11     
Flag Coverage Δ
unittests 61.29% <75.20%> (-0.43%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/de/escape.rs 65.15% <ø> (-1.28%) ⬇️
src/errors.rs 9.52% <0.00%> (-2.85%) ⬇️
src/events/attributes.rs 94.19% <0.00%> (+3.52%) ⬆️
src/lib.rs 21.09% <0.00%> (-4.92%) ⬇️
src/reader.rs 88.38% <79.44%> (-0.02%) ⬇️
src/de/mod.rs 76.14% <80.00%> (+0.87%) ⬆️
src/events/mod.rs 75.57% <80.85%> (+1.23%) ⬆️
src/de/seq.rs 91.83% <100.00%> (-0.76%) ⬇️
src/se/mod.rs 93.81% <100.00%> (-0.01%) ⬇️
src/writer.rs 90.36% <100.00%> (+0.02%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b37c0e...d49a4b8. Read the comment docs.

Changelog.md Outdated Show resolved Hide resolved
Changelog.md Outdated Show resolved Hide resolved
Changelog.md Outdated Show resolved Hide resolved
Mingun and others added 4 commits June 20, 2022 00:53
Because `Decoder` type was private, hardly ever that someone use it
The method `Reader::decoder()` is public anyway, but its result type is not, which means
that it cannot be used as method argument, and that is not good
BOM cannot arise in the attribute values - this is by definition a mark that
can be located only at the begin of the stream and every attribute value is
inside the stream
Holding reference to a reader prevents some usages in the future

Co-authored-by: Daniel Alley <dalley@redhat.com>
Copy link
Collaborator

@dralley dralley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do a great job with testing, documentation, splitting up commits and such - I can't find too much to complain about :)

All the changes here look sensible

src/reader.rs Outdated Show resolved Hide resolved
@Mingun
Copy link
Collaborator Author

Mingun commented Jun 20, 2022

@dralley, I tried to reword some sentences in changelog as best as I can, and also add a couple of comments to the TagState. Could you take a look at whether is all ok?

Changelog.md Outdated Show resolved Hide resolved
Changelog.md Outdated Show resolved Hide resolved
@dralley
Copy link
Collaborator

dralley commented Jun 20, 2022

I left some suggestions

@Mingun
Copy link
Collaborator Author

Mingun commented Jun 20, 2022

Thanks! In the final version I've updated benchmarks (they has been broken since #393) and applied you suggestions. Merging now

@Mingun Mingun merged commit 8fa6f1e into tafia:master Jun 20, 2022
@Mingun Mingun deleted the bom branch June 20, 2022 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug encoding Issues related to support of various encodings of the XML documents
Projects
None yet
3 participants