[work-in-progress] Decoding BufReader implementation #441

dralley · 2022-07-25T03:33:39Z

Very, very work-in-progress. Doesn't compile yet. Only posting it for transparency.

codecov-commenter · 2022-08-14T01:31:56Z

Codecov Report

Merging #441 (10038ca) into master (90be3ee) will increase coverage by 0.69%.
The diff coverage is 83.63%.

@@            Coverage Diff             @@
##           master     #441      +/-   ##
==========================================
+ Coverage   52.12%   52.81%   +0.69%     
==========================================
  Files          28       28              
  Lines       13480    13392      -88     
==========================================
+ Hits         7026     7073      +47     
+ Misses       6454     6319     -135

Flag	Coverage Δ
unittests	`52.81% <83.63%> (+0.69%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
benches/microbenches.rs	`0.00% <0.00%> (ø)`
examples/custom_entities.rs	`0.00% <0.00%> (ø)`
examples/nested_readers.rs	`0.00% <0.00%> (ø)`
examples/read_buffered.rs	`0.00% <0.00%> (ø)`
examples/read_texts.rs	`0.00% <0.00%> (ø)`
src/errors.rs	`2.08% <0.00%> (+0.01%)`	⬆️
src/reader/buffered_reader.rs	`69.89% <0.00%> (-6.25%)`	⬇️
src/writer.rs	`48.30% <0.00%> (-0.28%)`	⬇️
src/de/var.rs	`75.00% <11.11%> (-10.11%)`	⬇️
src/de/escape.rs	`19.84% <40.00%> (-1.21%)`	⬇️
... and 26 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Mingun

I like the idea to create a separate test file for encodings, so the commit 2 is good for me. The others lets hold until your finish your work. I'm not sure, that we should remove StartText event, because it still can be useful even if will not contain BOM. The text before the declaration is not considered as a part of XML model, so it should be treated differently, for example, the unescaping of that text has no meaning.

I suggest firstly to add tests for all possible encodings -- because the set of encodings is finite and not so big, we can create a bunch of documents in each encoding and test their parsing. The tests that I would like to see:

detecting encoding (I found that we do not do that correctly, I'll plan to look at this before release, I already made some investigations and improvements, but not yet proved them with tests)
reading events

The XMLs should looks like:

<?xml encoding="<encoding>"?>
<root attribute1=value1
      attribute2="value1"
      attribute3='value2'
>
  <?instruction?>
  <!--Comment-->
  text
  <ns:element ns:attribute="value1" xmlns:ns="namespace"/>
  <[[CDATA[[cdata]]>
</root>

Because almost all supported encodings are one-byte encodings, we could create XML files with all possible symbols (probably except that explicitly forbidden by the XML standard). The national symbols should be presented in all possible places (tag names, namespace names, attribute names and values, text and CDATA content, comments, processing instructions).

Only after that, I believe, we should made any improvements in encoding support.

tests/encodings.rs

dralley · 2022-08-29T01:31:02Z

The basic functionality is pretty much working. TODO:

encoding detection regression for UTF-16 (failing tests)
async support in Utf8BytesReader
UTF-8 validation inside Utf8BytesReader when the encoding feature is turned off
some other cleanup

Async support is the real pain unfortunately.

dralley · 2022-08-29T02:51:59Z

src/reader/mod.rs

 impl<R> Reader<R> {
-    /// Consumes `Reader` returning the underlying reader
-    ///
-    /// Can be used to compute line and column of a parsing error position


I'm thinking we really need a better solution than this. Either by tracking newlines during parsing or by automatically implementing something like this for &str and perhaps File.

I'm not sure about the overhead of the former approach, might be too expensive.

Mingun

Left some preliminary comments

Mingun · 2022-08-29T05:45:45Z

Cargo.toml

@@ -39,7 +39,7 @@ name = "macrobenches"
 harness = false

 [features]
-default = []
+default = ["encoding"]


Yes, I think we could make this default.

Changelog.md

Mingun · 2022-08-29T05:59:26Z

src/reader/mod.rs

@@ -1612,7 +1588,7 @@ mod test {
                /// character should be stripped for consistency
                #[$test]
                $($async)? fn bom_from_reader() {
-                    let mut reader = Reader::from_reader("\u{feff}\u{feff}".as_bytes());
+                    let mut reader = Reader::from_str("\u{feff}\u{feff}");


This test is intentionally uses from_reader (which its name say). Changing that you make it the same as the test just below it. You should either delete this change or the entire test

Then it needs to be removed, because this variant of this method cannot be implemented for SliceReader.

dralley · 2022-09-05T15:08:32Z

Status update: I'm still working on this, but making PRs against https://github.com/BurntSushi/encoding_rs_io/

dralley · 2022-09-14T19:06:56Z

And also hsivonen/encoding_rs#88

I'm not sure how long I'm going to be blocked on that, so if there is a portion of the work you think could be split off and merged, I'd appreciate that. Otherwise I'll probably work on the attribute normalization PR.

dralley · 2022-09-23T02:51:56Z

So, as a practical matter I now think we need to strip the BOM in all situations, no matter what.

Not because there's anything invalid about it at a purely technical level, but because buried deep down on page 157 out of 1054 of the Unicode standard it says that for UTF-16 the BOM "should not be considered part of the text". It actually says nothing about that for UTF-8 and in fact implies the opposite, but that line appears to have been the basis for encoding_rs to remove the BOM from the stream for both UTF-16 and UTF-8.

Realistically I don't think we should cut against the grain of what encoding_rs is doing, it's probably a sane decision for consistency purposes anyway, and we need to get the same output in both the encoding and not-encoding cases - so that means the BOM should never be present.

Mingun · 2022-09-23T05:55:02Z

Do what you think right, just explain decisions in the comments in the code

dralley · 2022-10-26T17:54:45Z

@Mingun I'm just going to rip out any changes to the serde code and do that absolutely last, so I don't need to keep redoing it

It's been a month without any further progress on getting BurntSushi/encoding_rs_io#11 merged, so I'm just going to vendor the code internally for now. It's not that much code and he was very skeptical about merging async support anyway, which we're kind of obligated to support now.

This code wouldn't have worked on any other encoding anyway

When the source of the bytes isn't UTF-8 (or isn't known to be), the bytes need to be decoded first, or at least validated as such. Wrap 'Read'ers with Utf8BytesReader to ensure this happens. Defer the validating portion for now.

This is inside the check! macro, meaning it gets implemented per reader type, but from_reader() doesn't apply to SliceReader

As quick-xml will pre-decode everything, it is unnecessary.

dralley force-pushed the decoding-reader branch from 8b57c07 to b16c52e Compare July 25, 2022 04:34