Allow `` values #496

sashka · 2022-10-26T20:42:26Z

Do not return error on parsing preperly escaped zero values, which are perfectly valid. Add simple test to check it.

Do not return error on parsing preperly escaped zero values. Add simple test to check it.

dralley · 2022-10-26T21:06:13Z

I'm not sure this is true

http://lists.xml.org/archives/xml-dev/201211/msg00014.html
http://lists.xml.org/archives/xml-dev/201211/msg00012.html

codecov-commenter · 2022-10-26T21:15:52Z

Codecov Report

Merging #496 (aa932fd) into master (0c6eeb0) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
- Coverage   61.70%   61.67%   -0.04%     
==========================================
  Files          33       33              
  Lines       15343    15337       -6     
==========================================
- Hits         9468     9459       -9     
- Misses       5875     5878       +3

Flag	Coverage Δ
unittests	`61.67% <100.00%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/escapei.rs	`12.40% <100.00%> (-0.14%)`	⬇️
src/de/mod.rs	`66.95% <0.00%> (-0.29%)`	⬇️
src/lib.rs	`13.17% <0.00%> (-0.08%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

sashka · 2022-10-26T21:25:36Z

According to https://www.w3.org/TR/REC-xml/ the value of  has never stated invalid.
Got it: in TR XML 1.1 there's a statement "#x0 is still forbidden both directly and as a character reference."

I'll close this PR and will vendor a fork for internal needs then.

dralley · 2022-10-26T21:41:34Z

Do your internal needs require handling a lot of documents with that issue? Can you speak about the source & purpose of said documents? The "API concerns" I have to imagine are related to C-strings (with null termination) and not strictly relevant to Rust. However, there would be nothing preventing Rust from reading such documents, even though we shouldn't allow them to be output.

(That is not to say that quick-xml should or will support them, but it could be discussed. I don't know what other XML libraries are doing but I imagine Java and C# ones may not always implement the checks since they use UTF-16 encodings)

sashka · 2022-10-26T22:33:22Z

There are lots of huge CPU hardware-related documents (dozens of gigabytes), and there are properly escaped nulls. Minor example of the same kind is Apple's keyboard layout XML.

In both cases I have to read those XMLs exclusively by Rust, and I never write XML, so in my practice I would never face an error on attempt to write NULL.

On the other hand, I may put read-write for NULL under a feature flag to allow this use case when a user (like me or my engineers) sure they needs that.

dralley · 2022-10-27T03:19:27Z

@Mingun (and @tafia), what are your thoughts about allowing nominally specification-breaking, but widespread behavior like this?

Is this something that might be acceptable under a feature flag?

Mingun · 2022-10-27T04:27:46Z

Yes, I actually does not understand, why XML forbid character references to some characters in the first place, and I'm glad to allow them. This ability under a feature flag would be welcome.

@sashka, can you add a properly documented feature flag (say, allow-restricted-chars) and add tests for all restricted characters (as defined in the rule RestrictedChar in https://www.w3.org/TR/xml11/#charsets) + ? If we actually already allow some of that ranges (I looks like we allow), then I think we should forbid they usage without a feature flag to be standard compatible. Then rename EntityWithNull to RestrictedChar, add a doc reference to the standard. We use XML 1.1

dralley · 2022-10-27T04:46:21Z

Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.

It looks like nokogiri may have at one point just stripped these characters out instead of raising an error. However, attempting to write XML with those characters will raise a fatal error.

On the other hand it looks like both Python's ElementTree and libxml2 raise a hard error even while parsing

In [24]: xml = """<?xml version="1.0"?>
    ...: <data>
    ...:     <country>foobar &#0;</country>
    ...: </data>
    ...: """

In [25]: libxml2.parseDoc(xml)
Entity: line 3: parser error : xmlParseCharRef: invalid xmlChar value 0
    <country>foobar &#0;</country>
                        ^

In [3]: ET.fromstring(xml)
Traceback (most recent call last):

  Input In [3] in <cell line: 1>
    ET.fromstring(xml)

ParseError: reference to invalid character number: line 3, column 20

I believe Python's XML parsing is based on Expat, so between that and libxml2, that covers 2 of the most-used XML libraries.

Are the NUL characters in these documents "load-bearing"? Would they be unsafe to strip out?

Mingun · 2022-10-27T17:12:08Z

The following behaviors are possible:

emit hard error (= stop parsing / writing and return error)
silently skip invalid characters (or at least log them, but we currently does not depend on any logger)
accept invalid characters

Each behavior could be applied to a raw or an escaped form of characters.

I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer API and we can add an API in serde Serializer for that.

The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.

I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.

The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of , but not raw \0.

There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.

Any thoughts?

Mingun · 2022-10-27T17:19:07Z

Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.

Actually, you've already open it -- #368.

sashka · 2022-10-27T17:40:36Z

Surprisingly, the only restricted character which is actually restricted is the escaped NULL:

fn test_restricted_chars_unescape() {
    assert_eq!(unescape("\u{0}").unwrap(), "\u{0}");
    // assert_eq!(unescape("&#x0;").unwrap(), "\u{0}"); // <-- this assert will fail with version 0.26.0

    assert_eq!(unescape("\u{1}").unwrap(), "\u{1}");
    assert_eq!(unescape("&#x1;").unwrap(), "\u{1}");
    assert_eq!(unescape("\u{2}").unwrap(), "\u{2}");
    assert_eq!(unescape("&#x2;").unwrap(), "\u{2}");
    assert_eq!(unescape("\u{3}").unwrap(), "\u{3}");
    assert_eq!(unescape("&#x3;").unwrap(), "\u{3}");
    assert_eq!(unescape("\u{4}").unwrap(), "\u{4}");
    assert_eq!(unescape("&#x4;").unwrap(), "\u{4}");
    assert_eq!(unescape("\u{5}").unwrap(), "\u{5}");
    assert_eq!(unescape("&#x5;").unwrap(), "\u{5}");
    assert_eq!(unescape("\u{6}").unwrap(), "\u{6}");
    assert_eq!(unescape("&#x6;").unwrap(), "\u{6}");
    assert_eq!(unescape("\u{7}").unwrap(), "\u{7}");
    assert_eq!(unescape("&#x7;").unwrap(), "\u{7}");
    assert_eq!(unescape("\u{8}").unwrap(), "\u{8}");
    assert_eq!(unescape("&#x8;").unwrap(), "\u{8}");

    assert_eq!(unescape("\u{b}").unwrap(), "\u{b}");
    assert_eq!(unescape("&#xB;").unwrap(), "\u{b}");
    assert_eq!(unescape("\u{c}").unwrap(), "\u{c}");
    assert_eq!(unescape("&#xC;").unwrap(), "\u{c}");

    assert_eq!(unescape("\u{e}").unwrap(), "\u{e}");
    assert_eq!(unescape("&#xE;").unwrap(), "\u{e}");
    assert_eq!(unescape("\u{f}").unwrap(), "\u{f}");
    assert_eq!(unescape("&#xF;").unwrap(), "\u{f}");
    assert_eq!(unescape("\u{10}").unwrap(), "\u{10}");
    assert_eq!(unescape("&#x10;").unwrap(), "\u{10}");
    assert_eq!(unescape("\u{11}").unwrap(), "\u{11}");
    assert_eq!(unescape("&#x11;").unwrap(), "\u{11}");
    assert_eq!(unescape("\u{12}").unwrap(), "\u{12}");
    assert_eq!(unescape("&#x12;").unwrap(), "\u{12}");
    assert_eq!(unescape("\u{13}").unwrap(), "\u{13}");
    assert_eq!(unescape("&#x13;").unwrap(), "\u{13}");
    assert_eq!(unescape("\u{14}").unwrap(), "\u{14}");
    assert_eq!(unescape("&#x14;").unwrap(), "\u{14}");
    assert_eq!(unescape("\u{15}").unwrap(), "\u{15}");
    assert_eq!(unescape("&#x15;").unwrap(), "\u{15}");
    assert_eq!(unescape("\u{16}").unwrap(), "\u{16}");
    assert_eq!(unescape("&#x16;").unwrap(), "\u{16}");
    assert_eq!(unescape("\u{17}").unwrap(), "\u{17}");
    assert_eq!(unescape("&#x17;").unwrap(), "\u{17}");
    assert_eq!(unescape("\u{18}").unwrap(), "\u{18}");
    assert_eq!(unescape("&#x18;").unwrap(), "\u{18}");
    assert_eq!(unescape("\u{19}").unwrap(), "\u{19}");
    assert_eq!(unescape("&#x19;").unwrap(), "\u{19}");
    assert_eq!(unescape("\u{1a}").unwrap(), "\u{1a}");
    assert_eq!(unescape("&#x1A;").unwrap(), "\u{1a}");
    assert_eq!(unescape("\u{1b}").unwrap(), "\u{1b}");
    assert_eq!(unescape("&#x1B;").unwrap(), "\u{1b}");
    assert_eq!(unescape("\u{1c}").unwrap(), "\u{1c}");
    assert_eq!(unescape("&#x1C;").unwrap(), "\u{1c}");
    assert_eq!(unescape("\u{1d}").unwrap(), "\u{1d}");
    assert_eq!(unescape("&#x1D;").unwrap(), "\u{1d}");
    assert_eq!(unescape("\u{1e}").unwrap(), "\u{1e}");
    assert_eq!(unescape("&#x1E;").unwrap(), "\u{1e}");
    assert_eq!(unescape("\u{1f}").unwrap(), "\u{1f}");
    assert_eq!(unescape("&#x1F;").unwrap(), "\u{1f}");

    assert_eq!(unescape("\u{7f}").unwrap(), "\u{7f}");
    assert_eq!(unescape("&#x7F;").unwrap(), "\u{7f}");
    assert_eq!(unescape("\u{80}").unwrap(), "\u{80}");
    assert_eq!(unescape("&#x80;").unwrap(), "\u{80}");
    assert_eq!(unescape("\u{81}").unwrap(), "\u{81}");
    assert_eq!(unescape("&#x81;").unwrap(), "\u{81}");
    assert_eq!(unescape("\u{82}").unwrap(), "\u{82}");
    assert_eq!(unescape("&#x82;").unwrap(), "\u{82}");
    assert_eq!(unescape("\u{83}").unwrap(), "\u{83}");
    assert_eq!(unescape("&#x83;").unwrap(), "\u{83}");
    assert_eq!(unescape("\u{84}").unwrap(), "\u{84}");
    assert_eq!(unescape("&#x84;").unwrap(), "\u{84}");

    assert_eq!(unescape("\u{86}").unwrap(), "\u{86}");
    assert_eq!(unescape("&#x86;").unwrap(), "\u{86}");
    assert_eq!(unescape("\u{87}").unwrap(), "\u{87}");
    assert_eq!(unescape("&#x87;").unwrap(), "\u{87}");
    assert_eq!(unescape("\u{88}").unwrap(), "\u{88}");
    assert_eq!(unescape("&#x88;").unwrap(), "\u{88}");
    assert_eq!(unescape("\u{89}").unwrap(), "\u{89}");
    assert_eq!(unescape("&#x89;").unwrap(), "\u{89}");
    assert_eq!(unescape("\u{8a}").unwrap(), "\u{8a}");
    assert_eq!(unescape("&#x8A;").unwrap(), "\u{8a}");
    assert_eq!(unescape("\u{8b}").unwrap(), "\u{8b}");
    assert_eq!(unescape("&#x8B;").unwrap(), "\u{8b}");
    assert_eq!(unescape("\u{8c}").unwrap(), "\u{8c}");
    assert_eq!(unescape("&#x8C;").unwrap(), "\u{8c}");
    assert_eq!(unescape("\u{8d}").unwrap(), "\u{8d}");
    assert_eq!(unescape("&#x8D;").unwrap(), "\u{8d}");
    assert_eq!(unescape("\u{8e}").unwrap(), "\u{8e}");
    assert_eq!(unescape("&#x8E;").unwrap(), "\u{8e}");
    assert_eq!(unescape("\u{8f}").unwrap(), "\u{8f}");
    assert_eq!(unescape("&#x8F;").unwrap(), "\u{8f}");
    assert_eq!(unescape("\u{90}").unwrap(), "\u{90}");
    assert_eq!(unescape("&#x90;").unwrap(), "\u{90}");
    assert_eq!(unescape("\u{91}").unwrap(), "\u{91}");
    assert_eq!(unescape("&#x91;").unwrap(), "\u{91}");
    assert_eq!(unescape("\u{92}").unwrap(), "\u{92}");
    assert_eq!(unescape("&#x92;").unwrap(), "\u{92}");
    assert_eq!(unescape("\u{93}").unwrap(), "\u{93}");
    assert_eq!(unescape("&#x93;").unwrap(), "\u{93}");
    assert_eq!(unescape("\u{94}").unwrap(), "\u{94}");
    assert_eq!(unescape("&#x94;").unwrap(), "\u{94}");
    assert_eq!(unescape("\u{95}").unwrap(), "\u{95}");
    assert_eq!(unescape("&#x95;").unwrap(), "\u{95}");
    assert_eq!(unescape("\u{96}").unwrap(), "\u{96}");
    assert_eq!(unescape("&#x96;").unwrap(), "\u{96}");
    assert_eq!(unescape("\u{97}").unwrap(), "\u{97}");
    assert_eq!(unescape("&#x97;").unwrap(), "\u{97}");
    assert_eq!(unescape("\u{98}").unwrap(), "\u{98}");
    assert_eq!(unescape("&#x98;").unwrap(), "\u{98}");
    assert_eq!(unescape("\u{99}").unwrap(), "\u{99}");
    assert_eq!(unescape("&#x99;").unwrap(), "\u{99}");
    assert_eq!(unescape("\u{9a}").unwrap(), "\u{9a}");
    assert_eq!(unescape("&#x9A;").unwrap(), "\u{9a}");
    assert_eq!(unescape("\u{9b}").unwrap(), "\u{9b}");
    assert_eq!(unescape("&#x9B;").unwrap(), "\u{9b}");
    assert_eq!(unescape("\u{9c}").unwrap(), "\u{9c}");
    assert_eq!(unescape("&#x9C;").unwrap(), "\u{9c}");
    assert_eq!(unescape("\u{9d}").unwrap(), "\u{9d}");
    assert_eq!(unescape("&#x9D;").unwrap(), "\u{9d}");
    assert_eq!(unescape("\u{9e}").unwrap(), "\u{9e}");
    assert_eq!(unescape("&#x9E;").unwrap(), "\u{9e}");
    assert_eq!(unescape("\u{9f}").unwrap(), "\u{9f}");
    assert_eq!(unescape("&#x9F;").unwrap(), "\u{9f}");
}

dralley · 2022-10-27T18:27:01Z

It appears that C#'s dotnet XML library has a flag for this: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreadersettings.checkcharacters?view=net-6.0

Same for writing: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlwritersettings.checkcharacters?view=net-6.0

Surprisingly, the only restricted character which is actually restricted is the escaped NULL:

I'm not surprised given I filed #368. It's something we needed to improve anyways. Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.

sashka · 2022-10-27T22:04:05Z

I'll keep all the work for allow-restricted-chars right there: https://github.com/sashka/quick-xml/tree/allow-restricted-chars

Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.

No, vendors don't care about general XML validity as soon as their XML files are not for public distribution, and their own tools process their own files perfectly.

dralley · 2022-10-28T05:54:13Z

The following behaviors are possible:

emit hard error (= stop parsing / writing and return error)

silently skip invalid characters (or at least log them, but we currently does not depend on any logger)

accept invalid characters

Each behavior could be applied to a raw or an escaped form of characters.

I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer API and we can add an API in serde Serializer for that.

Agree that we shouldn't skip over invalid data. Seems like a terrible approach.

The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.

I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.

The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of �, but not raw \0.

There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.

Any thoughts?

An initial conservative approach could be to just provide an advanced version of the various escape / unescape functions which would allow you to configure the behavior either way at runtime. I think I would prefer that approach over a feature flag.

Agreed that we're still missing the checks for these characters in the raw form and that we ought to have them.

Reading through the MS docs again, it appears that CheckCharacters applies only to character entities. Names and raw text still need to be completely valid and contain no illegal characters no matter what - and it's checked always. @sashka I assume all of the illegal characters are at least encoded as character references and not present in raw form? If that's the case then copying that approach would be sufficient. And as the configuration would only apply to (un)escaping, there wouldn't be issues with needing a pseudo-global configuration.

dralley · 2022-11-09T00:35:29Z

Does that sound right @sashka ?

sashka · 2022-11-10T16:36:08Z

@dralley, yes, it sounds right for me. I'm working on proper reading mechanics now.

dralley · 2023-07-10T20:55:00Z

@sashka Do you plan to continue with this?

Allow  values

aa932fd

Do not return error on parsing preperly escaped zero values. Add simple test to check it.

sashka closed this Oct 26, 2022

Mingun reopened this Oct 27, 2022

Mingun added the enhancement label Oct 27, 2022

dralley marked this pull request as draft July 10, 2023 20:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow `` values #496

Allow `` values #496

sashka commented Oct 26, 2022

dralley commented Oct 26, 2022

codecov-commenter commented Oct 26, 2022

sashka commented Oct 26, 2022 •

edited

dralley commented Oct 26, 2022 •

edited

sashka commented Oct 26, 2022 •

edited

dralley commented Oct 27, 2022

Mingun commented Oct 27, 2022

dralley commented Oct 27, 2022 •

edited

Mingun commented Oct 27, 2022 •

edited

Mingun commented Oct 27, 2022

sashka commented Oct 27, 2022 •

edited by Mingun

dralley commented Oct 27, 2022

sashka commented Oct 27, 2022 •

edited by dralley

dralley commented Oct 28, 2022 •

edited

dralley commented Nov 9, 2022

sashka commented Nov 10, 2022

dralley commented Jul 10, 2023

Allow &#0; values #496

Are you sure you want to change the base?

Allow &#0; values #496

Conversation

sashka commented Oct 26, 2022

dralley commented Oct 26, 2022

codecov-commenter commented Oct 26, 2022

Codecov Report

sashka commented Oct 26, 2022 • edited

dralley commented Oct 26, 2022 • edited

sashka commented Oct 26, 2022 • edited

dralley commented Oct 27, 2022

Mingun commented Oct 27, 2022

dralley commented Oct 27, 2022 • edited

Mingun commented Oct 27, 2022 • edited

Mingun commented Oct 27, 2022

sashka commented Oct 27, 2022 • edited by Mingun

dralley commented Oct 27, 2022

sashka commented Oct 27, 2022 • edited by dralley

dralley commented Oct 28, 2022 • edited

dralley commented Nov 9, 2022

sashka commented Nov 10, 2022

dralley commented Jul 10, 2023

Allow `` values #496

Allow `` values #496

sashka commented Oct 26, 2022 •

edited

dralley commented Oct 26, 2022 •

edited

sashka commented Oct 26, 2022 •

edited

dralley commented Oct 27, 2022 •

edited

Mingun commented Oct 27, 2022 •

edited

sashka commented Oct 27, 2022 •

edited by Mingun

sashka commented Oct 27, 2022 •

edited by dralley

dralley commented Oct 28, 2022 •

edited