Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to set desired level of escaping special characters in XML #685

Merged
merged 3 commits into from Nov 25, 2023

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Nov 22, 2023

Fixes #362

@Mingun Mingun requested a review from dralley November 22, 2023 16:56
@Mingun Mingun added enhancement serde Issues related to mapping from Rust types to XML labels Nov 22, 2023
/// | Character | Replacement
/// |-----------|------------
/// | `<` | `&lt;`
/// | `&` | `&amp;`
Copy link
Collaborator

@dralley dralley Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, IIRC, the reason > is often escaped despite the fact that the spec doesn't specifically require it is a particular special case:

The right angle bracket (>) may be represented using the string " > ", and MUST, for compatibility, be escaped using either " > " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".

...

CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

https://www.w3.org/TR/2008/REC-xml-20081126/#syntax

Escaping > even when not strictly required seems preferable to implementing it exactly as described, from both from a complexity and performance perspective. Searching for a sequence of characters (]]>) in every bit of text to be escaped would be expensive compared to just searching for the single byte. (I suppose you could search for > and then check the previous characters, but it still introduces complication).

It is possible that it is such a niche situation as to be not all that important in practice, meaning that a simple and non-rigorous implementation of minimal escaping would be fine 99.999% of the time.

I am not sure I would make it the default though. Personally I'd rather the default be partial or complete escaping.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing I just realized, looking at some examples on StackOverflow from this post:

https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents

We currently re-use BytesText for event types such as Comment and PI

    Text(BytesText<'a>),
    /// Unescaped character data stored in `<![CDATA[...]]>`.
    CData(BytesCData<'a>),
    /// Comment `<!-- ... -->`.
    Comment(BytesText<'a>),
    /// XML declaration `<?xml ...?>`.
    Decl(BytesDecl<'a>),
    /// Processing instruction `<?...?>`.
    PI(BytesText<'a>),
    /// Document type definition data (DTD) stored in `<!DOCTYPE ...>`.
    DocType(BytesText<'a>),

But the special characters are intended to not be escaped in those contexts. That may be a UX problem for our users, if they use the default constructors of BytesText which do perform escaping.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I plan to introduce other types for events that does not require escaping.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think re: first comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had an impression, that Java XML library that was part of SDK, uses minimal escape rules, but now I checked -- it uses partial escape rules (&, < and > are escaped only). I'll change default to that

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I'm aware, libxml2 always escapes > as well.

@codecov-commenter
Copy link

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (9fb181a) 65.31% compared to head (d9de2d8) 65.47%.

Files Patch % Lines
src/events/mod.rs 0.00% 7 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #685      +/-   ##
==========================================
+ Coverage   65.31%   65.47%   +0.16%     
==========================================
  Files          38       38              
  Lines       17920    18025     +105     
==========================================
+ Hits        11704    11802      +98     
- Misses       6216     6223       +7     
Flag Coverage Δ
unittests 65.47% <93.63%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@dralley dralley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably ought to at least document that if a user selects QuoteLevel::Minimal they will need to avoid having the sequence ]]> in any of their provided values, or else they will generate invalid XML. Otherwise it will be a rare but dangerous footgun.

@Mingun
Copy link
Collaborator Author

Mingun commented Nov 25, 2023

]]> is allowed everywhere except CDATA content. We already have a warning about that.

@Mingun Mingun merged commit c383849 into tafia:master Nov 25, 2023
6 checks passed
@Mingun Mingun deleted the quoting-level branch November 25, 2023 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement serde Issues related to mapping from Rust types to XML
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to ignore double quotation mark when do serialization?
3 participants