New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to set desired level of escaping special characters in XML #685
Conversation
/// | Character | Replacement | ||
/// |-----------|------------ | ||
/// | `<` | `<` | ||
/// | `&` | `&` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, IIRC, the reason >
is often escaped despite the fact that the spec doesn't specifically require it is a particular special case:
The right angle bracket (>) may be represented using the string " > ", and MUST, for compatibility, be escaped using either " > " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, " ]]> ". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, " ]]> ".
...
CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
https://www.w3.org/TR/2008/REC-xml-20081126/#syntax
Escaping >
even when not strictly required seems preferable to implementing it exactly as described, from both from a complexity and performance perspective. Searching for a sequence of characters (]]>
) in every bit of text to be escaped would be expensive compared to just searching for the single byte. (I suppose you could search for >
and then check the previous characters, but it still introduces complication).
It is possible that it is such a niche situation as to be not all that important in practice, meaning that a simple and non-rigorous implementation of minimal escaping would be fine 99.999% of the time.
I am not sure I would make it the default though. Personally I'd rather the default be partial or complete escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other thing I just realized, looking at some examples on StackOverflow from this post:
https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents
We currently re-use BytesText
for event types such as Comment
and PI
Text(BytesText<'a>),
/// Unescaped character data stored in `<![CDATA[...]]>`.
CData(BytesCData<'a>),
/// Comment `<!-- ... -->`.
Comment(BytesText<'a>),
/// XML declaration `<?xml ...?>`.
Decl(BytesDecl<'a>),
/// Processing instruction `<?...?>`.
PI(BytesText<'a>),
/// Document type definition data (DTD) stored in `<!DOCTYPE ...>`.
DocType(BytesText<'a>),
But the special characters are intended to not be escaped in those contexts. That may be a UX problem for our users, if they use the default constructors of BytesText
which do perform escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I plan to introduce other types for events that does not require escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think re: first comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had an impression, that Java XML library that was part of SDK, uses minimal escape rules, but now I checked -- it uses partial escape rules (&
, <
and >
are escaped only). I'll change default to that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I'm aware, libxml2
always escapes >
as well.
1fe8a15
to
d9de2d8
Compare
Codecov ReportAttention:
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #685 +/- ##
==========================================
+ Coverage 65.31% 65.47% +0.16%
==========================================
Files 38 38
Lines 17920 18025 +105
==========================================
+ Hits 11704 11802 +98
- Misses 6216 6223 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably ought to at least document that if a user selects QuoteLevel::Minimal
they will need to avoid having the sequence ]]>
in any of their provided values, or else they will generate invalid XML. Otherwise it will be a rare but dangerous footgun.
|
Fixes #362