How would I parse character references as literal bytes and not codepoints? #667

Dekkonot · 2023-10-16T04:03:38Z

I have an element like this:

<element>&#240;&#159;&#152;&#131;</element>

If those character references are interpreted as literal bytes, they form the byte sequence f0 9f 98 83, which is the UTF-8 encoding of U+1F603, or 😃. Instead, the element expands to c3 b0 c2 9f c2 98 c2 83 (this sequence is not printable, but you may inspect it here).
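To make the difference concrete, here is a minimal standard-library sketch of the two interpretations of the numbers 240, 159, 152, 131. The function names `as_code_points` and `as_utf8_bytes` are made up for illustration and are not part of any library.

```rust
/// Interprets each number as a Unicode code point (what an XML parser does
/// with a character reference) and returns the UTF-8 bytes of the result.
fn as_code_points(values: &[u32]) -> Vec<u8> {
    values
        .iter()
        .filter_map(|&v| char::from_u32(v)) // skips values that are not valid scalar values
        .collect::<String>()
        .into_bytes()
}

/// Interprets each number as one raw byte and decodes the whole run as UTF-8.
/// Returns `None` if the bytes are not valid UTF-8.
fn as_utf8_bytes(values: &[u8]) -> Option<String> {
    String::from_utf8(values.to_vec()).ok()
}
```

With the values from the element above, `as_code_points(&[240, 159, 152, 131])` gives the eight bytes c3 b0 c2 9f c2 98 c2 83, while `as_utf8_bytes(&[240, 159, 152, 131])` gives `Some("😃")`.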

This is very much how it is meant to work, and I am aware of that. Unfortunately, this decision was not made by me, nor is it under my control. So I'd like to know whether there is an obvious way to change how escapes are expanded, short of iterating through the bytes returned by a Text event myself.


Mingun · 2023-10-16T05:19:53Z

I may be wrong, but it seems that you should use

<element>&#x1F603;</element>

instead. Character references are supposed to refer to Unicode code points directly, not to bytes in some unspecified encoding. A non-normative confirmation of this can be found, for example, here (just the first site from Google): the HTML entity for U+1F603 is 😃.

Dekkonot · 2023-10-16T05:26:18Z

Right, that is what I would do if given the opportunity. Unfortunately the program that generates these doesn't do it right and I'm left trying to parse it correctly.

I'm filing a bug report with them, but it could take however long to get fixed if it ever does and in the meantime I still have to parse their files.

Mingun · 2023-10-16T08:14:05Z

Then it seems that it just writes UTF-8 encoded byte arrays for some characters, and those byte arrays are encoded as lists of character references. You have to decode the string yourself. Get the raw data using .into_inner() (note that these bytes may need to be decoded first using reader.decoder() if you use a non-UTF-8 encoding) and convert it to a string yourself. You will need to copy and modify the implementation of unescape.
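A rough sketch of what such a modified unescape could look like, using only the standard library. The function name `unescape_refs_as_bytes` is hypothetical and this is not the library's own implementation. It assumes the raw text is already UTF-8, handles only the five predefined XML entities, and treats every numeric reference up to 255 as a single raw byte; larger values are still treated as code points.

```rust
/// Hypothetical helper: expands references in `raw`, treating numeric
/// references with a value of 255 or less as raw bytes, not code points.
fn unescape_refs_as_bytes(raw: &[u8]) -> Result<String, String> {
    let mut out = Vec::with_capacity(raw.len());
    let mut i = 0;
    while i < raw.len() {
        if raw[i] != b'&' {
            // Ordinary byte: copy through unchanged.
            out.push(raw[i]);
            i += 1;
            continue;
        }
        // Find the ';' that terminates this reference.
        let end = raw[i..]
            .iter()
            .position(|&b| b == b';')
            .map(|p| i + p)
            .ok_or_else(|| format!("unterminated reference at byte {}", i))?;
        let name = std::str::from_utf8(&raw[i + 1..end]).map_err(|e| e.to_string())?;
        match name {
            // The five predefined XML entities.
            "lt" => out.push(b'<'),
            "gt" => out.push(b'>'),
            "amp" => out.push(b'&'),
            "apos" => out.push(b'\''),
            "quot" => out.push(b'"'),
            _ => {
                let digits = name
                    .strip_prefix('#')
                    .ok_or_else(|| format!("unknown entity &{};", name))?;
                // `&#x..;` is hexadecimal, `&#..;` is decimal.
                let value = match digits.strip_prefix('x') {
                    Some(hex) => u32::from_str_radix(hex, 16),
                    None => digits.parse::<u32>(),
                }
                .map_err(|e| format!("bad reference &{};: {}", name, e))?;
                if value <= 0xFF {
                    // The deviation from standard XML: emit one raw byte.
                    out.push(value as u8);
                } else {
                    // Larger values keep their usual code-point meaning.
                    let c = char::from_u32(value)
                        .ok_or_else(|| format!("invalid code point in &{};", name))?;
                    let mut buf = [0u8; 4];
                    out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
                }
            }
        }
        i = end + 1;
    }
    // The collected bytes must form valid UTF-8 as a whole.
    String::from_utf8(out).map_err(|e| e.to_string())
}
```

Passing the raw bytes of the Text event (for example, the result of .into_inner() as suggested above) to this function would turn `&#240;&#159;&#152;&#131;` into 😃. The trade-off is that a correctly written reference in the range 128 to 255, such as `&#233;` for é, becomes a lone byte and fails UTF-8 validation, so this only suits input known to come from the misbehaving generator.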
