How would I parse character references as literal bytes and not codepoints? #667

Open
Dekkonot opened this issue Oct 16, 2023 · 3 comments


@Dekkonot

I have an element like this:

<element>&#240;&#159;&#152;&#131;</element>

If those characters are literally interpreted, they should be the byte sequence f0 9f 98 83, which should be U+1F603, or 😃. Instead, it expands to c3 b0 c2 9f c2 98 c2 83 (this sequence is not printable, but you may inspect it here).
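To make the difference concrete, here is a minimal sketch using only the Rust standard library. The function names `as_code_points` and `as_raw_bytes` are made up for illustration and are not part of any XML library.

```rust
/// Interpret each number as a Unicode code point, then UTF-8 encode the result.
/// This is what a conforming XML parser does with numeric character references.
fn as_code_points(values: &[u32]) -> Vec<u8> {
    values
        .iter()
        .filter_map(|&v| char::from_u32(v)) // skip values that are not valid code points
        .collect::<String>()
        .into_bytes()
}

/// Interpret each number as a raw byte, then decode the whole sequence as UTF-8.
/// This is what the producer of the file apparently intended.
fn as_raw_bytes(values: &[u8]) -> Option<String> {
    String::from_utf8(values.to_vec()).ok()
}
```

Calling `as_code_points(&[240, 159, 152, 131])` gives the eight bytes c3 b0 c2 9f c2 98 c2 83, while `as_raw_bytes(&[240, 159, 152, 131])` gives the single character U+1F603.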

I am aware that this is exactly how character references are meant to work. Unfortunately, the format of these files wasn't my decision and isn't under my control. So I'd like to know if there's an obvious way to change how escapes are expanded, without having to iterate through the bytes returned by a Text event myself.

@Mingun
Collaborator

Mingun commented Oct 16, 2023

I may be wrong, but it seems that you should use

<element>&#x1F603;</element>

instead. Character references are supposed to refer to Unicode code points directly, not to bytes in some unspecified encoding. A non-normative confirmation of this can be found, for example, here (just the first site from Google): the HTML entity for U+1F603 is &#x1F603;

@Dekkonot
Author

Right, that is what I would do if given the opportunity. Unfortunately the program that generates these doesn't do it right and I'm left trying to parse it correctly.

I'm filing a bug report with them, but there's no telling how long a fix will take, if it ever comes, and in the meantime I still have to parse their files.

@Mingun
Collaborator

Mingun commented Oct 16, 2023

Then it seems that it just writes UTF-8 encoded byte arrays for some characters, and those byte arrays are encoded as lists of character references. You have to decode the string yourself. Get the raw data using .into_inner() (note that these bytes may need to be decoded first using reader.decoder() if you use a non-UTF-8 encoding) and convert it to a string yourself. You will need to copy and modify the implementation of unescape.
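A rough sketch of what such a modified unescape step could look like, using only the standard library. The name `decode_byte_references` is hypothetical, and the input is assumed to be the raw, still-escaped text of the Text event after it has been decoded to a `&str`. This is not the library's own unescape implementation. It handles only numeric references and leaves named entities such as `&amp;` untouched.

```rust
/// Hypothetical helper: expands numeric character references in `raw`,
/// treating values 0..=255 as raw bytes instead of code points, then
/// validates the collected bytes as UTF-8.
fn decode_byte_references(raw: &str) -> Result<String, std::string::FromUtf8Error> {
    let mut out: Vec<u8> = Vec::with_capacity(raw.len());
    let mut rest = raw;

    while let Some(start) = rest.find("&#") {
        // Copy everything before the reference unchanged.
        out.extend_from_slice(rest[..start].as_bytes());
        let after = &rest[start + 2..];

        // Try to parse `NNN;` (decimal) or `xHHH;` (hexadecimal).
        let parsed = after.find(';').and_then(|end| {
            let body = &after[..end];
            let value = match body.strip_prefix('x') {
                Some(hex) => u32::from_str_radix(hex, 16).ok()?,
                None => body.parse::<u32>().ok()?,
            };
            Some((value, end))
        });

        match parsed {
            Some((value, end)) => {
                if value <= 0xFF {
                    // Assumption: small values are bytes of a UTF-8 sequence.
                    out.push(value as u8);
                } else if let Some(c) = char::from_u32(value) {
                    // Larger values are treated as ordinary code points.
                    let mut buf = [0u8; 4];
                    out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
                } else {
                    // Not a valid code point: keep the reference text as written.
                    out.extend_from_slice(rest[start..start + 2 + end + 1].as_bytes());
                }
                rest = &after[end + 1..];
            }
            None => {
                // Not a well-formed reference: keep the `&#` literally and move on.
                out.extend_from_slice(b"&#");
                rest = after;
            }
        }
    }

    out.extend_from_slice(rest.as_bytes());
    String::from_utf8(out)
}
```

With this sketch, `decode_byte_references("&#240;&#159;&#152;&#131;")` yields the single character U+1F603. The trade-off is that a file using references correctly for code points in the range 128 to 255 (for example `&#233;` for é) would be misread or rejected as invalid UTF-8, so this only makes sense for input known to come from the misbehaving producer.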
