Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API for escaping characters " \ #167

Open
HookedBehemoth opened this issue Sep 10, 2022 · 9 comments
Open

API for escaping characters " \ #167

HookedBehemoth opened this issue Sep 10, 2022 · 9 comments

Comments

@HookedBehemoth
Copy link

I'm using this library to convert utf16le to utf8 and place it inside a JSON string.
Currently this library has no way of escaping the special characters " and \ which I'd need.
Is there any interest in getting this upstream?

@lemire
Copy link
Member

lemire commented Sep 11, 2022

It is a reasonable request.

@HookedBehemoth
Copy link
Author

HookedBehemoth commented Sep 11, 2022

So far I only fleshed out this for utf16 to utf8. Do you think there a branchless way for this?

const __m128i escape_bs = _mm_set1_epi8('\\');
const __m128i escape_ap = _mm_set1_epi8('\"');
const __m128i escape_mask = _mm_or_si128(
    _mm_cmpeq_epi8(utf8_packed, escape_bs),
    _mm_cmpeq_epi8(utf8_packed, escape_ap)
);

int mask = _mm_movemask_epi8(escape_mask);
if (simdutf_unlikely(mask)) {
    for (size_t i = 0; i < 16; ++i) {
        if ((mask >> i) & 1) {
            *utf8_output++ = '\\';
        }
        *utf8_output++ = char(*buf++);
    }
    continue;
}

It causes quite the big performance drop for english (~84%) and esperanto (~50%).
before.txt
after.txt

@lemire
Copy link
Member

lemire commented Sep 11, 2022

Thanks for the draft. @clausecker : do you want to take a look? It is a fun and practical problem.

@clausecker
Copy link
Collaborator

@HookedBehemoth Sure, I can give it a try. What SIMD extension are you most interested in? I assume you are also looking to have newlines and NUL characters escaped?

@HookedBehemoth
Copy link
Author

HookedBehemoth commented Sep 12, 2022

I only have an AVX2 capable CPU so that's that.
While the json spec wants all of " \ / \b \f \n \r \t, I'd really only need the first two. The others either aren't used, exist anymore or are tolerated by my target deserializer (Chromium/CEF).
I also don't really expect NUL characters.

@clausecker
Copy link
Collaborator

@HookedBehemoth Well these are the characters that can be escaped with a named escape sequence. You can actually escape almost every characters you like using the \uXXXX escape sequence. My question is because JSON strings with embedded newlines cause problems with some parsers as the underlying language (Javascript) does not support newlines in strings as far as I know. So it's a good idea to escape them. NUL bytes can cause a problem because C doesn't support them in strings. However, \0 does not seem to be part of the spec, so not sure.

Doing this in AVX2 is a bit tricky due to a lack of compress/expand, but I think it might be doable. It would be independent of character set conversion though; you'd just do it as an extra step afterwards if you need it.

@lemire
Copy link
Member

lemire commented Sep 12, 2022

I think Robert is correct. The JSON specification requires that a whole range of characters be escaped (it is a requirement):

All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

@HookedBehemoth
Copy link
Author

Ah I see now.
Doing it in an extra step wouldn't be desirable for me I think. I don't currently have any spare memory and checking after every X amount of characters would slow it down too much I think.

@HookedBehemoth
Copy link
Author

HookedBehemoth commented Sep 14, 2022

I went with the array approach rapidjson uses. https://github.com/Tencent/rapidjson/blob/06d58b9e848c650114556a23294d0b6440078c61/include/rapidjson/writer.h#L380-L388
Unsurprisingly, this is very slow, taking up ~61% of the execution time, according to AMDuProf.

Firefox' SpiderMonkey & rapidjson use such arrays
Chromium's v8 & serentiy's LibJS use a switch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants