New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API for escaping characters " \ #167
Comments
It is a reasonable request. |
So far I only fleshed out this for utf16 to utf8. Do you think there a branchless way for this? const __m128i escape_bs = _mm_set1_epi8('\\');
const __m128i escape_ap = _mm_set1_epi8('\"');
const __m128i escape_mask = _mm_or_si128(
_mm_cmpeq_epi8(utf8_packed, escape_bs),
_mm_cmpeq_epi8(utf8_packed, escape_ap)
);
int mask = _mm_movemask_epi8(escape_mask);
if (simdutf_unlikely(mask)) {
for (size_t i = 0; i < 16; ++i) {
if ((mask >> i) & 1) {
*utf8_output++ = '\\';
}
*utf8_output++ = char(*buf++);
}
continue;
} It causes quite the big performance drop for english (~84%) and esperanto (~50%). |
Thanks for the draft. @clausecker : do you want to take a look? It is a fun and practical problem. |
@HookedBehemoth Sure, I can give it a try. What SIMD extension are you most interested in? I assume you are also looking to have newlines and NUL characters escaped? |
I only have an AVX2 capable CPU so that's that. |
@HookedBehemoth Well these are the characters that can be escaped with a named escape sequence. You can actually escape almost every characters you like using the Doing this in AVX2 is a bit tricky due to a lack of compress/expand, but I think it might be doable. It would be independent of character set conversion though; you'd just do it as an extra step afterwards if you need it. |
I think Robert is correct. The JSON specification requires that a whole range of characters be escaped (it is a requirement): All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F). |
Ah I see now. |
I went with the array approach rapidjson uses. https://github.com/Tencent/rapidjson/blob/06d58b9e848c650114556a23294d0b6440078c61/include/rapidjson/writer.h#L380-L388 Firefox' SpiderMonkey & rapidjson use such arrays |
I'm using this library to convert utf16le to utf8 and place it inside a JSON string.
Currently this library has no way of escaping the special characters
"
and\
which I'd need.Is there any interest in getting this upstream?
The text was updated successfully, but these errors were encountered: