Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace all CVTUTF code #567

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

LukeShu
Copy link

@LukeShu LukeShu commented Feb 23, 2024

From 1998 to 2007, the Unicode Consortium maintained a library called CVTUTF. In 2009, CVTUTF was removed from unicode.org, and the Unicode Consortium said that every version of CVTUTF had bugs, and that folks should use the ICU library instead.

CVTUTF was under a custom license that was not Free under the FSF's definition, not Open Source under the OSI's definition, and not GPL-compatible. json/ext uses code taken-from/based-on CVTUTF. This has caused much consternation among folks who care about any of those 3 things.

So, I

  1. did some software-archeology to collect all versions of CVTUTF (https://git.lukeshu.com/2git/cvtutf-make/, https://git.lukeshu.com/2git/cvtutf/), so that I could identify precisely which code in json/ext is based on CVTUTF,
  2. deleted that code from json/ext,
  3. got drunk so I'd forget all of the code that I'd just read, and
  4. wrote some new code to replace the deleted code.

I hope that you'll find my version of convert_UTF8_to_JSON to be clearer and more maintainable.

I have not benchmarked it, but I do not expect a significant performance difference. If I had to guess, I'd suspect that my UTF-8 decoder is slightly slower (I use val & const == const in an if/else chain, where I think CVTUTF used a [256]char lookup table), while my JSON encoder is slightly faster (I suspect that by virtue of being simpler the compiler is better able to optimize it).

Fixes #277

I did this based on manual inspection, comparing the code to my re-created
history of CVTUTF at https://git.lukeshu.com/2git/cvtutf/ (created by the
scripts at https://git.lukeshu.com/2git/cvtutf-make/)
I, Luke T. Shumaker, am the sole author of the added code.

I did not reference CVTUTF when writing it.  I did reference the
Unicode standard (15.0.0), the Wikipedia article on UTF-8, and the
Wikipedia article on UTF-16.  When I saw some tests fail, I did
reference the old deleted code (but a JSON-specific part, inherently
not as based on CVTUTF) to determine that script_safe should also
escape U+2028 and U+2029.

I targeted simplicity and clarity when writing the code--it can likely
be optimized.  In my mind, the obvious next optimization is to have it
combine contiguous non-escaped characters into just one call to
fbuffer_append(), instead of calling fbuffer_append() for each
character.

Regarding the use of the "modern" types `uint32_t`, `uint16_t`, and
`bool`:
 - ruby.h is guaranteed to give us uint32_t and uint16_t.
 - Since Ruby 3.0.0, ruby.h is guaranteed to give us bool... but we
   support down to Ruby 2.3.  But, ruby.h is guaranteed to give us
   HAVE_STDBOOL_H for the C99 stdbool.h; so use that to include
   stdbool.h if we can, and if not then fall back to a copy of the
   same bool definition that Ruby 3.0.5 uses with C89.
@hsbt
Copy link
Collaborator

hsbt commented Mar 12, 2024

@LukeShu Thanks for your proposal. I agree to replace this with current implementation.

@voxik
Copy link

voxik commented Mar 26, 2024

@LukeShu Thanks for your proposal. I agree to replace this with current implementation.

What is the plan here? Is it safe to assume you are going to merge this, therefore it should be fine use this patch in Fedora to avoid legal issues?

@hsbt
Copy link
Collaborator

hsbt commented Mar 26, 2024

I didn't review this yet. Don't rush this.

@LukeShu
Copy link
Author

LukeShu commented Apr 5, 2024

Preliminary benchmarking (using your old generator_benchmark.rb) is showing that my version is faster across the board:

2024-04-05-133545_749x128_scrot

@LukeShu
Copy link
Author

LukeShu commented Apr 6, 2024

I'm not sure what was up with those large "before" numbers, but now being more careful with CPU throttling and background processes and cron jobs, I'm seeing much smaller deltas (generally 1-3µs/op or 1-7% improvement; with the exception that generator2_benchmark.rb's ASCII test got a big win). I definitely didn't make things slower.

I am seeing some higher standard deviations though.

2024-04-05-210638_1199x174_scrot

@LukeShu
Copy link
Author

LukeShu commented May 15, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Please update Unicode license blurb
3 participants