explore ways to speed up HTML5 document serialization #2569

Closed
flavorjones opened this issue Jun 5, 2022 · 0 comments · Fixed by #2596

Please describe the issue

Originally rubys/nokogumbo#145, please read that issue for an in-depth conversation.

For HTML5 documents, #to_s is much slower than native libxml2 serialization. However, libxml2 gets many things wrong, which is why Nokogumbo (and now Nokogiri::HTML5) implemented HTML5.serialize_node_internal.

Some likely next steps:

  • profile the current method to get a sense for how much slower it is (which is an input into the decision of when to default to HTML5)
  • see if we can improve the existing Ruby implementation
  • potentially re-implement the method in C if that would be compellingly faster
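
As a starting point for the profiling step, here is a minimal sketch of a throughput measurement that uses only the Ruby standard library. The helper name `iterations_per_second` and the string-joining workload are assumptions made for illustration. They are not part of Nokogiri, and this is not the harness that produced any numbers in this thread.

```ruby
# Hypothetical helper: call the block repeatedly for roughly `seconds`
# and return the observed iterations per second as a Float.
def iterations_per_second(seconds: 0.5)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  start = clock.call
  count = 0
  while clock.call - start < seconds
    yield
    count += 1
  end
  count / (clock.call - start)
end

# Stand-in workload (an assumption): join many small HTML-like fragments.
rows = Array.new(1_000) { |i| "<tr><td>#{i}</td></tr>" }
rate = iterations_per_second(seconds: 0.2) { rows.join }
puts format("%.1f i/s", rate)
```

With Nokogiri installed, the block could instead call `to_s` on a document parsed with `Nokogiri::HTML5(html)` and, separately, on one parsed with `Nokogiri::HTML4(html)`, which would give a rough ratio between the two serializers on the same input.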

flavorjones added the state/needs-triage label Jun 5, 2022
flavorjones added the needs/research label and removed the state/needs-triage label Jun 15, 2022
stevecheckoway added a commit that referenced this issue Jul 19, 2022
HTML 5 serialization was previously done entirely in Ruby.
The Ruby code is slow. This reimplements the serialization in C.

Reencoding happens after UTF-8 serialization.

This is about 10x faster:

```
C - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) [x86_64-linux]:
      848.4 i/s
C - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
      812.0 i/s - same-ish: difference falls within error
ruby - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) +YJIT [x86_64-linux]:
       86.3 i/s - 9.83x  (± 0.00) slower
ruby - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) +YJIT [x86_64-linux]:
       82.9 i/s - 10.24x  (± 0.00) slower
ruby - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
       80.4 i/s - 10.55x  (± 0.00) slower
ruby - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) [x86_64-linux]:
       74.7 i/s - 11.36x  (± 0.00) slower
```

Fixes: #2569

Co-authored-by: Mike Dalessio <mike.dalessio@gmail.com>