Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HTML 5 serialization in C #2596

Merged
merged 1 commit into from Jul 19, 2022
Merged

Implement HTML 5 serialization in C #2596

merged 1 commit into from Jul 19, 2022

Conversation

stevecheckoway
Copy link
Contributor

HTML 5 serialization was previously done entirely in Ruby.
The Ruby code is slow. This reimplements the serialization in C.

Reencoding happens after UTF-8 serialization.

Fixes: #2569

What problem is this PR intended to solve?

#2569

Have you included adequate test coverage?

Not yet.

Does this change affect the behavior of either the C or the Java implementations?

It should speed up the serialization of HTML 5 without a change in behavior.

@stevecheckoway
Copy link
Contributor Author

I'm unsure where the downstream error TypeError: wrong argument type Nokogiri::XML::Text (expected Data) comes from.

@stevecheckoway
Copy link
Contributor Author

I can't reproduce this locally.
2401 tests, 8521 assertions, 0 failures, 0 errors, 17 skips

I guess I'll try on a Linux VM…but not tonight.

@flavorjones
Copy link
Member

@stevecheckoway Thanks for doing this!

The error you're seeing is because Github will rebase the PR onto current main when running the Github Actions pipelines. As a result, this changeset is running on top of the commits from #2579 which introduce typed data structures, and the use of Data_Get_Struct in this changeset conflicts with the Noko_Node_Get_Struct macro on main.

I'm going to explicitly rebase this and update the get-struct macros, and then it should go green.

@flavorjones
Copy link
Member

@stevecheckoway Looks like the downstream sanitize test suite is segfaulting: https://github.com/sparklemotion/nokogiri/runs/7392514046?check_suite_focus=true

@flavorjones
Copy link
Member

I ran valgrind on the sanitize tests using nokogiri from this PR, and it says:

==132564== Memcheck, a memory error detector
==132564== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==132564== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==132564== Command: /home/flavorjones/.rbenv/versions/3.1.2/bin/ruby test/test_sanitize.rb
==132564== 
==132564== Warning: client switching stacks?  SP change: 0x1ffe8020e0 --> 0x1ffeffe850
==132564==          to suppress, use: --max-stackframe=8374128 or greater
==132564== Invalid read of size 1
==132564==    at 0xBFE7F60: is_one_of (xml_node.c:1866)
==132564==    by 0xBFE7F60: output_node (xml_node.c:1929)
==132564==    by 0xBFE8652: html_standard_serialize (xml_node.c:1985)
==132564==    by 0x4B03797: vm_call_cfunc_with_frame (vm_insnhelper.c:3037)
==132564==    by 0x4B18848: vm_call_method_each_type (vm_insnhelper.c:3639)
==132564==    by 0x4B19113: vm_call_method (vm_insnhelper.c:3750)
==132564==    by 0x4B11A41: vm_sendish (vm_insnhelper.c:4751)
==132564==    by 0x4B11A41: vm_exec_core (insns.def:778)
==132564==    by 0x4B17622: rb_vm_exec (vm.c:2211)
...

Looks like name is NULL for TEXT nodes

@flavorjones
Copy link
Member

Oh, no, it's actually that fragments don't have a node name.

@flavorjones
Copy link
Member

@stevecheckoway I pushed a commit, would love your feedback.

@flavorjones
Copy link
Member

Some rough benchmarks:

#! /usr/bin/env ruby
# coding: utf-8
# frozen_string_literal: true

require "nokogiri"
require "benchmark/ips"

input = File.read("test/files/tlm.html")
doc = Nokogiri::HTML5::Document.parse(input)
puts "input #{input.length} → output #{doc.to_html.length}"

Benchmark.ips do |x|
  x.time = 10
  x.report do
    doc.to_html
  end
end
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) +YJIT [x86_64-linux] C impl:
      827.5 i/s
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) +YJIT [x86_64-linux]:
       93.9 i/s - 8.81x  (± 0.00) slower
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
       77.0 i/s - 10.75x  (± 0.00) slower

so pretty good 10x improvement, @stevecheckoway !!!

@flavorjones
Copy link
Member

flavorjones commented Jul 18, 2022

I think I'd like to squash these commits but this approach looks good. My only changes were using RTEST and checking for NULL node names.

HTML 5 serialization was previously done entirely in Ruby.
The Ruby code is slow. This reimplements the serialization in C.

Reencoding happens after UTF-8 serialization.

This is about 10x faster:

```
C - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) [x86_64-linux]:
      848.4 i/s
C - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
      812.0 i/s - same-ish: difference falls within error
ruby - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) +YJIT [x86_64-linux]:
       86.3 i/s - 9.83x  (± 0.00) slower
ruby - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) +YJIT [x86_64-linux]:
       82.9 i/s - 10.24x  (± 0.00) slower
ruby - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
       80.4 i/s - 10.55x  (± 0.00) slower
ruby - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) [x86_64-linux]:
       74.7 i/s - 11.36x  (± 0.00) slower
```

Fixes: #2569

Co-authored-by: Mike Dalessio <mike.dalessio@gmail.com>
@flavorjones
Copy link
Member

C - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) [x86_64-linux]:
      848.4 i/s
C - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
      812.0 i/s - same-ish: difference falls within error
ruby - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) +YJIT [x86_64-linux]:
       86.3 i/s - 9.83x  (± 0.00) slower
ruby - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) +YJIT [x86_64-linux]:
       82.9 i/s - 10.24x  (± 0.00) slower
ruby - ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]:
       80.4 i/s - 10.55x  (± 0.00) slower
ruby - ruby 3.2.0dev (2022-07-18T21:06:30Z master 85ea46730d) [x86_64-linux]:
       74.7 i/s - 11.36x  (± 0.00) slower

@flavorjones
Copy link
Member

I've squashed all the commits and updated the commit message with benchmark data. When this is green, I'd like to merge.

@flavorjones flavorjones marked this pull request as ready for review July 19, 2022 15:22
@flavorjones
Copy link
Member

@stevecheckoway Any last-minute thoughts on this? Feel free to whack the "merge" button.

@stevecheckoway
Copy link
Contributor Author

Your changes look good. Hopefully people can give it a test drive and shake out any bugs before the next release.

@stevecheckoway stevecheckoway merged commit 8220dc7 into sparklemotion:main Jul 19, 2022
@stevecheckoway stevecheckoway deleted the serialize-html5 branch July 19, 2022 20:21
@flavorjones
Copy link
Member

woo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

explore ways to speed up HTML5 document serialization
2 participants