Process HTML using Nokolexbor instead of Nokogumbo #3043

Open
ilyazub opened this issue Nov 28, 2023 · 0 comments
Labels
state/needs-triage Inbox for non-installation-related bug reports or help requests

ilyazub commented Nov 28, 2023

I used the benchmark from #2722 with Ruby 3.2.2 and 2.7.2, and added Nokolexbor to the benchmark.

In these runs, Nokolexbor is roughly 2-12 times faster at parsing and 3-7 times faster at serializing than Nokogiri's Gumbo (HTML5) and libxml2 (HTML4) backends.
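
For anyone who wants to double-check the summary, here is a minimal sketch that re-derives the speedup range from the "Comparison:" lines of the Ruby 3.2.2 output in this comment. The constant names and the `speedup_range` helper are made up for illustration and are not part of Nokogiri, Nokolexbor or benchmark-ips. Because the inputs are the rounded i/s values, the ratios can differ from the printed ones in the last digit.

```ruby
# Iterations per second copied from the "Comparison:" lines of the Ruby 3.2.2 run.
# Keys are the input lengths used in the report labels.
PARSE_IPS = {
  656       => { nokolexbor: 38214.8, html4: 21118.4, html5: 21029.8 },
  70_095    => { nokolexbor: 1466.9,  html4: 439.9,   html5: 275.8 },
  1_929_522 => { nokolexbor: 146.1,   html4: 36.4,    html5: 12.3 },
}.freeze

SERIALIZE_IPS = {
  656       => { nokolexbor: 260432.1, html4: 52521.9, html5: 39037.3 },
  70_095    => { nokolexbor: 3464.2,   html4: 1049.5,  html5: 950.7 },
  1_929_522 => { nokolexbor: 412.1,    html4: 112.7,   html5: 114.2 },
}.freeze

# Hypothetical helper: smallest and largest speedup of Nokolexbor
# over the two Nokogiri backends, rounded to two decimals.
def speedup_range(table)
  ratios = table.values.flat_map do |row|
    [row[:nokolexbor] / row[:html4], row[:nokolexbor] / row[:html5]]
  end
  ratios.minmax.map { |r| r.round(2) }
end

parse_min, parse_max = speedup_range(PARSE_IPS)
serlz_min, serlz_max = speedup_range(SERIALIZE_IPS)

puts "parse:     #{parse_min}x to #{parse_max}x"
puts "serialize: #{serlz_min}x to #{serlz_max}x"
```

Swapping in the Ruby 2.7.2 figures should give a similar picture, since the two runs are close to each other.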

#! /usr/bin/env ruby
# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
  gem "nokolexbor"
  gem "benchmark-ips"
end

require "nokogiri"
require "nokolexbor"
require "benchmark/ips"

filenames = [
  "test/files/GH_1042.html", # 650b
  "test/files/tlm.html", # 70kb
  "big_shopping.html", # 1.9mb
]

inputs = filenames.map { |fn| File.read(fn) }

puts RUBY_DESCRIPTION

inputs.each do |input|
  len = input.length

  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10

    x.report("html5 parse #{len}") do
      Nokogiri::HTML5::Document.parse(input)
    end
    x.report("html4 parse #{len}") do
      Nokogiri::HTML4::Document.parse(input)
    end
    x.report("nokolexbor html5 parse #{len}") do
      Nokolexbor::HTML(input)
    end
    x.compare!
  end
end

puts "=========="

inputs.each do |input|
  len = input.length
  html4_doc = Nokogiri::HTML4::Document.parse(input)
  html5_doc = Nokogiri::HTML5::Document.parse(input)
  html5_doc_nokolexbor = Nokolexbor::HTML(input)

  Benchmark.ips do |x|
    x.warmup = 0
    x.time = 10

    x.report("html5 serlz #{len}") do
      html5_doc.to_html
    end
    x.report("html4 serlz #{len}") do
      html4_doc.to_html
    end
    x.report("html5 nokolexbor serlz #{len}") do
      html5_doc_nokolexbor.to_html
    end
    x.compare!
  end
end

ruby 2.7.2 benchmark

$ ruby bench.rb
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
Calculating -------------------------------------
     html5 parse 656     21.049k (±23.5%) i/s -    179.547k in   9.929976s
     html4 parse 656     22.142k (±22.3%) i/s -    189.923k in   9.926466s
nokolexbor html5 parse 656
                         43.945k (±21.3%) i/s -    296.049k in   9.900173s

Comparison:
nokolexbor html5 parse 656:    43944.8 i/s
     html4 parse 656:    22141.7 i/s - 1.98x  (± 0.00) slower
     html5 parse 656:    21048.8 i/s - 2.09x  (± 0.00) slower

Calculating -------------------------------------
   html5 parse 70095    300.102  (±18.7%) i/s -      2.684k in   9.997238s
   html4 parse 70095    450.409  (±22.6%) i/s -      3.978k in   9.997504s
nokolexbor html5 parse 70095
                          1.406k (±20.4%) i/s -     13.083k in   9.984839s

Comparison:
nokolexbor html5 parse 70095:     1405.6 i/s
   html4 parse 70095:      450.4 i/s - 3.12x  (± 0.00) slower
   html5 parse 70095:      300.1 i/s - 4.68x  (± 0.00) slower

Calculating -------------------------------------
 html5 parse 1929522     13.132  (± 7.6%) i/s -    131.000  in  10.075865s
 html4 parse 1929522     37.880  (±13.2%) i/s -    370.000  in  10.017928s
nokolexbor html5 parse 1929522
                        157.773  (± 9.5%) i/s -      1.561k in   9.999853s

Comparison:
nokolexbor html5 parse 1929522:      157.8 i/s
 html4 parse 1929522:       37.9 i/s - 4.17x  (± 0.00) slower
 html5 parse 1929522:       13.1 i/s - 12.01x  (± 0.00) slower

==========
Calculating -------------------------------------
     html5 serlz 656     40.303k (±17.2%) i/s -    373.898k in   9.891472s
     html4 serlz 656     53.260k (±18.3%) i/s -    484.973k in   9.844606s
html5 nokolexbor serlz 656
                        263.888k (±15.5%) i/s -      2.270M in   9.493963s

Comparison:
html5 nokolexbor serlz 656:   263887.5 i/s
     html4 serlz 656:    53260.0 i/s - 4.95x  (± 0.00) slower
     html5 serlz 656:    40303.4 i/s - 6.55x  (± 0.00) slower

Calculating -------------------------------------
   html5 serlz 70095    918.855  (±15.1%) i/s -      8.842k in   9.993063s
   html4 serlz 70095      1.112k (±13.3%) i/s -     10.828k in   9.992264s
html5 nokolexbor serlz 70095
                          3.359k (±14.9%) i/s -     32.417k in   9.985435s

Comparison:
html5 nokolexbor serlz 70095:     3358.8 i/s
   html4 serlz 70095:     1112.0 i/s - 3.02x  (± 0.00) slower
   html5 serlz 70095:      918.9 i/s - 3.66x  (± 0.00) slower

Calculating -------------------------------------
 html5 serlz 1929522    107.234  (±12.1%) i/s -      1.055k in  10.007869s
 html4 serlz 1929522    115.701  (±11.2%) i/s -      1.140k in   9.999178s
html5 nokolexbor serlz 1929522
                        425.103  (±19.8%) i/s -      4.042k in   9.994780s

Comparison:
html5 nokolexbor serlz 1929522:      425.1 i/s
 html4 serlz 1929522:      115.7 i/s - 3.67x  (± 0.00) slower
 html5 serlz 1929522:      107.2 i/s - 3.96x  (± 0.00) slower

ruby 3.2.2 benchmark

$ ruby ./bench.rb
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
Calculating -------------------------------------
     html5 parse 656     21.030k (±18.5%) i/s -    170.856k
     html4 parse 656     21.118k (±18.9%) i/s -    172.096k in   9.886192s
nokolexbor html5 parse 656
                         38.215k (±24.8%) i/s -    243.899k in   9.856369s

Comparison:
nokolexbor html5 parse 656:    38214.8 i/s
     html4 parse 656:    21118.4 i/s - 1.81x  slower
     html5 parse 656:    21029.8 i/s - 1.82x  slower

Calculating -------------------------------------
   html5 parse 70095    275.828  (±21.0%) i/s -      2.421k in   9.996074s
   html4 parse 70095    439.891  (±20.9%) i/s -      3.646k in   9.995517s
nokolexbor html5 parse 70095
                          1.467k (±18.5%) i/s -     13.797k in   9.983325s

Comparison:
nokolexbor html5 parse 70095:     1466.9 i/s
   html4 parse 70095:      439.9 i/s - 3.33x  slower
   html5 parse 70095:      275.8 i/s - 5.32x  slower

Calculating -------------------------------------
 html5 parse 1929522     12.321  (± 8.1%) i/s -    122.000  in  10.067774s
 html4 parse 1929522     36.420  (±19.2%) i/s -    351.000  in  10.018349s
nokolexbor html5 parse 1929522
                        146.070  (±15.1%) i/s -      1.423k in  10.001315s

Comparison:
nokolexbor html5 parse 1929522:      146.1 i/s
 html4 parse 1929522:       36.4 i/s - 4.01x  slower
 html5 parse 1929522:       12.3 i/s - 11.86x  slower

==========
Calculating -------------------------------------
     html5 serlz 656     39.037k (±22.6%) i/s -    335.023k in   9.824201s
     html4 serlz 656     52.522k (±21.3%) i/s -    452.027k in   9.742767s
html5 nokolexbor serlz 656
                        260.432k (±19.0%) i/s -      2.064M in   9.155473s

Comparison:
html5 nokolexbor serlz 656:   260432.1 i/s
     html4 serlz 656:    52521.9 i/s - 4.96x  slower
     html5 serlz 656:    39037.3 i/s - 6.67x  slower

Calculating -------------------------------------
   html5 serlz 70095    950.690  (±15.6%) i/s -      9.173k in   9.989867s
   html4 serlz 70095      1.049k (±15.6%) i/s -     10.090k in   9.988001s
html5 nokolexbor serlz 70095
                          3.464k (±16.9%) i/s -     32.979k in   9.976496s

Comparison:
html5 nokolexbor serlz 70095:     3464.2 i/s
   html4 serlz 70095:     1049.5 i/s - 3.30x  slower
   html5 serlz 70095:      950.7 i/s - 3.64x  slower

Calculating -------------------------------------
 html5 serlz 1929522    114.167  (± 9.6%) i/s -      1.130k in  10.002443s
 html4 serlz 1929522    112.654  (±12.4%) i/s -      1.107k in  10.006577s
html5 nokolexbor serlz 1929522
                        412.097  (±18.9%) i/s -      3.934k in   9.992725s

Comparison:
html5 nokolexbor serlz 1929522:      412.1 i/s
 html5 serlz 1929522:      114.2 i/s - 3.61x  slower
 html4 serlz 1929522:      112.7 i/s - 3.66x  slower
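
The ± figures are fairly wide (the script sets `x.warmup = 0`), so it may be worth checking which differences stand out from the noise. Below is a rough sketch that treats each ± percentage as a symmetric band around the reported mean and tests whether two bands overlap. That is a simplification, and `ips_interval` and `overlap?` are hypothetical helpers, not benchmark-ips API.

```ruby
# Hypothetical helper: [low, high] band implied by a mean (i/s) and a ± percentage.
def ips_interval(mean, pct)
  delta = mean * pct / 100.0
  [mean - delta, mean + delta]
end

# True when two [low, high] bands share at least one point.
def overlap?(a, b)
  a[0] <= b[1] && b[0] <= a[1]
end

# Figures copied from the Ruby 3.2.2 parse run on the 656-character input.
html5  = ips_interval(21_030.0, 18.5)
html4  = ips_interval(21_118.0, 18.9)
lexbor = ips_interval(38_215.0, 24.8)

puts overlap?(html5, html4)   # → true: the two Nokogiri parsers are indistinguishable here
puts overlap?(html4, lexbor)  # → false: Nokolexbor's band sits above html4's
```

Under this reading, the html4 versus html5 gap on the smallest input is within the noise, while the Nokolexbor gap is not.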

@flavorjones, thank you for following up and checking out Nokolexbor! ♥️ What incompatibilities did you notice, and what do you think about using the Lexbor library in Nokogiri?

/cc @zyc9012

ilyazub added the state/needs-triage label Nov 28, 2023