Nokogiri should use Gumbo/HTML5 by default on supported platforms #2331

flavorjones · 2021-09-27T19:33:14Z

This issue is a placeholder for work to be done to use the HTML5 parsing engine by default on platforms where it's supported (meaning, not-JRuby).

Specifically this probably means that when the HTML5 module exists ...

Nokogiri::HTML() should proxy to Nokogiri::HTML5()
Nokogiri.parse() should proxy to Nokogiri::HTML5.parse()
bin/nokogiri should support html4 and html5 options, and the existing html option should proxy to html5

There may be other behaviors we want to switch to the HTML5 parser as well.
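
The proxying described in the list above could be sketched roughly as follows. This is only an illustration of the dispatch pattern, not Nokogiri's implementation: `ParserDemo`, its nested modules, and `html5_supported?` are hypothetical stand-ins so the example runs without the gem installed.

```ruby
# Stand-in namespace so the sketch runs without Nokogiri installed.
# Every name below is hypothetical; only the dispatch pattern matters.
module ParserDemo
  module HTML4
    def self.parse(input)
      [:html4, input]
    end
  end

  # Assumption: this module is simply not defined on platforms without
  # HTML5 support (the issue names JRuby as the unsupported case).
  module HTML5
    def self.parse(input)
      [:html5, input]
    end
  end

  # True when the HTML5 engine is available on this platform.
  def self.html5_supported?
    const_defined?(:HTML5, false)
  end

  # Counterpart of an HTML() entry point: prefer HTML5, fall back to HTML4.
  def self.HTML(input)
    html5_supported? ? HTML5.parse(input) : HTML4.parse(input)
  end
end

result = ParserDemo.HTML("<p>hi</p>")
puts result.first
```

Checking for the constant at call time keeps the HTML4 path intact where HTML5 is unavailable, which matches the "when the HTML5 module exists" condition above.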

Let's also please make sure to do some benchmarks before changing the default behavior. In particular this would be document and fragment parsing, as well as any CSS selectors that are conditionally translated (see #2376).
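
A minimal timing harness for the comparisons mentioned above might look like the sketch below. `time_parsers` is a hypothetical helper, and the two callables are placeholders so the example runs without Nokogiri; it is not the project's benchmark suite.

```ruby
# Times each callable over the same input and returns milliseconds per label.
# Uses the monotonic clock so wall-clock adjustments do not skew results.
def time_parsers(input, parsers, iterations: 10)
  parsers.transform_values do |parser|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_millisecond)
    iterations.times { parser.call(input) }
    finished = Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_millisecond)
    (finished - started) / iterations
  end
end

# Placeholder callables; swap in the real parsers when Nokogiri is loaded.
parsers = {
  "noop" => ->(html) { html },
  "scan tags" => ->(html) { html.scan(/<[^>]+>/) },
}

timings = time_parsers("<div><p>hello</p></div>" * 1_000, parsers, iterations: 5)
timings.each { |label, ms| puts format("%-10s %.3f ms", label, ms) }
```

With Nokogiri loaded, the placeholders could be replaced by callables such as `->(html) { Nokogiri::HTML4(html) }`, `->(html) { Nokogiri::HTML5(html) }`, `->(html) { Nokogiri::HTML4.fragment(html) }`, and `->(html) { Nokogiri::HTML5.fragment(html) }` to cover both document and fragment parsing.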

Pre-work:


flavorjones · 2021-12-25T05:40:14Z

Moving this out one release to make sure the CSS query hacking is stable.

html5 subclassing

**What problem is this PR intended to solve?**

See #2331 for context. I want to start getting things in place to make it possible to seamlessly switch to HTML5 parsing by default on supported platforms. Part of this will require subclassing behavior to work properly (i.e., as Loofah expects it to, where a subclass of Nokogiri::HTML5::Document will return the appropriate subclass from `.parse`).

This PR introduces that subclassing behavior, and makes all the HTML4 tests explicitly use `HTML4` instead of `HTML`.

Note that `Gumbo.parse` now takes an additional argument, which is the class that should be used for the new document. `Gumbo.parse` is considered to be an internal-only API and so this shouldn't be a breaking change, but it might be worth mentioning in release notes just in case.

**Have you included adequate test coverage?**

Yes, additional coverage has been added to `test/html5/test_api.rb`.

**Does this change affect the behavior of either the C or the Java implementations?**

HTML5 only exists in the CRuby implementation.
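
The subclassing pattern described in the PR text can be illustrated with stand-in classes. This is a sketch of the general idea only (an internal parse entry point that receives the class to instantiate); `EngineDemo`, `DemoDocument`, and `SanitizingDocument` are hypothetical names and do not reflect Nokogiri's or Gumbo's actual signatures.

```ruby
# Hypothetical stand-ins for the pattern in the PR description; these are not
# Nokogiri's real classes or signatures.
module EngineDemo
  # Plays the role of the internal parser entry point: the caller passes the
  # class to instantiate instead of the engine hard-coding one.
  def self.parse(input, document_class)
    document_class.new(input)
  end
end

class DemoDocument
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # `self` is whichever class received .parse, so subclasses get their own type.
  def self.parse(input)
    EngineDemo.parse(input, self)
  end
end

# A downstream library would define something like this subclass.
class SanitizingDocument < DemoDocument; end

doc = SanitizingDocument.parse("<p>hi</p>")
puts doc.class
```

Passing `self` down to the engine is what lets a downstream subclass get instances of its own type back from `.parse` without overriding it.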

flavorjones · 2022-06-05T17:20:39Z

A baby step we'll do first is: support subclassing in v1.14.0 to enable Loofah and Rails::Html::Sanitizer to default to HTML5:

So I'm pushing this out to v1.15.0

flavorjones · 2022-06-05T21:03:52Z

See #2569 for one performance concern that we should benchmark and address.

Update: it's been addressed.

flavorjones · 2022-11-16T19:06:27Z

It may be worth testing upstream Capybara before releasing this change, as Capybara has some logic for toggling between HTML4 and HTML5 parsers already.

zyc9012 · 2022-11-25T10:52:24Z

Nokogiri::HTML5 is slow on initialization as well.

[29] pry(main)> html = File.read('big_shopping.html')
[30] pry(main)> Nokogiri::VERSION
=> "1.13.9"
[31] pry(main)> Benchmark.ms { Nokogiri::HTML4(html) }
=> 54.85699977725744
[32] pry(main)> Benchmark.ms { Nokogiri::HTML5(html) }
=> 227.6209993287921

Tested HTML: big_shopping.html.zip

stevecheckoway · 2022-11-25T16:33:43Z

We should profile and figure out where it is spending the majority of its time.

One thing that could probably be improved is turning the part of the state machine that reads characters into a buffer into a loop rather than needing to do the state machine dispatch per character.

This potentially includes reading tag names and just reading text.

The state machine is complicated though and we’d need to look at the tests closely to make sure they adequately cover such a change.
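
To make the idea concrete, here is a toy comparison in Ruby. It is not Gumbo's tokenizer (which is written in C and has many more states); the two functions and their two-state model are assumptions made purely to contrast per-character dispatch with consuming a whole run of characters per dispatch.

```ruby
# Toy tokenizer, not Gumbo's state machine: it only knows "text" and "tag".

# Per-character version: every character goes through the state dispatch.
def tokenize_per_char(input)
  tokens = []
  state = :text
  buffer = +""
  input.each_char do |char|
    case state
    when :text
      if char == "<"
        tokens << [:text, buffer] unless buffer.empty?
        buffer = +""
        state = :tag
      else
        buffer << char
      end
    when :tag
      if char == ">"
        tokens << [:tag, buffer]
        buffer = +""
        state = :text
      else
        buffer << char
      end
    end
  end
  tokens << [:text, buffer] if state == :text && !buffer.empty?
  tokens
end

# Run-based version: dispatch once per state, then consume the whole run of
# characters up to the next delimiter in a single step.
def tokenize_by_run(input)
  tokens = []
  position = 0
  while position < input.length
    if input[position] == "<"
      close = input.index(">", position)
      break if close.nil? # unterminated tag: dropped, as in the version above
      tokens << [:tag, input[position + 1...close]]
      position = close + 1
    else
      stop = input.index("<", position) || input.length
      tokens << [:text, input[position...stop]]
      position = stop
    end
  end
  tokens
end

sample = "<p>Hello</p> world"
puts tokenize_per_char(sample) == tokenize_by_run(sample)
```

Both functions are meant to produce identical tokens; comparing their output over a corpus is the kind of equivalence check the existing tests would need to provide before such a change.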

flavorjones · 2022-12-08T06:09:05Z

@stevecheckoway here's some profiling info on what the parser is doing when parsing big_shopping.html (generated with gperftools cpu profiler):

      93  11.7%  11.7%       93  11.7% decode (inline)
      81  10.2%  21.9%      640  80.4% gumbo_parse_with_options
      74   9.3%  31.2%       75   9.4% pthread_attr_setschedparam
      56   7.0%  38.2%       56   7.0% _init@3e000
      41   5.2%  43.3%      331  41.6% gumbo_lex
      40   5.0%  48.4%      135  17.0% read_char
      35   4.4%  52.8%       52   6.5% gumbo_string_buffer_append_codepoint
      33   4.1%  56.9%      181  22.7% handle_token (inline)
      28   3.5%  60.4%       28   3.5% get_current_node.isra.0
      20   2.5%  62.9%      149  18.7% finish_token.isra.0
      17   2.1%  65.1%       65   8.2% insert_text_token.isra.0
      16   2.0%  67.1%       16   2.0% get_adjusted_current_node
      15   1.9%  69.0%      170  21.4% emit_char
      14   1.8%  70.7%       14   1.8% atomic_sub_nounderflow (inline)
      14   1.8%  72.5%       14   1.8% xmlStrdup (inline)
      12   1.5%  74.0%       12   1.5% gumbo_tokenizer_set_is_adjusted_current_node_foreign
      11   1.4%  75.4%       14   1.8% handle_text
      10   1.3%  76.6%       17   2.1% maybe_resize_string_buffer
       9   1.1%  77.8%        9   1.1% update_position (inline)
       8   1.0%  78.8%       26   3.3% __libc_malloc
...

I won't attempt to analyze that today, but the amount of time spent in pthread_attr_setschedparam makes me scratch my head, as does the _init entry. Anything in particular you want me to dig into?

stevecheckoway · 2022-12-08T19:29:15Z

Some of that is quite odd. The fact that gumbo_tokenizer_set_is_adjusted_current_node_foreign is showing up at all is surprising.

I just looked at the assembly expecting that the code would essentially be a few loads and a store. This is the only line of the function that does anything:

  parser->_tokenizer_state->_is_adjusted_current_node_foreign = is_foreign;

The assembly that is being emitted contains calls to the empty gumbo_debug function. I've got a simple fix for that:

diff --git a/gumbo-parser/src/util.c b/gumbo-parser/src/util.c
index d1ab2d7a..6238c296 100644
--- a/gumbo-parser/src/util.c
+++ b/gumbo-parser/src/util.c
@@ -63,6 +63,4 @@ void gumbo_debug(const char* format, ...) {
   va_end(args);
   fflush(stdout);
 }
-#else
-void gumbo_debug(const char* UNUSED_ARG(format), ...) {}
 #endif
diff --git a/gumbo-parser/src/util.h b/gumbo-parser/src/util.h
index dfdf465b..5c6ddd8c 100644
--- a/gumbo-parser/src/util.h
+++ b/gumbo-parser/src/util.h
@@ -21,7 +21,11 @@ void* gumbo_realloc(void* ptr, size_t size) RETURNS_NONNULL;
 void gumbo_free(void* ptr);

 // Debug wrapper for printf
+#ifdef GUMBO_DEBUG
 void gumbo_debug(const char* format, ...) PRINTF(1);
+#else
+static inline void gumbo_debug(const char* UNUSED_ARG(format), ...) PRINTF(1) {}
+#endif

 #ifdef __cplusplus
 }

After that change, the emitted code is

                     _gumbo_tokenizer_set_is_adjusted_current_node_foreign:
0000000000137474         ldr        x8, [x0, #0x10]                             ; CODE XREF=_gumbo_parse_with_options+908
0000000000137478         strb       w1, [x8, #0x5]
000000000013747c         ret

But even that could be inlined with link-time optimization. (In fact, doing link-time optimization and also explicitly setting the symbols that are exported from the nokogiri.{bundle,so,dll} is likely to have a measurable performance win. Setting which symbols are exported should speed up program startup times in particular.)

Those two seem like easy wins although setting which symbols are exported may impact downstream projects that rely on the Nokogiri C extension's symbols (like Nokogumbo did). Unfortunately, the 2-pass procedure where we first build a DOM tree using Gumbo's data structures and then build a DOM tree using libxml2's data structures does mean we have some essentially unavoidable overhead. Maybe gumbo could be modified to build a libxml2 DOM itself. That's likely a significant undertaking.

flavorjones · 2022-12-11T21:31:58Z

I'm going to open up a new issue to dive into some of these optimizations, since this issue is related but not specific to performance. Let's move the performance conversation to #2722.

flavorjones added the topic/HTML5 label Sep 27, 2021

flavorjones added this to the v1.13.0 milestone Sep 27, 2021

flavorjones modified the milestones: v1.13.0, v1.14.0 Dec 25, 2021

flavorjones mentioned this issue May 8, 2022

html5 subclassing #2534

Merged

flavorjones mentioned this issue May 30, 2022

[draft] default to html5 parsing flavorjones/loofah#239

Closed

5 tasks

flavorjones modified the milestones: v1.14.0, v1.15.0 Jun 5, 2022

flavorjones mentioned this issue Jun 5, 2022

explore ways to speed up HTML5 document serialization #2569

Closed

This was referenced Dec 9, 2022

[build bug] Modifying gumbo source doesn't cause rebuilds #2718

Open

explore further optimizing the HTML5 parser and serializer #2722

Open

flavorjones modified the milestones: v1.15.0, v1.16.0 Apr 28, 2023

flavorjones modified the milestones: v1.16.0, v1.16.x patch releases, v1.17.0 Dec 28, 2023
