Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nokogiri should use Gumbo/HTML5 by default on supported platforms #2331

Open
flavorjones opened this issue Sep 27, 2021 · 9 comments
Open

Nokogiri should use Gumbo/HTML5 by default on supported platforms #2331

flavorjones opened this issue Sep 27, 2021 · 9 comments
Milestone

Comments

@flavorjones
Copy link
Member

flavorjones commented Sep 27, 2021

This issue is a placeholder for work to be done to use the HTML5 parsing engine by default on platforms where it's supported (meaning, not-JRuby).

Specifically this probably means that when the HTML5 module exists ...

  • Nokogiri::HTML() should proxy to Nokogiri::HTML5()
  • Nokogiri.parse() should proxy to Nokogiri::HTML5.parse()
  • bin/nokogiri should support html4 and html5 options, and the existing html option should proxy to html5

There may be other behaviors we want to switch to the HTML5 parser as well.

Let's also please make sure to do some benchmarks before changing the default behavior. In particular this would be document and fragment parsing, as well as any CSS selectors that are conditionally translated (see #2376).

Pre-work:

@flavorjones flavorjones added this to the v1.13.0 milestone Sep 27, 2021
@flavorjones flavorjones modified the milestones: v1.13.0, v1.14.0 Dec 25, 2021
@flavorjones
Copy link
Member Author

Moving this out one release to make sure the CSS query hacking is stable.

flavorjones added a commit that referenced this issue May 9, 2022
html5 subclassing

---

**What problem is this PR intended to solve?**

See #2331 for context. I want to start getting things in place to make it possible to seamlessly switch to HTML5 parsing by default on supported platform. Part of this will require subclassing behavior to work properly (i.e., as Loofah expects it to, where a subclass of Nokogiri::HTML5::Document will return the appropriate subclass from `.parse`).

This PR introduces that subclassing behavior, and makes all the HTML4 tests explicitly use `HTML4` instead of `HTML`.

Note that `Gumbo.parse` now takes an additional argument, which is the class that should be used for the new document. `Gumbo.parse` is considered to be an internal-only API and so this shouldn't be a breaking change, but it might be worth mentioning in release notes just in case.

**Have you included adequate test coverage?**

Yes, additional coverage has been added to `test/html5/test_api.rb`


**Does this change affect the behavior of either the C or the Java implementations?**

HTML5 only exists in the CRuby implementation
@flavorjones
Copy link
Member Author

A baby step we'll do first is: support subclassing in v1.14.0 to enable Loofah and Rails::Html::Sanitizer to default to HTML5:

So I'm pushing this out to v1.15.0

@flavorjones
Copy link
Member Author

flavorjones commented Jun 5, 2022

See #2569 for one performance concern that we should benchmark and address.

Update: it's been addressed.

@flavorjones
Copy link
Member Author

It may be worth testing upstream Capybara before releasing this change, as Capybara has some logic for toggling between HTML4 and HTML5 parsers already.

@zyc9012
Copy link

zyc9012 commented Nov 25, 2022

Nokogiri::HTML5 is slow on initialization as well.

[29] pry(main)> html = File.read('big_shopping.html')
[30] pry(main)> Nokogiri::VERSION
=> "1.13.9"
[31] pry(main)> Benchmark.ms { Nokogiri::HTML4(html) }
=> 54.85699977725744
[32] pry(main)> Benchmark.ms { Nokogiri::HTML5(html) }
=> 227.6209993287921

Tested HTML: big_shopping.html.zip

@stevecheckoway
Copy link
Contributor

We should profile and figure out where it is spending the majority of its time.

One thing that could probably be improved is turning the part of the state machine that reads characters into a buffer into a loop rather than needing to do the state machine dispatch per character.

This potentially includes reading tag names and just reading text.

The state machine is complicated though and we’d need to look at the tests closely to make sure they adequately cover such a change.

@flavorjones
Copy link
Member Author

@stevecheckoway here's some profiling info on what the parser is doing when parsing big_shopping.html (generated with gperftools cpu profiler):

      93  11.7%  11.7%       93  11.7% decode (inline)
      81  10.2%  21.9%      640  80.4% gumbo_parse_with_options
      74   9.3%  31.2%       75   9.4% pthread_attr_setschedparam
      56   7.0%  38.2%       56   7.0% _init@3e000
      41   5.2%  43.3%      331  41.6% gumbo_lex
      40   5.0%  48.4%      135  17.0% read_char
      35   4.4%  52.8%       52   6.5% gumbo_string_buffer_append_codepoint
      33   4.1%  56.9%      181  22.7% handle_token (inline)
      28   3.5%  60.4%       28   3.5% get_current_node.isra.0
      20   2.5%  62.9%      149  18.7% finish_token.isra.0
      17   2.1%  65.1%       65   8.2% insert_text_token.isra.0
      16   2.0%  67.1%       16   2.0% get_adjusted_current_node
      15   1.9%  69.0%      170  21.4% emit_char
      14   1.8%  70.7%       14   1.8% atomic_sub_nounderflow (inline)
      14   1.8%  72.5%       14   1.8% xmlStrdup (inline)
      12   1.5%  74.0%       12   1.5% gumbo_tokenizer_set_is_adjusted_current_node_foreign
      11   1.4%  75.4%       14   1.8% handle_text
      10   1.3%  76.6%       17   2.1% maybe_resize_string_buffer
       9   1.1%  77.8%        9   1.1% update_position (inline)
       8   1.0%  78.8%       26   3.3% __libc_malloc
...

html5

I won't attempt to analyze that today, but the amount of time spent in pthread_attr_setschedparam makes me scratch my head, as does the _init entry. Anything in particular you want me to dig into?

@stevecheckoway
Copy link
Contributor

Some of that is quite odd. The fact that gumbo_tokenizer_set_is_adjusted_current_node_foreign is showing up at all is surprising.

I just looked at the assembly expecting that the code would be essentially be a few loads and a store. This is the only line of the function that does anything:

  parser->_tokenizer_state->_is_adjusted_current_node_foreign = is_foreign;

The assembly that is being emitted contains calls to the empty gumbo_debug function. I've got a simple fix for that:

diff --git a/gumbo-parser/src/util.c b/gumbo-parser/src/util.c
index d1ab2d7a..6238c296 100644
--- a/gumbo-parser/src/util.c
+++ b/gumbo-parser/src/util.c
@@ -63,6 +63,4 @@ void gumbo_debug(const char* format, ...) {
   va_end(args);
   fflush(stdout);
 }
-#else
-void gumbo_debug(const char* UNUSED_ARG(format), ...) {}
 #endif
diff --git a/gumbo-parser/src/util.h b/gumbo-parser/src/util.h
index dfdf465b..5c6ddd8c 100644
--- a/gumbo-parser/src/util.h
+++ b/gumbo-parser/src/util.h
@@ -21,7 +21,11 @@ void* gumbo_realloc(void* ptr, size_t size) RETURNS_NONNULL;
 void gumbo_free(void* ptr);

 // Debug wrapper for printf
+#ifdef GUMBO_DEBUG
 void gumbo_debug(const char* format, ...) PRINTF(1);
+#else
+static inline void gumbo_debug(const char* UNUSED_ARG(format), ...) PRINTF(1) {}
+#endif

 #ifdef __cplusplus
 }

After that change, the emitted code is

                     _gumbo_tokenizer_set_is_adjusted_current_node_foreign:
0000000000137474         ldr        x8, [x0, #0x10]                             ; CODE XREF=_gumbo_parse_with_options+908
0000000000137478         strb       w1, [x8, #0x5]
000000000013747c         ret

But even that could be inlined with link-time optimization. (In fact, doing a link time optimization and also explicitly setting the symbols that are exported from the nokogiri.{bundle,so,dll} is likely to have a measurable performance win. Setting which symbols are exported should speed up program startup times in particular.

Those two seem like easy wins although setting which symbols are exported may impact downstream projects that rely on the Nokogiri C extension's symbols (like Nokogumbo did). Unfortunately, the 2-pass procedure where we first build a DOM tree using Gumbo's data structures and then build a DOM tree using libxml2's data structures does mean we have some essentially unavoidable overhead. Maybe gumbo could be modified to build a libxml2 DOM itself. That's likely a significant undertaking.

@flavorjones
Copy link
Member Author

I'm going to open up a new issue to dive into some these optimizations, since this issue is related but not specific to performance. Let's move the performance conversation to #2722.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants