> The maintainers of Nokogiri and Nokogumbo are planning to merge the two gems together so that Nokogiri assimilates Nokogumbo's HTML5 parsing functionality. #70

Closed
hino-ryu opened this issue May 23, 2021 · 0 comments

Comments

hino-ryu commented May 23, 2021

The maintainers of Nokogiri and Nokogumbo are planning to merge the two gems together so that Nokogiri assimilates Nokogumbo's HTML5 parsing functionality.

This issue is intended to be a "parent" issue which can be followed to understand the plan and how it's progressing, as the work will likely take O(weeks) and will be on a feature branch driven by multiple PRs.

This description will be edited as we go to reflect current state and progress.

Background

This work has previously been discussed:

Goals

Here's what success looks like:

  • Nokogiri v1.12 supports HTML5 parsing using Nokogumbo's API, code, and fork of Gumbo.
  • The Nokogumbo maintainers feel welcomed into the Nokogiri maintainer team.
  • Upgrade paths for users are easy to navigate, and users don't get "stuck" on old versions.
  • The Gumbo parser ships as a precompiled library in all the native platform gems supported by Nokogiri v1.11.
  • The end-state of Nokogumbo is a (mostly) empty gem that prompts users to replace their dependency on Nokogumbo with Nokogiri.

For more specific objectives, see the "Punchlist" section below.

Note: No JRuby support at this time

⚠️ JRuby support is not going to be addressed as part of this work. However, we feel it is important to think about and we hope to work on this in the future. If you're interested in helping with HTML5 support on JRuby, please comment on this issue or ping the maintainers on the mailing list or the Discord channel.

The Nokogumbo code relies on a parser implemented in C, and a C extension that is tightly coupled to libxml2. As a result, the Nokogiri::HTML5 module will not be immediately available on JRuby, which uses Xerces in place of libxml2.

We ask that all downstream libraries be aware of this platform limitation as they consider using HTML5 parsing methods post-merger.

Risks

This work doesn't feel very risky, but if I had to name the riskiest bits:

  • Precompiling Gumbo for all our platforms might be challenging.
  • Gumbo is no longer supported upstream by Google, so we are assuming responsibility for keeping it secure, performant, and up-to-date with the HTML5 spec.
  • There's no way to prevent using a new Nokogiri with an old Nokogumbo and seeing lots of annoying "constant redefined" warning messages.

Some things that could have been risky but aren't:

  • License compatibility.

    • The Nokogumbo license is the Apache License 2.0 (abbreviated APL2.0 in this issue). APL2.0 is compatible with MIT, so we will keep the APL2.0 license on the specific files that came from Nokogumbo and preserve the notices that APL2.0 requires. We are already doing this with several files in the JRuby Java implementation.

Frequently Asked Questions

Why is this going into v1.12 and not v2.0?

This is not a breaking change. We want everyone on v1.11 to upgrade, and will be making efforts to make that upgrade painless.

Will Nokogiri's current HTML API or parsing behavior change?

No. Nokogiri's existing HTML parsing functionality, available under the Nokogiri::HTML module/namespace, will not change in this release. The Nokogumbo additions to Nokogiri are all contained under the Nokogiri::HTML5 module/namespace, and do not conflict with existing Nokogiri functionality.

In the future, we may explore how Nokogiri might be smarter about HTML4 versus HTML5 parsing, but those changes will be introduced carefully. See some more thoughts on this topic at #2064 (comment)

Will JRuby support HTML5?

Not initially, though we hope to work on this for a future release. See the section above for more information.

Punchlist

Some finer-grained objectives (which will be modified over time as we discover new work to be done):

Pre-merger

  • Release a Nokogumbo version that no-ops if Nokogiri is already providing HTML5 functionality
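
One way such a no-op check could work is plain feature detection at load time. The sketch below is an assumption for illustration, not Nokogumbo's actual code: `html5_already_provided?` is a hypothetical helper, and the anonymous modules are stand-ins for an older and a newer Nokogiri so that the example runs without either gem installed.

```ruby
# Hypothetical check: does the given namespace already expose HTML5.parse?
def html5_already_provided?(namespace)
  namespace.const_defined?(:HTML5, false) &&
    namespace.const_get(:HTML5, false).respond_to?(:parse)
end

# Stand-in for a Nokogiri release without HTML5 support.
older_host = Module.new

# Stand-in for a Nokogiri release that ships its own HTML5 module.
newer_host = Module.new
newer_host.const_set(:HTML5, Module.new { def self.parse(markup); markup; end })

{ "older host" => older_host, "newer host" => newer_host }.each do |label, host|
  # A real gem would skip or load its own parser here instead of printing.
  action = html5_already_provided?(host) ? "no-op" : "load own HTML5 parser"
  puts "#{label}: #{action}"
end
```

Checking for the constant and the method, rather than comparing version strings, keeps the decision tied to what is actually loaded in the process.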

Team Merger

  • Nokogumbo contributors are added to the Nokogiri gemspec, README, and copyright declarations.
  • Nokogumbo contributors are added to @sparklemotion/nokogiri-core
  • Send some welcome gifts to the Nokogumbo maintainers.

Legal/License Merger

  • All Nokogumbo files should mention that they were originally licensed under Apache 2.0 (an interpretation of APL2.0 clause 4.c) and that they have been changed (clause 4.b)
  • Changed gumbo-parser files should "carry prominent notices stating that You changed the files" (APL2.0 clause 4.b)
  • Update LICENSE-DEPENDENCIES.md to include gumbo and nokogumbo under Apache License 2.0 (APL2.0 clause 4.a)

Functional Merger

  • Commit history for Nokogumbo is preserved in the Nokogiri repository.
  • Unit tests include an integration contract for rubys/nokogumbo#171
  • Nokogumbo unit tests integrated into rake test
  • Nokogumbo's HTML5 files moved into lib/nokogiri and C file moved into ext/nokogiri
  • extconf.rb builds Nokogumbo and Gumbo into nokogiri.so
  • Nokogumbo unit tests are green on a dev system
  • Valgrind memory testing is green on HTML5 functionality
  • Gumbo can be precompiled for linux 64-bit and 32-bit.
  • Gumbo can be precompiled for windows 64-bit and 32-bit.
  • Gumbo can be precompiled for darwin x86-64 and arm64.
  • Gumbo can be compiled from the vanilla gem on platforms without a native gem.
  • Integrate test coverage platform breadth from Nokogumbo's CI
  • Import the html5lib tests into the repository and the test suite, and get it green in CI

Pre-release

  • Alias Nokogiri::HTML4 to Nokogiri::HTML to prepare for possibly changing default behavior in a future release.
  • HTML5 rdocs should note that the API is available only starting in v1.12, and only in CRuby (not JRuby).
  • Update README and Tutorials
  • Update #2205 for people coming from a Nokogumbo warning message
  • Integration test with Nokogumbo v2.0.4 (with warnings) and v2.0.5 (without warnings)
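
For the `Nokogiri::HTML4` alias item in the list above, a Ruby module alias is just a second constant bound to the same module object. The sketch below uses a stand-in namespace, `ExampleParser`, which is hypothetical and is not how Nokogiri's source is organized.

```ruby
# Stand-in namespace for illustration only; the real one would be Nokogiri.
module ExampleParser
  module HTML
    def self.parse(markup)
      "parsed: #{markup}"
    end
  end

  # The alias: HTML4 and HTML now name the same module object.
  HTML4 = HTML
end

puts ExampleParser::HTML4.equal?(ExampleParser::HTML) # prints "true"
puts ExampleParser::HTML4.parse("<p>hi</p>")          # prints "parsed: <p>hi</p>"
```

Because both constants point at one module, existing callers of the old name keep working while new code can start using the explicit name.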

/cc @rubys @stevecheckoway

Originally posted by @hino-ryu in sparklemotion/nokogiri#2204 (comment)
