Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support passing serialization_options to Nokogiri #171

Merged

Conversation

dmytro-savochkin
Copy link
Contributor

Nokogiri upgrade to 1.11.4 introduced an implicit upgrade of internal libxml2 library to 2.9.12 (for MRI at least).
However on that version of libxml2 they fixed handling of Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS flag (reference) so that it takes effect and forces Nokogiri parser to transform this (for example) <meta ...> into this <meta ...></meta> when document.mode = :xhtml is used.

The problem is that Roadie has these options hardcoded and there is no way for a Roadie user to change them without monkey patching. Hence this PR. It should allow Roadie users to remove NO_EMPTY_TAGS to stop Roadie and Nokogiri from transforming empty tags. This PR however is not only about this particular option, it would allow rewriting other options as well: for example, you could remove Nokogiri::XML::Node::SaveOptions::NO_DECLARATION to make Roadie and Nokogiri add <?xml ...> declaration tag at the very top of the document.

If this is merged I can prepare a PR for roadie-rails as well.

@Mange
Copy link
Owner

Mange commented Nov 15, 2022

This is a great start. Very well done!

I think I might want to not have this option, however. I think it complicates the API, and it feels like it should not be needed. I don't see why :xhtml should ever want the NO_EMPTY_TAGS option set, for example, so I'd rather have better defaults set in here.

The linked issue seem to say that the problem was only for a specific version of Nokogiri and that it would be fixed, but since it's a very old issue I take it the behavior wasn't ever restored.

In either case, I would love it if you could set some default options for the different modes and use them. It's great with a test or two that fail without the save option set.

@dmytro-savochkin
Copy link
Contributor Author

dmytro-savochkin commented Nov 15, 2022

This is a great start. Very well done!

Thanks!

I think I might want to not have this option, however. I think it complicates the API, and it feels like it should not be needed. I don't see why :xhtml should ever want the NO_EMPTY_TAGS option set, for example, so I'd rather have better defaults set in here.

Well, this was my first thoughts as well. But then I started thinking, what if someone needs it for something (even if I don't see why)? So that's why I proposed this PR, it does complicate things a little but it's guaranteed not to require anyone to do anything while migrating Roadie to a new version (unless there is a bug of course).
But that's definitely for you to decide what to do :) If you want to change defaults instead, I am all for it!

The linked issue seem to say that the problem was only for a specific version of Nokogiri and that it would be fixed, but since it's a very old issue I take it the behavior wasn't ever restored.

It was fixed in September of 2021 in Nokogiri, meaning they changed their defaults for the case of XHTML to NOT include NO_EMPTY_TAGS option. The problem with Roadie however is that it overwrites Nokogiri's defaults in https://github.com/Mange/roadie/blob/master/lib/roadie/document.rb#L193.

In either case, I would love it if you could set some default options for the different modes and use them. It's great with a test or two that fail without the save option set.

You mean something like this:

{
  html: (save_options::AS_HTML | save_options::NO_DECLARATION | save_options:: NO_EMPTY_TAGS),
  xhtml: (save_options::AS_XHTML | save_options::NO_DECLARATION),
  xml: save_options::AS_XML
}

?
I think this is what Nokogiri now has in main branch for MRI.

@Mange
Copy link
Owner

Mange commented Nov 17, 2022

Those choices sound very reasonable. I'm all for it. :-)

@dmytro-savochkin
Copy link
Contributor Author

Okay, I started implementing a PR to add new defaults and I found a problem.

Basically, now Nokogiri doesn't specify NO_EMPTY_TAGS for XML by default. Also W3 recommends using self-closing tags (<tag/>) as empty tags for XML if elements are declared empty:

Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag should be used, and should only be used, for elements which are declared EMPTY.

It seems if there is no specific DTD or schema the preferred way is not defined.

However we do have a Roadie user who has a use case for XML to have empty tags specified in the format of start tag+end tag (<tag></tag>). Our very source code contains test cases for this!

So, what do we do now? If we choose Nokogiri's default approach we do break the existing code for at least one user, if we choose the said user's approach we go against Nokogiri's default and maybe even W3 recommendations in some cases.
I am starting to think that this PR is better since it looks like we do need some flexibility in this matter after all.

@Mange
Copy link
Owner

Mange commented Nov 19, 2022

Good research. You've changed my mind on the matter. 👍

@codecov-commenter
Copy link

Codecov Report

Base: 97.97% // Head: 97.99% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (dc55ed6) compared to base (beb12f0).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #171      +/-   ##
==========================================
+ Coverage   97.97%   97.99%   +0.02%     
==========================================
  Files          57       57              
  Lines        1973     1994      +21     
==========================================
+ Hits         1933     1954      +21     
  Misses         40       40              
Impacted Files Coverage Δ
lib/roadie/document.rb 100.00% <100.00%> (ø)
spec/integration_spec.rb 100.00% <100.00%> (ø)
spec/lib/roadie/document_spec.rb 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@Mange Mange merged commit 8f2c810 into Mange:master Nov 19, 2022
@Mange
Copy link
Owner

Mange commented Nov 19, 2022

Version 5.1.0 was just released with this new feature in it. Thank you for your contribution! ❤️

@dmytro-savochkin dmytro-savochkin deleted the support-passing-serialization-options branch November 20, 2022 12:57
@dmytro-savochkin
Copy link
Contributor Author

This is great, thank you!

Now it's released and I was thinking to make a PR in roadie-rails to allow specifying serialization_options there as well.
However now after I've taken another look it seems maybe we don't need it. It looks like roadie-rails doesn't allow using any other mode except html and all the use cases known to me that require changing serialization options happen during serialization of either xhtml or xml.

So I think I am not gonna propose a PR there for now but let me know if I am missing something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants