Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible regression in node.replace(X) when X is a huge string of markup ("error parsing fragment (1)") #1442

Closed
twelve17 opened this issue Mar 4, 2016 · 12 comments

Comments

@twelve17
Copy link

twelve17 commented Mar 4, 2016

Given Nokogiri::HTML document A (a large HTML document with many nested tags) and Nokogiri::HTML document B (which can be a simple document: <html><body></body></html>), when trying to replace a node in A with the string contents of B, this error is thrown:

RuntimeError: error parsing fragment (1)
from /home/foo/.rbenv/versions/2.1.2/lib/ruby/gems/2.1.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb:404:in `in_context'

This occurs with Nokogiri gem version 1.6.7.2. It does not seem to occur with Nokogiri 1.6.6.4. I am on Mac OS X 10.11.3. I think NG was built using the system libxml2 library: /usr/lib/libxml2.2.dylib.

I have attached a small script (nokogiri_large_doc_replace_test.zip) to reproduce the problem in version 1.6.7.2. The script will create the large document A with many nested elements until it reaches an arbitrary size, then create the document B and attempt to do documentB.replace(large_markup_inner_html_string_from_A) to see if the exception is thrown.

The arbitrary size threshold that throws an error on my system is noted in the constant at the beginning of the script: SIZE_THRESHOLD = 16505. If I set this to something low (say 10), I do not get the exception.

I am afraid I have not yet pinpointed the change or cause that results in this error in the newer version. However, maybe these bits I have found while digging into it will help (and not misdirect!):

  1. I see that now there is an option to disable the underlying hardcoded limit for large files (via: Support for libxml XML_PARSE_HUGE #430).
  2. When doing node.replace with a markup string argument, #coerce is called, which calls #fragment, which creates a new DocumentFragment.
  3. DocumentFragment's initializer checks for a context (which exists when coming from node.fragment(string)), and then calls #parse on it.
  4. The #parse method builds a new options object, with some defaults. Alas, the huge option one may have set in the original document does not seem available here to be propagated to this object. The subsequent in_context call fails, presumably because the underlying parser reaches the arbitrary limit and breaks. If I modify this options object and add the huge option, the fragment is created successfully.

I should emphasize that even though I found a semblance of cause in step 4, I do not yet know what changed between 1.6.6.4 and 1.6.7.2--maybe it was something else!

Update 1: I just realized that in the test script, line 46, I am doing a control experiment, where I read the large markup string into a new Nokogiri::HTML, and, in that call, I am not using the huge option, yet it still works. So perhaps my usage of huge is masking an underlying problem.

@twelve17 twelve17 changed the title Possible regression in node.replace(X) when X is a huge string of markup Possible regression in node.replace(X) when X is a huge string of markup ("error parsing fragment (1)") Mar 4, 2016
@flavorjones
Copy link
Member

Thanks for reporting this, I'll take a look and try to reproduce.

@flavorjones
Copy link
Member

Can you please provide the output from nokogiri -v on your system?

@flavorjones
Copy link
Member

More precisely: can you please provide nokogiri -v output from both of your Nokogiri installations.

@flavorjones
Copy link
Member

I've reproduced this issue, seeing an error with 1.6.7.2 and not seeing an error with 1.6.6.4.

Both versions of nokogiri were compiled against the vendored libxml2, and were running on Ruby 2.3.0.

@flavorjones
Copy link
Member

OK, this behavior was introduced in commit 9eb540e which applied upstream libxml2 patches to the vendored libxml 2.9.2.

@flavorjones
Copy link
Member

And notably this behavior is also present in nokogiri 1.6.8.rc3 which uses libxml 2.9.3. This indicates to me that this is entirely due to upstream behavior introduced in libxml2.

@flavorjones
Copy link
Member

Right, OK, so if you rescue the exception and examine the document's errors, you'll see:

[#<Nokogiri::XML::SyntaxError: Excessive depth in document: 256 use XML_PARSE_HUGE option>]

which simply means that the depth is deeper than libxml2 allows.

Notably, you're not setting the HUGE config on the document -- Document#new doesn't accept a config block. Use Document#parse instead.

That said, the HUGE option doesn't seem to do anything; a doc with this option set or not set still raises this error at SIZE_THRESHOLD 7484.

In any case, if you feel strongly that this is a bug, I'd ask that you submit a bug report upstream to the libxml2 project.

Since there's nothing Nokogiri can do to change the behavior of libxml2, I'm going to close this issue. If you'd like to continue the conversation, I'm happy to keep talking, just respond here.

Sorry this isn't a more helpful response.

@flavorjones
Copy link
Member

Please note I've create #1443 to investigate the possibility that we're not passing parse options into the fragment parser; this is related to my note above about the HUGE option having no apparent effect.

@twelve17
Copy link
Author

twelve17 commented Mar 6, 2016

Hi @flavorjones : quick correction. I ran into this originally on my Mac laptop but then moved over to a Linux box to eventually come up with the script I originally attached. Here's the info for both versions on the Linux box:

1.6.7.2

# Nokogiri (1.6.7.2)
    ---
    warnings: []
    nokogiri: 1.6.7.2
    ruby:
      version: 2.1.2
      platform: x86_64-linux
      description: ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/foo/.rbenv/versions/2.1.2-with-shared-libs/lib/ruby/gems/2.1.0/gems/nokogiri-1.6.7.2/ports/x86_64-unknown-linux-gnu/libxml2/2.9.2"
      libxslt_path: "/home/foo/.rbenv/versions/2.1.2-with-shared-libs/lib/ruby/gems/2.1.0/gems/nokogiri-1.6.7.2/ports/x86_64-unknown-linux-gnu/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      - 0003-Stop-parsing-on-entities-boundaries-errors.patch
      - 0004-Cleanup-conditional-section-error-handling.patch
      - 0005-CVE-2015-1819-Enforce-the-reader-to-run-in-constant-.patch
      - 0006-Another-variation-of-overflow-in-Conditional-section.patch
      - 0007-Fix-an-error-in-previous-Conditional-section-patch.patch
      - 0008-CVE-2015-8035-Fix-XZ-compression-support-loop.patch
      - 0009-Updated-config.guess.patch
      - 0010-Fix-parsering-short-unclosed-comment-uninitialized-access.patch
      - 0011-Avoid-extra-processing-of-MarkupDecl-when-EOF.patch
      - 0012-Avoid-processing-entities-after-encoding-conversion-.patch
      - 0013-CVE-2015-7497-Avoid-an-heap-buffer-overflow-in-xmlDi.patch
      - 0014-CVE-2015-5312-Another-entity-expansion-issue.patch
      - 0015-Add-xmlHaltParser-to-stop-the-parser.patch  
      - 0016-Detect-incoherency-on-GROW.patch
      - 0017-CVE-2015-7500-Fix-memory-access-error-due-to-incorre.patch
      - 0018-CVE-2015-8242-Buffer-overead-with-HTML-parser-in-pus.patch
      - 0019-Do-not-print-error-context-when-there-is-none.patch
      - 0020-xmlStopParser-reset-errNo.patch
      - 0021-Reuse-xmlHaltParser-where-it-makes-sense.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch   
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch  
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      - 0016-Fix-for-type-confusion-in-preprocessing-attributes.patch
      - 0017-Updated-config.guess.patch
      compiled: 2.9.2
      loaded: 2.9.2

1.6.6.4

# Nokogiri (1.6.6.4)
    ---
    warnings: []
    nokogiri: 1.6.6.4
    ruby:
      version: 2.1.2
      platform: x86_64-linux
      description: ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/foo/.rbenv/versions/2.1.2-with-shared-libs/lib/ruby/gems/2.1.0/gems/nokogiri-1.6.6.4/ports/x86_64-unknown-linux-gnu/libxml2/2.9.2"
      libxslt_path: "/home/foo/.rbenv/versions/2.1.2-with-shared-libs/lib/ruby/gems/2.1.0/gems/nokogiri-1.6.6.4/ports/x86_64-unknown-linux-gnu/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      - 0003-Stop-parsing-on-entities-boundaries-errors.patch
      - 0004-Cleanup-conditional-section-error-handling.patch
      - 0005-CVE-2015-1819-Enforce-the-reader-to-run-in-constant-.patch
      - 0006-Another-variation-of-overflow-in-Conditional-section.patch
      - 0007-Fix-an-error-in-previous-Conditional-section-patch.patch
      - 0008-CVE-2015-8035-Fix-XZ-compression-support-loop.patch
      - 0010-Fix-parsering-short-unclosed-comment-uninitialized-access.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      - 0016-Fix-for-type-confusion-in-preprocessing-attributes.patch
      compiled: 2.9.2
      loaded: 2.9.2

@twelve17
Copy link
Author

twelve17 commented Mar 6, 2016

@flavorjones ,

Thanks for taking a look at this issue.

In terms of whether it's a bug or not: I should first bring up that I inadvertently muddled what I now see as two distinct issues in the script I originally attached.

  1. The huge option (and perhaps other options) not being passed along to subsequent implicit DocumentFragment parsing. I updated the script to use the parse approach, e.g.:
doc1 = Nokogiri::HTML("<html></html>") do |config|
  config.options = Nokogiri::XML::ParseOptions::HUGE
end

Despite specifying the huge option on the original document, it still does not appear that the in_context call in nokogiri-1.6.7.2/lib/nokogiri/xml/node.rb @ line 403 is aware of it:

[1] pry(#<Nokogiri::XML::Element>)> options
=> #<Nokogiri::XML::ParseOptions:0x007fde1322f800 ... recover, noerror, nowarning, nonet, default_xml, default_html>

That being said, I don't know that it would be trivial to carry over that config when dealing with arbitrary markup strings as a source to a elem.parse(markup_str) call, so I could understand not counting this a bug (although I do see that you did open #1443 to look into it, so thanks!).
2. What I now think is more notable, though, is that, even when omitting the huge option altogether, this throws the fragment error:

docB.xpath('//body').first.parse(bigDocA.xpath('//body').first.inner_html)

While this does not:

Nokogiri::HTML(bigDocA.xpath('//body').first.inner_html)

Which makes me feel like my usage of the huge option in the script is masking some other problem. The standard parse method can handle the markup, but element.parse(X) does not seem to be able to.

@twelve17
Copy link
Author

Hi @flavorjones -- just wanted to see if you saw my previous comment?

@flavorjones
Copy link
Member

Please note that this has been fixed by #2399 (related to #1692). You can now pass ParseOptions to fragment-building methods like Node.new

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants