JRuby Nokogiri raises StackOverflowError when parsing some pages #1501

marshalium · 2016-07-01T00:15:33Z

I have found at least one page that parses without error using Nokogiri 1.6.8 on Ruby 2.2.4 but raises a StackOverflowError on JRuby 1.7.25 and 9.1.2.0.

I expected Nokogiri to parse this file without raising any errors, or at the very least to raise errors consistently across the different Ruby versions.

Here is some Ruby code to reproduce the behavior:

puts RUBY_DESCRIPTION

require 'nokogiri'

puts "Nokigiri version #{Gem.loaded_specs['nokogiri'].version}"

html = <<-EOF
  <td>

  <!doctype html>
  <html>

  <head>
    <title></title>
  </head>

  <body>
    <p class='main'>example text</p>
    <p>
  </body>

  </html>
EOF

begin
  result = Nokogiri::HTML(html)
  puts "SUCCESS: .main text was #{result.at('.main').text.inspect}"
rescue Exception => e
  puts "ERROR: #{e.class}: #{e.message}\n#{e.backtrace.map { |x| "\t#{x}" }.join("\n")}"
  exit(1)
end

Here's an example session running it on all three Ruby versions:

$ rbenv shell 2.2.4
$ ruby nokogiri_stack_overflow.rb
ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-darwin15]
Nokigiri version 1.6.8
SUCCESS: .main text was "example text"

$ rbenv shell jruby-1.7.25
$ ruby nokogiri_stack_overflow.rb
jruby 1.7.25 (1.9.3p551) 2016-04-13 867cb81 on Java HotSpot(TM) 64-Bit Server VM 1.7.0_80-b15 +jit [darwin-x86_64]
Nokigiri version 1.6.8
ERROR: Java::JavaLang::StackOverflowError:
  java.lang.Integer.parseInt(Integer.java:527)
  java.text.MessageFormat.makeFormat(MessageFormat.java:1418)
  java.text.MessageFormat.applyPattern(MessageFormat.java:479)
  java.text.MessageFormat.<init>(MessageFormat.java:363)
  java.text.MessageFormat.format(MessageFormat.java:835)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.formatMessage(HTMLConfiguration.java:646)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:678)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportWarning(HTMLConfiguration.java:660)
  org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:666)
  org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:778)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1037)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  [... this line repeated 1,000 times ...]
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)

$ rbenv shell jruby-9.1.2.0
$ ruby nokogiri_stack_overflow.rb
jruby 9.1.2.0 (2.3.0) 2016-05-26 7357c8f Java HotSpot(TM) 64-Bit Server VM 24.80-b11 on 1.7.0_80-b15 +jit [darwin-x86_64]
Nokigiri version 1.6.8
ERROR: Java::JavaLang::StackOverflowError:
  java.lang.Character.digit(Character.java:6563)
  java.lang.Character.digit(Character.java:6511)
  java.lang.Integer.parseInt(Integer.java:578)
  java.lang.Integer.parseInt(Integer.java:615)
  java.text.MessageFormat.makeFormat(MessageFormat.java:1427)
  java.text.MessageFormat.applyPattern(MessageFormat.java:479)
  java.text.MessageFormat.<init>(MessageFormat.java:362)
  java.text.MessageFormat.format(MessageFormat.java:840)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.formatMessage(HTMLConfiguration.java:646)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:678)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportWarning(HTMLConfiguration.java:660)
  org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:666)
  org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:778)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1037)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  [... this line repeated 1,000 times ...]
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
  org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
  org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
  org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
  org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
  org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
  org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
  nokogiri.internals.NokogiriDomParser.parse(NokogiriDomParser.java:94)
  nokogiri.internals.XmlDomParserContext.do_parse(XmlDomParserContext.java:248)
  nokogiri.internals.XmlDomParserContext.parse(XmlDomParserContext.java:234)
  nokogiri.HtmlDocument.do_parse(HtmlDocument.java:119)
  nokogiri.HtmlDocument.read_memory(HtmlDocument.java:187)
  nokogiri.HtmlDocument$INVOKER$s$0$0$read_memory.call(HtmlDocument$INVOKER$s$0$0$read_memory.gen)
  org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:742)
  org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:298)
  org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:79)
$

flavorjones · 2017-02-10T10:16:58Z

This appears to be a bug in the upstream NekoHTML parser that Nokogiri uses. Would you be interested in reporting this upstream to that project?

marshalium · 2017-05-27T22:00:36Z

Hey @flavorjones! Sorry for the late reply.

I'm happy to also report this upstream. Is https://sourceforge.net/p/nekohtml/bugs/ the correct place to report to?

Any tips for getting a smaller test case that is less JRuby/Nokogiri specific, or do you think this stack as is will be useful to the upstream project?
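One possible way to shrink the test case is a greedy line-removal loop: drop one line at a time and keep the removal whenever the failure still reproduces. The sketch below is only an illustration of that idea. `reduce_lines` and the `still_fails` block are hypothetical names, and the predicate you pass in (for example, one that calls `Nokogiri::HTML` on the joined lines under JRuby and rescues the overflow) is an assumption, not something taken from this thread.

```ruby
# Hypothetical helper: greedily removes lines from a failing input while the
# caller-supplied block still reports a failure for the reduced input.
def reduce_lines(lines, &still_fails)
  current = lines.dup
  changed = true
  while changed
    changed = false
    current.each_index do |i|
      # Candidate input with line i removed.
      candidate = current[0...i] + current[(i + 1)..-1]
      if still_fails.call(candidate)
        current = candidate
        changed = true
        break # restart the scan over the shorter input
      end
    end
  end
  current
end
```

On JRuby the block could look something like `{ |ls| begin; Nokogiri::HTML(ls.join("\n")); false; rescue Exception; true; end }`, so that only inputs that still raise are kept.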

flavorjones · 2018-03-20T21:55:52Z

Oooh, I continue to be later than you with my replies. :-\

When I stop catching this exception, the error becomes more clear:

jruby 9.1.15.0 (2.3.3) 2017-12-07 929fde8 OpenJDK 64-Bit Server VM 25.151-b12 on 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 [linux-x86_64]
Nokogiri version 1.8.2
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace

Let me spend a few minutes trying to build a repro in Java.

flavorjones · 2018-03-20T22:34:06Z

Welp, when I use the nekohtml HTML sample program to parse this, there's no error, which tells me it has something to do with how we've configured the parser. Here's that code:

package sample;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TestHTMLDOM {
    public static void main(String[] argv) throws Exception {
        DOMParser parser = new DOMParser();
        for (int i = 0; i < argv.length; i++) {
            parser.parse(argv[i]);
            print(parser.getDocument(), "");
        }
    }
    public static void print(Node node, String indent) {
        System.out.println(indent+node.getClass().getName());
        Node child = node.getFirstChild();
        while (child != null) {
            print(child, indent+" ");
            child = child.getNextSibling();
        }
    }
}

flavorjones · 2018-03-20T23:04:22Z

OK, I tried configuring the org.apache.xerces.parsers.DOMParser that Nokogiri uses to parse HTML, and I can't reproduce this.

Does anybody have time and bandwidth to look into trying to repro this in Java and filing a bug upstream?

jvshahid · 2018-03-22T22:15:54Z

I spent a few hours trying to debug this. I think it is this block of code that is causing the stack overflow. It replaces the elements used by the HTMLTagBalancer, which causes its internal stack of visited elements to be wrong, eventually leading to the infinite loop. I'm still trying to understand what's going on, but thought I would share what I've found so far in case someone else is looking into this.

The patch accidentally removed the parents of the TR element. This caused any document fragment with a dangling (i.e. with no parent) TD or TR element to cause a stack overflow. Fixes #1501

jvshahid · 2018-03-25T14:40:45Z

Pushed a fix in #1743. The parsed document is different on JRuby and MRI. Not sure if that's something we want to try to fix or just treat as an expected Xerces/libxml difference. I would also like some ideas on how to test it.
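One idea for a test, as a rough sketch only: feed the HTML from the original report to the parser and check that parsing completes, without asserting anything about the shape of the resulting tree (since that differs between JRuby and MRI). `parse_outcome` and `REPRO_HTML` are hypothetical names introduced here. The parser is passed in as a callable so the helper does not assume how Nokogiri is loaded.

```ruby
# HTML copied from the original report in this issue.
REPRO_HTML = <<-EOF
  <td>

  <!doctype html>
  <html>

  <head>
    <title></title>
  </head>

  <body>
    <p class='main'>example text</p>
    <p>
  </body>

  </html>
EOF

# Hypothetical helper: runs the given parser callable and returns a small
# hash describing the outcome instead of letting the error escape.
def parse_outcome(html, parser)
  doc = parser.call(html)
  { ok: true, main_text: doc.at('.main').text }
rescue Exception => e
  # Rescue Exception, as the original script does, because the overflow
  # surfaces as a Java error rather than an ordinary Ruby StandardError.
  { ok: false, error: e.class.name }
end
```

With Nokogiri loaded, a regression test could call `parse_outcome(REPRO_HTML, ->(h) { Nokogiri::HTML(h) })` and assert that `:ok` is true on every platform.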

This is an ugly change whose only purpose is to mask the difference between libxml and nekohtml. We agreed to stop doing that a while ago and just accept that different libraries will behave differently. Furthermore, it caused a stack overflow while parsing documents with a TD element that doesn't have any parents in #1501. Fixes #1501

marshalium · 2018-03-26T23:47:56Z

Thank you for working on this @jvshahid!

flavorjones · 2018-03-29T19:22:52Z

Should be fixed in the next release.

flavorjones added the platform/jruby label Oct 3, 2016

flavorjones added the vendored/nekohtml label Feb 10, 2017

flavorjones added the help wanted label Mar 20, 2018

jvshahid mentioned this issue Mar 25, 2018

fix a monkey patch introduced in #1251 #1743

Closed

jvshahid mentioned this issue Mar 25, 2018

remove monkey patch introduced in #1251 #1744

Merged

flavorjones closed this as completed in #1744 Mar 29, 2018

flavorjones added this to the 1.8.3 milestone Mar 29, 2018

flavorjones mentioned this issue Oct 15, 2018

Nokogiri on JRuby adds tbody tag to parsed document, resulting in inconsistent query results #1803

Closed
