JRuby Nokogiri raises StackOverflowError when parsing some pages #1501

Closed
marshalium opened this issue Jul 1, 2016 · 9 comments · Fixed by #1744

Comments

@marshalium

marshalium commented Jul 1, 2016

I have found at least one page that parses without error using Nokogiri 1.6.8 on Ruby 2.2.4 but raises a StackOverflowError on JRuby 1.7.25 and 9.1.2.0.

I expected Nokogiri to parse this file without raising any errors, or at the very least to raise errors consistently across the different Ruby versions.

Here is some Ruby code to reproduce the behavior:

puts RUBY_DESCRIPTION

require 'nokogiri'

puts "Nokigiri version #{Gem.loaded_specs['nokogiri'].version}"

html = <<-EOF
  <td>

  <!doctype html>
  <html>

  <head>
    <title></title>
  </head>

  <body>
    <p class='main'>example text</p>
    <p>
  </body>

  </html>
EOF

begin
  result = Nokogiri::HTML(html)
  puts "SUCCESS: .main text was #{result.at('.main').text.inspect}"
rescue Exception => e
  puts "ERROR: #{e.class}: #{e.message}\n#{e.backtrace.map { |x| "\t#{x}" }.join("\n")}"
  exit(1)
end

Here's an example session running it on all three Ruby versions:

$ rbenv shell 2.2.4
$ ruby nokogiri_stack_overflow.rb
ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-darwin15]
Nokigiri version 1.6.8
SUCCESS: .main text was "example text"

$ rbenv shell jruby-1.7.25
$ ruby nokogiri_stack_overflow.rb
jruby 1.7.25 (1.9.3p551) 2016-04-13 867cb81 on Java HotSpot(TM) 64-Bit Server VM 1.7.0_80-b15 +jit [darwin-x86_64]
Nokigiri version 1.6.8
ERROR: Java::JavaLang::StackOverflowError:
  java.lang.Integer.parseInt(Integer.java:527)
  java.text.MessageFormat.makeFormat(MessageFormat.java:1418)
  java.text.MessageFormat.applyPattern(MessageFormat.java:479)
  java.text.MessageFormat.<init>(MessageFormat.java:363)
  java.text.MessageFormat.format(MessageFormat.java:835)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.formatMessage(HTMLConfiguration.java:646)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:678)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportWarning(HTMLConfiguration.java:660)
  org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:666)
  org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:778)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1037)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  [... this line repeated 1,000 times ...]
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)

$ rbenv shell jruby-9.1.2.0
$ ruby nokogiri_stack_overflow.rb
jruby 9.1.2.0 (2.3.0) 2016-05-26 7357c8f Java HotSpot(TM) 64-Bit Server VM 24.80-b11 on 1.7.0_80-b15 +jit [darwin-x86_64]
Nokigiri version 1.6.8
ERROR: Java::JavaLang::StackOverflowError:
  java.lang.Character.digit(Character.java:6563)
  java.lang.Character.digit(Character.java:6511)
  java.lang.Integer.parseInt(Integer.java:578)
  java.lang.Integer.parseInt(Integer.java:615)
  java.text.MessageFormat.makeFormat(MessageFormat.java:1427)
  java.text.MessageFormat.applyPattern(MessageFormat.java:479)
  java.text.MessageFormat.<init>(MessageFormat.java:362)
  java.text.MessageFormat.format(MessageFormat.java:840)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.formatMessage(HTMLConfiguration.java:646)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:678)
  org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportWarning(HTMLConfiguration.java:660)
  org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:666)
  org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:778)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1037)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  [... this line repeated 1,000 times ...]
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1038)
  org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
  org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
  org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
  org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
  org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
  org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
  org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
  nokogiri.internals.NokogiriDomParser.parse(NokogiriDomParser.java:94)
  nokogiri.internals.XmlDomParserContext.do_parse(XmlDomParserContext.java:248)
  nokogiri.internals.XmlDomParserContext.parse(XmlDomParserContext.java:234)
  nokogiri.HtmlDocument.do_parse(HtmlDocument.java:119)
  nokogiri.HtmlDocument.read_memory(HtmlDocument.java:187)
  nokogiri.HtmlDocument$INVOKER$s$0$0$read_memory.call(HtmlDocument$INVOKER$s$0$0$read_memory.gen)
  org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:742)
  org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:298)
  org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:79)
$
@flavorjones
Member

This appears to be a bug in the upstream NekoHTML parser that Nokogiri uses. Would you be interested in reporting this upstream to that project?

@marshalium
Author

Hey @flavorjones! Sorry for the late reply.

I'm happy to also report this upstream. Is https://sourceforge.net/p/nekohtml/bugs/ the correct place to report to?

Any tips for getting a smaller test case that is less JRuby/Nokogiri specific, or do you think this stack as is will be useful to the upstream project?
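
One common way to shrink a failing input is to delete lines greedily while the failure still reproduces. The sketch below is only an illustration of that idea, not something from the Nokogiri or NekoHTML projects: `minimize_lines` and the `still_fails` block are hypothetical names, and the block is assumed to return true while the candidate input still triggers the bug.

```ruby
# Greedy line-by-line minimizer (a simplified form of delta debugging).
# `still_fails` is a caller-supplied block (hypothetical name) that returns
# true while the candidate input still triggers the bug.
def minimize_lines(input, &still_fails)
  lines = input.lines
  changed = true
  while changed
    changed = false
    lines.each_index do |i|
      # Try the input with line i removed.
      candidate = lines[0...i] + lines[(i + 1)..-1]
      next unless still_fails.call(candidate.join)
      lines = candidate
      changed = true
      break # indices have shifted, so restart the scan
    end
  end
  lines.join
end
```

For this issue, the block could wrap `Nokogiri::HTML(candidate)` in a `begin`/`rescue Exception` and return true only when the rescued class is the `StackOverflowError` shown in the traces above. The lines that survive are a candidate for a smaller upstream report.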

@flavorjones
Member

Oooh, I continue to be later than you with my replies. :-\

When I stop catching this exception, the error becomes more clear:

jruby 9.1.15.0 (2.3.3) 2017-12-07 929fde8 OpenJDK 64-Bit Server VM 25.151-b12 on 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 [linux-x86_64]
Nokogiri version 1.8.2
Error: Your application used more stack memory than the safety cap of 2048K.
Specify -J-Xss####k to increase it (#### = cap size in KB).
Specify -w for full java.lang.StackOverflowError stack trace

Let me spend a few minutes trying to build a repro in Java.

@flavorjones
Member

Welp, when I use the nekohtml HTML sample program to parse this, there's no error, which tells me it has something to do with how we've configured the parser. Here's that code:

package sample;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TestHTMLDOM {
    public static void main(String[] argv) throws Exception {
        DOMParser parser = new DOMParser();
        for (int i = 0; i < argv.length; i++) {
            parser.parse(argv[i]);
            print(parser.getDocument(), "");
        }
    }
    public static void print(Node node, String indent) {
        System.out.println(indent+node.getClass().getName());
        Node child = node.getFirstChild();
        while (child != null) {
            print(child, indent+" ");
            child = child.getNextSibling();
        }
    }
}

@flavorjones
Member

OK, I tried configuring the org.apache.xerces.parsers.DOMParser that Nokogiri uses to parse HTML, and I can't reproduce this.

Does anybody have time and bandwidth to look into trying to repro this in Java and filing a bug upstream?

@jvshahid
Member

I spent a few hours trying to debug this. I think it is this block of code that is causing the stack overflow. It replaces the elements used by the HTMLTagBalancer, which causes its internal stack of visited elements to be wrong, eventually leading to the infinite loop. I'm still trying to understand what's going on, but thought I would share what I've found so far in case someone else is looking into this.

jvshahid added a commit that referenced this issue Mar 25, 2018
the patch accidentally removed the parents of the TR element. This caused any
document fragment with a dangling (i.e. with no parent) TD or TR element to
cause a stack overflow

fixes #1501
@jvshahid
Member

Pushed a fix in #1743. The parsed document differs between JRuby and MRI. I'm not sure whether that's something we want to try to fix or just treat as an expected Xerces/libxml difference. I would also like some ideas on how to test it.
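
One possible shape for such a test, sketched under assumptions: a small helper that reports whether a block exhausts the stack, so a test can assert that parsing does not. The helper name `stack_overflow?` and the constant `JAVA_OVERFLOW` are hypothetical. The Java error class is matched by name (as it appears in the traces above) so the snippet also loads on MRI, where deep recursion raises `SystemStackError` instead.

```ruby
# Class name of the Java error as reported in the JRuby traces in this issue.
JAVA_OVERFLOW = "Java::JavaLang::StackOverflowError"

# Hypothetical helper: returns true if the block exhausts the stack,
# false if it completes normally. Any other exception is re-raised.
def stack_overflow?
  yield
  false
rescue SystemStackError
  # MRI (and Ruby-level recursion in general) reports overflow this way.
  true
rescue Exception => e
  # Rescue broadly, as the original repro script does, then let through
  # everything that is not the Java stack overflow.
  raise unless e.class.name == JAVA_OVERFLOW
  true
end
```

A regression test could then assert that `stack_overflow? { Nokogiri::HTML(html) }` is false for the document from the original report, without depending on the exact tree that Xerces or libxml produces.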

jvshahid added a commit that referenced this issue Mar 25, 2018
this is an ugly change whose only purpose is to mask the difference between
libxml and nekohtml. we agreed to stop doing that a while ago and just accept
that different libraries will behave differently. furthermore, it caused a stack
overflow while parsing documents with a TD element that doesn't have any
parents in #1501

fixes #1501
@marshalium
Author

Thank you for working on this @jvshahid!

@flavorjones
Member

Should be fixed in the next release.
