
JRuby XPATH Memory Usage #1749

Closed
peoplesmeat opened this issue Apr 3, 2018 · 16 comments · Fixed by #1792

peoplesmeat commented Apr 3, 2018

When using Nokogiri on JRuby with a nested XPath loop, the document's memory footprint explodes in size.

For example, with a document like
<items> <item> <value1> <value2> <item>
... for many thousands of items

And attempting to use (for example) an xpath like: doc.xpath('items').each { |node| node.xpath('value1') }

You'll wind up with a document that can occupy hundreds of megabytes due to caching in the CACHED_XPATH_CONTEXT layer. Specifically, nokogiri.internals.XalanDTMManagerPatch winds up with thousands of values in "m_dtm". I'm not an expert in this area and am unclear what that terminology refers to. In my case, a document with 4000 items consumed 4GB of memory in cached XPath state, and there appears to be no way to clear that specific cache.

This behavior is not present in the MRI Ruby version.

@flavorjones (Member)

Hi,

Thanks for reporting this. The issue template that you deleted when filing this had a few key questions that will help us reproduce and diagnose this issue:

What's the output from nokogiri -v?

Can you provide a self-contained script that reproduces what you're seeing?

Based on what you've written, I'm assuming we'll be able to reproduce it, but helping us out means we're likely to get to it sooner, and we'll have a test case to work against.

@flavorjones (Member)

A couple of things that a working example would clarify: the document structure (are the tags intended to be unclosed, self-closing, or ...?) and the query (is the XPath really value1, or did you mean //value1, or ...?)
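For context on why the value1 vs //value1 distinction matters: a path beginning with // searches from the document root regardless of the context node, while a relative path searches only beneath the context node. A minimal sketch of the difference, shown here with Ruby's stdlib REXML rather than Nokogiri (the XPath semantics are the same):

```ruby
require 'rexml/document'

xml = '<items><item><value1/></item><item><value1/></item></items>'
doc = REXML::Document.new(xml)
first_item = doc.get_elements('/items/item').first

# Relative path: matches only beneath the context node (the first item)
puts REXML::XPath.match(first_item, './value1').length   # 1

# Abbreviated absolute path //: searches the whole document,
# regardless of which node the query is issued from
puts REXML::XPath.match(first_item, '//value1').length   # 2
```

So a nested loop using //value1 re-scans the entire document on every iteration, which is a very different workload from a relative query.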

@flavorjones (Member)

Here's my attempt to reproduce:

#!/usr/bin/env ruby

require 'nokogiri'

xml = '<items>' + '<item><value1/><value2/></item>'*4000

doc = Nokogiri::XML xml

puts "pid is #{$$}"

loop do
  doc.xpath('items').each { |node| node.xpath('//value1') }
  system "cat /proc/#{$$}/status | egrep 'VmSize|VmRSS'"
end

on MRI with this config:

# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.4.1
      platform: x86_64-linux
      description: ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/flavorjones/.rvm/gems/ruby-2.4.1/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.7"
      libxslt_path: "/home/flavorjones/.rvm/gems/ruby-2.4.1/gems/nokogiri-1.8.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.32"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.7
      loaded: 2.9.7

memory usage stabilizes at:

VmSize:	   75316 kB
VmRSS:	   26968 kB

on JRuby with this config:

# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.3.3
      platform: java
      description: jruby 9.1.15.0 (2.3.3) 2017-12-07 929fde8 OpenJDK 64-Bit Server VM
        25.162-b12 on 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12 [linux-x86_64]
      engine: jruby
      jruby: 9.1.15.0
    xerces: Xerces-J 2.11.0
    nekohtml: NekoHTML 1.9.21

the memory usage is much larger, stabilizing around:

VmSize:	 3068068 kB
VmRSS:	  401372 kB

does this match what you're seeing?

@flavorjones (Member)

(I'll note that just the JRuby interpreter is rather large, around

VmSize:	 3065972 kB
VmRSS:	  168000 kB

)

@peoplesmeat (Author)

Thanks for taking a look at this, and sorry for not including some vital pieces of information. I wasn't sure whether this would just be chalked up to the general terribleness of XPath on Java (à la #741). I'll have to collect the details tomorrow, but in my case I was using jruby-9.1.14.0 and Nokogiri 1.8.2 on Java 8. I was also looking at memory dumps to get the exact size details, because even with 2GB of heap space, one or two documents would run out of memory. And of course there were XML namespaces involved, so it'll take a bit of work to narrow down a representative example.

@peoplesmeat (Author)

Here's a representative script to demonstrate the problem. This script will blow a 2GB heap in a few seconds.

#!/usr/bin/env ruby
require 'nokogiri'

items = %Q(
<net:Item>
  <net:InformationBlock>
    derp
  </net:InformationBlock>
</net:Item>
) * 5000

xml = %Q(
<net:Information xmlns:net="urn:payload:information">
  <net:Items>
    #{items}
  </net:Items>
</net:Information>
)
doc = Nokogiri::XML xml
puts "pid is #{$$}"
doc.xpath('./net:Information/net:Items/net:Item').each do |node| 
  node.xpath('./net:InformationBlock')
end

# Nokogiri (1.8.2)
    ---
    warnings: []
    nokogiri: 1.8.2
    ruby:
      version: 2.3.3
      platform: java
      description: jruby 9.1.14.0 (2.3.3) 2017-11-08 2176f24 Java HotSpot(TM) 64-Bit Server
        VM 25.91-b14 on 1.8.0_91-b14 +jit [darwin-x86_64]
      engine: jruby
      jruby: 9.1.14.0
    xerces: Xerces-J 2.11.0
    nekohtml: NekoHTML 1.9.21

@jeremyhaile

We are also running out of memory on an 8GB heap, and it seems to be this issue – a heap dump shows XPathContexts with gigabytes of int[]s tied back to m_dtms.

@flavorjones (Member)

I'm just back from vacation and it will take a few days to catch up on everything. Thanks for your patience.

jeremyhaile commented Aug 30, 2018

Here is a modified reproducible example that is closer to the issue we're having. It uses HTML, and it selects with CSS instead of XPath (which are the same under the hood). Also, the .each from the original example is unnecessary.

On a 2GB heap this will crash in a few seconds.

require 'nokogiri'

items = %Q(
<div class="test">
  <a href="something">derp</a>
</div>
) * 5000

html = %Q(
<html>
  <body>
    #{items}
  </body>
</html>
)
doc = Nokogiri::HTML html
puts "pid is #{$$}"
links = doc.css('html body .test a')
puts "Without nesting got #{links.length} links" # This will return immediately

links = doc.css('html body .test').css('a')
puts "With nesting got #{links.length} links" # This will crash a 2GB heap and never output

@jeremyhaile

@flavorjones we are open to trying to fix this ourselves, but are unfamiliar with the source. Do you have any idea where to even start looking here?

I would really appreciate any help you can provide, as this is causing our production server to crash on a regular basis. We've tried to work around this issue by never doing nested xpath/css, but perhaps we missed something, or a third-party lib (e.g. readability) is doing it, because it keeps running out of memory (even with a 12GB heap).

The heap dump shows nokogiri objects filling up the heap:
[heap dump screenshot]
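For anyone hitting this before a fix lands: the pattern that triggers the blow-up is running a second query from each node returned by a first query. Where possible, the two queries can be collapsed into a single expression evaluated once against the document. A sketch of that rewrite, shown with Ruby's stdlib REXML rather than Nokogiri (the shape of the fix is the same regardless of backend):

```ruby
require 'rexml/document'

xml = '<items>' + '<item><value1/><value2/></item>' * 3 + '</items>'
doc = REXML::Document.new(xml)

# Nested form: one inner query per outer node (the pattern that leaks on JRuby)
nested = doc.get_elements('/items/item').flat_map { |item| item.get_elements('./value1') }

# Flattened form: a single combined query evaluated once
combined = doc.get_elements('/items/item/value1')

puts nested.length    # 3
puts combined.length  # 3 -- same nodes, one query
```

In Nokogiri the equivalent flattening would be `doc.xpath('/items/item/value1')` instead of `doc.xpath('/items/item').map { |n| n.xpath('./value1') }`, or a single combined CSS selector instead of chained `.css` calls.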

@kares (Contributor) commented Sep 1, 2018

I also noticed increased memory usage under JRuby from ~1.6 – thought maybe some of my changes were to blame, but there never was a production leak, even with heavy xpath use on 1-2M xmls.

@jeremyhaile you simply need to try to understand the Xalan library used under the hood and why it's keeping the internal state around; maybe it's xpath-cache related. Just some minor hints – it's definitely resolvable if enough quality time is put in (no clear guess really, as some internal pieces get tricky).

@jeremyhaile

I noticed that the class that is taking up all of the heap is actually in the nokogiri source, despite being in the org.apache namespace: https://github.com/sparklemotion/nokogiri/blob/master/ext/java/org/apache/xml/dtm/ref/dom2dtm/DOM2DTMExt.java

Is this class overriding behavior from Xalan?

@jeremyhaile

@kares I noticed you wrote a lot of the code in XalanDTMManagerPatch and DOM2DTMExt. Do you have any ideas on where this might be occurring? I notice that the getDTM method was overridden to return DOM2DTMExt objects, and those are then added via addDTM. I'm not sure whether there should be fewer DTM nodes, or whether it's a problem within the DTM nodes that causes them to retain too many objects. Would really appreciate some help!

jvshahid added a commit that referenced this issue Sep 5, 2018
Looks like we replaced DOM2DTM with DOM2DTMExt when we fixed
#1320 but forgot to replace it in
the DOM2DTM manager

fixes #1749
@jvshahid (Member) commented Sep 5, 2018

@jeremyhaile can you try the branch in #1792 and let me know if it fixes your issue?

@kares (Contributor) commented Sep 5, 2018

@jeremyhaile no ideas atm – would need to dive deep on this one to really understand what's going on.
Hopefully John's fix resolves it – seems like there was some left-over to be handled. If not, get in touch.

@jeremyhaile commented Sep 5, 2018

@kares The branch from @jvshahid fixes the memory issue. However, as I outlined along with a reproducible test case – there is still a huge performance penalty incurred by nested xpath queries, and the penalty seems to grow exponentially based on the number of elements being searched.

Here is my relevant comment on the PR:
#1792 (comment)
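To quantify the penalty described above, one can time the nested pattern at increasing document sizes and watch how the cost scales. A rough measurement sketch using Ruby's stdlib Benchmark and REXML – the absolute numbers will differ from Nokogiri on JRuby, but the shape of the curve across sizes is what the comparison cares about:

```ruby
require 'rexml/document'
require 'benchmark'

timings = {}
[250, 500, 1000].each do |n|
  xml = '<items>' + '<item><value1/></item>' * n + '</items>'
  doc = REXML::Document.new(xml)

  # Time the nested pattern: one inner query per outer node
  timings[n] = Benchmark.realtime do
    doc.get_elements('/items/item').each { |item| item.get_elements('./value1') }
  end
end

# Roughly linear growth suggests per-node overhead; super-linear growth
# points at per-query state accumulating, as reported in this issue.
timings.each { |n, t| puts format('%4d items: %.3fs', n, t) }
```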

flavorjones pushed a commit that referenced this issue Dec 1, 2018
Looks like we replaced DOM2DTM with DOM2DTMExt when we fixed
#1320 but forgot to replace it in
the DOM2DTM manager

fixes #1749