Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.inner_html= causing memory issue #2349

Closed
vanboom opened this issue Oct 24, 2021 · 7 comments
Closed

.inner_html= causing memory issue #2349

vanboom opened this issue Oct 24, 2021 · 7 comments
Labels
meta/user-help topic/memory Segfaults, memory leaks, valgrind testing, etc.

Comments

@vanboom
Copy link

vanboom commented Oct 24, 2021

Please describe the bug
We are hunting down an issue in a Ruby on Rails application that uses Nokogiri to substitute some XML elements in the content.xml of a MS Word document.

Calling inner_html= repetitvely appears to have memory consequences that appear as a leak. The process memory continues to increase. This can be tested by simply setting inner_html to itself.

Help us reproduce what you're seeing
Pseudocode:

  1. Load some XML
  2. Iterate through all of the children nodes and call inner_html to replace the node's content.
  3. Check process memory along the way

We did this in our application IRB in development which includes the memory_profiler gem and nokogiri 1.12.5.

n = Nokogiri::XML(File.read('/tmp/test.xml'))
50000.times do n.children.map{|c| c.inner_html = c.inner_html}; end;
GetProcessMem.new.gb
irb(main):082:0> GetProcessMem.new.gb
=> 13.417617797851562

Expected behavior
We would expect that memory would not continue to increase with each call to inner_html=.
It seems that Nokogiri is not freeing up memory as it parses the string given to inner_html, or as it replaces the node's contents.

See attached XML file that we used for testing. Our appilication performs a similar work flow on a much larger XML document so the memory consumption issue grows much faster.

Environment

    ---
    warnings: []
    nokogiri:
      version: 1.12.5
      cppflags:
      - "-I/opt/rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri"
      - "-I/opt/rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri/include"
      - "-I/opt/rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.5.0
      platform: x86_64-linux
      gem_platform: x86_64-linux
      description: ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - 0007-Fix-XPath-recursion-limit.patch
      libxml2_path: "/opt/rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/nokogiri-1.12.5-x86_64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.12
      loaded: 2.9.12
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      - 0002-Fix-xml2-config-check-in-configure-script.patch
      datetime_enabled: true
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11
      libgumbo: 1.0.0-nokogiri

loremtest.zip

@vanboom vanboom added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Oct 24, 2021
@flavorjones
Copy link
Member

@vanboom Thank you for opening this issue, and I'm sorry that you're seeing an issue. I'll take a look later today to try to reproduce and diagnose what's happening.

@flavorjones flavorjones added topic/memory Segfaults, memory leaks, valgrind testing, etc. and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Oct 25, 2021
@flavorjones
Copy link
Member

@vanboom Unfortunately this is intended behavior. I'll try to briefly explain what's happening and why, and then we can chat about how to work around this.

Nokogiri's memory model (re: libxml2)

I've tried to write about the Nokogiri memory model before, see #1952 (comment) as an example; and I would like to write this up more completely (see #2183). But briefly:

Garbage collection ("GC") of libxml2 data structures is controlled by the Document object's lifecycle -- the libxml2 function xmlFreeDoc frees all the node data structures owned by the document, and so we take some pains to make sure that Document and its related Node objects are all GCed at the same time. This means that individual Node ruby object lifecycle is useless to free the underlying memory that's allocated by libxml2.

Importantly, the Document object keeps a reference to every Node ruby object it owns in a "node cache". Even if the node later gets unparented from the DOM tree, it's still kept around. This has the advantage of being easy to reason about (important for reliable behavior), but ends up being inefficient in some cases -- like yours.

Doing better

This memory model is simple, and emerged from evolving the design of Nokogiri in its early days; and although it's reliable and easy to reason about in a broad set of use cases, it's not the most complete or most efficient design.

We could improve it in a number of ways by introducing more complex handling of edge cases. Unfortunately, in practice, this has been hard to do well because there tend to be lots of edge cases that are challenging to reason about.

Another idea which I've been planning to investigate over the next few months is to take a hard look at the very simple memory model used by libxml-ruby as of v3.0.0 and later, which is described in that project's HISTORY as:

Instead of trying to return the same ruby object for each xmlnode, the bindings now create wrapper
ruby objects as needed which are then freed at the end of use. This allows most memory management
to be handled by libxml itself. Ruby only manages the lifespan of documents and parent xml nodes.
When those go out of scope, the underlying libxml objects are also freed. This implementation
requires almost no overhead, plays nicely with Nokogiri and appears to work much better.

That feels like an invasive change, though, and if we proceed we would be very careful about introducing it. I would want to resurrect the GC test suite, and I would want to first more-cleanly separate the memory management behavior from the functional behavior. That will take time and so this idea doesn't really help you today.

Avoiding this issue

The memory growth occurs in your case because existing nodes are being replaced within a Document. Without knowing more about your specific code and use case, I can only offer some general ideas; but I'd be happy to go into more depth with you if you can show me code that's closer to your real user case.

One idea is to treat the original document as an immutable data source, and create a second, separate document that is the "final" version. Something like:

original = Nokogiri::XML(File.read('/tmp/test.xml'))
output = Nokogiri::XML("<root></root>")
original.children.map do |c|
  output.root.add_child(c)
end

Another idea is to periodically "dup" the document, which only carries over the DOM nodes, and allows GC to free the unparented nodes:

n = Nokogiri::XML(File.read('/tmp/test.xml'))
500000.times do |j|
  n.children.map do |c|
    c.inner_html = c.inner_html
    n = n.dup if j % 100 == 0 # ← clone the document every 100 iterations
  end
end

If neither of these ideas is applicable, then I'd like to ask for a more specific use case so that I can be more helpful!

@vanboom
Copy link
Author

vanboom commented Oct 25, 2021

Thank you @flavorjones. Our app is using the odf-report gem and this is where the node contents replacement is happening: https://github.com/sandrods/odf-report/blob/16f5b38828d7734fd3e9a72aa1b000384296f7cc/lib/odf-report/field.rb#L22

The odf-report gem works by opening an Open Office document file, then text substituting template tags (e.g. [MY NAME]) in the document contents. As you can see by the reference, it performs the text substitution and then replaces the node's contetnt by calling inner_html=.

I think an improvement could be made to that gem, as it appears to call inner_html= even if there are no changes to the node's text.

Your suggestion about duping the document is interesting -- the odf-report logic is building a new document by duping the template document along the way, performing the text substitution. Would duping the node after performing inner_html= have the same effect of freeing the unparented nodes? (Note, at the time inner_html= is called, the node is not parented. It is added to the document later in the odf-report process.

I will keep digging and post back if I can think of anything further. Thanks!

@vanboom
Copy link
Author

vanboom commented Oct 25, 2021

Yes - minimizing calls to inner_html= in the odf-report gem reduced the memory footprint for our process from 3.3G to 750M.

@flavorjones
Copy link
Member

Closing, but please do let me know if you need any help or advice.

@vanboom
Copy link
Author

vanboom commented Nov 6, 2021

Thanks for your advice and for documenting the memory model in more detail here. Not sure I am smart enough yet with this code to attempt a PR but will do so if something becomes clearer to me. I find it concerning that a simple node replacement causes such a memory issue. Seems memory should be freed when the node is replaced if there is no variable to hold it in scope.

@flavorjones
Copy link
Member

Seems memory should be freed when the node is replaced if there is no variable to hold it in scope

Yes, but as I tried to explain above, this is challenging to determine given the current memory management model. I agree this needs to be improved and in fact said so above, at length, in the section of my response entitled "Doing Better."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta/user-help topic/memory Segfaults, memory leaks, valgrind testing, etc.
Projects
None yet
Development

No branches or pull requests

2 participants