.inner_html= causing memory issue #2349
Comments
@vanboom Thank you for opening this issue, and I'm sorry that you're seeing an issue. I'll take a look later today to try to reproduce and diagnose what's happening.
@vanboom Unfortunately this is intended behavior. I'll try to briefly explain what's happening and why, and then we can chat about how to work around this.

**Nokogiri's memory model (re: libxml2)**

I've tried to write about the Nokogiri memory model before, see #1952 (comment) as an example, and I would like to write this up more completely (see #2183). But briefly:

Garbage collection ("GC") of libxml2 data structures is controlled by the Document object's lifecycle -- the libxml2 function xmlFreeDoc frees all the node data structures owned by the document, and so we take some pains to make sure that Document and its related Node objects are all GCed at the same time. This means that an individual Node's ruby object lifecycle cannot be used to free the underlying memory allocated by libxml2.

Importantly, the Document object keeps a reference to every Node ruby object it owns in a "node cache". Even if a node later gets unparented from the DOM tree, it's still kept around. This has the advantage of being easy to reason about (important for reliable behavior), but ends up being inefficient in some cases -- like yours.

**Doing better**

This memory model is simple, and emerged from evolving the design of Nokogiri in its early days; and although it's reliable and easy to reason about in a broad set of use cases, it's not the most complete or most efficient design. We could improve it in a number of ways by introducing more complex handling of edge cases. Unfortunately, in practice, this has been hard to do well because there tend to be lots of edge cases that are challenging to reason about.

Another idea which I've been planning to investigate over the next few months is to take a hard look at the very simple memory model used by libxml-ruby as of v3.0.0 and later, which is described in that project's HISTORY as:
That feels like an invasive change, though, and if we proceed we would be very careful about introducing it. I would want to resurrect the GC test suite, and I would want to first more-cleanly separate the memory management behavior from the functional behavior. That will take time, and so this idea doesn't really help you today.

**Avoiding this issue**

The memory growth occurs in your case because existing nodes are being replaced within a Document. Without knowing more about your specific code and use case, I can only offer some general ideas; but I'd be happy to go into more depth with you if you can show me code that's closer to your real use case.

One idea is to treat the original document as an immutable data source, and create a second, separate document that is the "final" version. Something like:

```ruby
original = Nokogiri::XML(File.read('/tmp/test.xml'))
output = Nokogiri::XML("<root></root>")
original.children.map do |c|
  output.root.add_child(c)
end
```

Another idea is to periodically "dup" the document, which only carries over the DOM nodes, and allows GC to free the unparented nodes:

```ruby
n = Nokogiri::XML(File.read('/tmp/test.xml'))
500000.times do |j|
  n.children.map do |c|
    c.inner_html = c.inner_html
  end
  n = n.dup if j % 100 == 0 # ← clone the document every 100 iterations
end
```

If neither of these ideas is applicable, then I'd like to ask for a more specific use case so that I can be more helpful!
Thank you @flavorjones. Our app is using the … The … I think an improvement could be made to that gem, as it appears to call … Your suggestion about duping the document is interesting -- the … I will keep digging and post back if I can think of anything further. Thanks!
Yes - minimizing calls to `inner_html=` …
Closing, but please do let me know if you need any help or advice. |
Thanks for your advice and for documenting the memory model in more detail here. Not sure I am smart enough yet with this code to attempt a PR but will do so if something becomes clearer to me. I find it concerning that a simple node replacement causes such a memory issue. Seems memory should be freed when the node is replaced if there is no variable to hold it in scope.
Yes, but as I tried to explain above, this is challenging to determine given the current memory management model. I agree this needs to be improved and in fact said so above, at length, in the section of my response entitled "Doing Better." |
**Please describe the bug**
We are hunting down an issue in a Ruby on Rails application that uses Nokogiri to substitute some XML elements in the content.xml of a MS Word document.
Calling `inner_html=` repetitively appears to have memory consequences that appear as a leak: the process memory continues to increase. This can be tested by simply setting `inner_html` to itself.

**Help us reproduce what you're seeing**
Pseudocode:

… `inner_html` to replace the node's content.

We did this in our application IRB in development, which includes the memory_profiler gem and nokogiri 1.12.5.
**Expected behavior**

We would expect that memory would not continue to increase with each call to `inner_html=`. It seems that Nokogiri is not freeing up memory as it parses the string given to `inner_html=`, or as it replaces the node's contents.
See attached XML file that we used for testing. Our application performs a similar workflow on a much larger XML document, so the memory consumption issue grows much faster.
**Environment**
loremtest.zip