Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing internal_subset leaks memory #1784

Closed
stevecheckoway opened this issue Aug 14, 2018 · 5 comments
Closed

Removing internal_subset leaks memory #1784

stevecheckoway opened this issue Aug 14, 2018 · 5 comments
Milestone

Comments

@stevecheckoway
Copy link
Contributor

Removing an HTML document's internal_subset leaks memory.

What's the output from nokogiri -v?

# Nokogiri (1.7.0.1)
    ---
    warnings: []
    nokogiri: 1.7.0.1
    ruby:
      version: 2.4.4
      platform: x86_64-darwin17
      description: ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-darwin17]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-apple-darwin17.4.0/libxml2/2.9.4"
      libxslt_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.7.0.1/ports/x86_64-apple-darwin17.4.0/libxslt/1.1.29"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.4
      loaded: 2.9.4

Can you provide a self-contained script that reproduces what you're seeing?

#!/usr/bin/env ruby
# encoding: utf-8
require 'nokogiri'

1_000_000.times do |i|
  doc = Nokogiri::HTML::Document.new
  doc.internal_subset.remove
end

Here're the two leaks.

Leak: 0x7fddedc00250  size=16  zone: DefaultMallocZone_0x10cdff000
        Call stack: [thread 0x7fff98b0b380]: | 0x7fff60324015 (libdyld.dylib) start | 0x10cab9f3b (ruby) main | 0x10cb0d71d (libruby.2.4.dylib) ruby_run_node | 0x10cb0d7ec (libruby.2.4.dylib) ruby_exec_internal | 0x10cc02df4 (libruby.2.4.dylib) vm_exec | 0x10cbf7158 (libruby.2.4.dylib) vm_exec_core | 0x10cc06729 (libruby.2.4.dylib) vm_call_cfunc | 0x10cb52dbb (libruby.2.4.dylib) int_dotimes | 0x10cbff937 (libruby.2.4.dylib) rb_yield_1 | 0x10cc0c0a2 (libruby.2.4.dylib) invoke_block_from_c_splattable | 0x10cc02df4 (libruby.2.4.dylib) vm_exec | 0x10cbf7781 (libruby.2.4.dylib) vm_exec_core | 0x10cc06729 (libruby.2.4.dylib) vm_call_cfunc | 0x10cf24e2a (nokogiri.bundle) new | 0x10cfb4e99 (nokogiri.bundle) htmlNewDoc | 0x10cfb4c5d (nokogiri.bundle) htmlNewDocNoDtD | 0x10cf818c7 (nokogiri.bundle) xmlCreateIntSubset | 0x10d016e6a (nokogiri.bundle) xmlStrdup | 0x10d016d8d (nokogiri.bundle) xmlStrndup | 0x10cb22de2 (libruby.2.4.dylib) objspace_xmalloc0 | 0x7fff604cc4c7 (libsystem_malloc.dylib) malloc | 0x7fff604cd1e1 (libsystem_malloc.dylib) malloc_zone_malloc

Leak: 0x7fddedc00450  size=48  zone: DefaultMallocZone_0x10cdff000
        Call stack: [thread 0x7fff98b0b380]: | 0x7fff60324015 (libdyld.dylib) start | 0x10cab9f3b (ruby) main | 0x10cb0d71d (libruby.2.4.dylib) ruby_run_node | 0x10cb0d7ec (libruby.2.4.dylib) ruby_exec_internal | 0x10cc02df4 (libruby.2.4.dylib) vm_exec | 0x10cbf7158 (libruby.2.4.dylib) vm_exec_core | 0x10cc06729 (libruby.2.4.dylib) vm_call_cfunc | 0x10cb52dbb (libruby.2.4.dylib) int_dotimes | 0x10cbff937 (libruby.2.4.dylib) rb_yield_1 | 0x10cc0c0a2 (libruby.2.4.dylib) invoke_block_from_c_splattable | 0x10cc02df4 (libruby.2.4.dylib) vm_exec | 0x10cbf7781 (libruby.2.4.dylib) vm_exec_core | 0x10cc06729 (libruby.2.4.dylib) vm_call_cfunc | 0x10cf24e2a (nokogiri.bundle) new | 0x10cfb4e99 (nokogiri.bundle) htmlNewDoc | 0x10cfb4c5d (nokogiri.bundle) htmlNewDocNoDtD | 0x10cf819aa (nokogiri.bundle) xmlCreateIntSubset | 0x10d016e6a (nokogiri.bundle) xmlStrdup | 0x10d016d8d (nokogiri.bundle) xmlStrndup | 0x10cb22de2 (libruby.2.4.dylib) objspace_xmalloc0 | 0x7fff604cc4c7 (libsystem_malloc.dylib) malloc | 0x7fff604cd1e1 (libsystem_malloc.dylib) malloc_zone_malloc

This is causing problems for users of Nokogumbo (see rubys/nokogumbo#20).

@flavorjones
Copy link
Member

Thanks for reporting, will take a look.

@stevecheckoway
Copy link
Contributor Author

I believe the underlying issue is that libxml2 is going to strdup two strings and these aren't freed for some reason. I haven't investigated further.

@flavorjones
Copy link
Member

I'm just back from vacation and it will take a few days to catch up on everything. Thanks for your patience.

@stevecheckoway
Copy link
Contributor Author

I believe I figured out the issue but I just realized I don't have time to write and test a fix right this moment but I want to write down what I think is going on so I don't have to redo this investigation later.

Short version:
I think the fix is to modify dealloc_node_i from

static int dealloc_node_i(xmlNodePtr key, xmlNodePtr node, xmlDocPtr doc)
{
  switch(node->type) {
  case XML_ATTRIBUTE_NODE:
    xmlFreePropList((xmlAttrPtr)node);
    break;
  case XML_NAMESPACE_DECL:
    xmlFree(node);
    break;
  default:
    if(node->parent == NULL) {
      xmlAddChild((xmlNodePtr)doc, node);
    }
  }
  return ST_CONTINUE;
}

to something like

static int dealloc_node_i(xmlNodePtr key, xmlNodePtr node, xmlDocPtr doc)
{
  switch(node->type) {
  case XML_ATTRIBUTE_NODE:
    xmlFreePropList((xmlAttrPtr)node);
    break;
  case XML_NAMESPACE_DECL:
    xmlFree(node);
    break;
  case XML_DTD_NODE:
    xmlFreeDtd(node);
    break;
  default:
    if(node->parent == NULL) {
      xmlAddChild((xmlNodePtr)doc, node);
    }
  }
  return ST_CONTINUE;
}

That is, the DTD nodes need to be explicitly freed. (Honestly, I'm not sure I understand why xmlFreeNode isn't just called on each of these nodes and let libxml2 take care of the details.)

Long version:
As I understand it, when a XML::Node is unlinked, it is placed on an unlinked list which is eventually freed when the Document is garbage collected. The dealloc_node_i function given above is called for each node on the list. For a DTD node that's been freed, it will be added to the document via xmlAddChild.

Once all of the unlinked nodes are added back to the document, the document is freed with xmlFreeDoc. The relevant portion of xmlFreeDoc is the following.

    extSubset = cur->extSubset;
    intSubset = cur->intSubset;
    if (intSubset == extSubset)
	extSubset = NULL;
    if (extSubset != NULL) {
	xmlUnlinkNode((xmlNodePtr) cur->extSubset);
	cur->extSubset = NULL;
	xmlFreeDtd(extSubset);
    }
    if (intSubset != NULL) {
	xmlUnlinkNode((xmlNodePtr) cur->intSubset);
	cur->intSubset = NULL;
	xmlFreeDtd(intSubset);
    }

    if (cur->children != NULL) xmlFreeNodeList(cur->children);

Note that the internal and external subsets are explicitly freed, if they exist. By adding the DTD node to the document, it has merely placed it in the children list which should be freed by xmlFreeNodeList. Despite what you might think from the name, this doesn't just call xmlFreeNode on each node in the list. Instead, it switches on the type of the node and selectively frees parts. And the whole thing is guarded by if (cur->type != XML_DTD_NODE). Hence the leak.

One final point, I'm not sure why xmlFreePropList is being called for the XML_ATTRIBUTE_NODE case. xmlFreeNode uses xmlFreeProp((xmlAttrPtr) cur) for that case.

@flavorjones flavorjones modified the milestones: 1.8.5, next Nov 8, 2018
flavorjones pushed a commit that referenced this issue Dec 1, 2018
DTD nodes added to a document are not freed when the document is freed.

Fixes #1784
flavorjones added a commit that referenced this issue Dec 2, 2018
which is GCing removed DTD nodes.
@flavorjones
Copy link
Member

@stevecheckoway You're probably right that we could just call xmlFreeNode and move on. I've created #1826 to investigate a bit when I have more time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants