
segfault in node_set.rb #1952

Closed
ahorek opened this issue Dec 2, 2019 · 15 comments · Fixed by #2186

Labels: topic/memory (Segfaults, memory leaks, valgrind testing, etc.)

Comments

@ahorek commented Dec 2, 2019

Describe the bug

/home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node_set.rb:238: [BUG] Segmentation fault at 0x000000000a296680
ruby 2.7.0dev (2019-11-28T14:49:28Z trunk b5fbefbf2c) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0011 p:---- s:0048 e:000047 CFUNC  :[]
c:0010 p:0006 s:0043 e:000042 BLOCK  /home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node_set.rb:238 [FINISH]
c:0009 p:---- s:0039 e:000038 CFUNC  :upto
c:0008 p:0019 s:0034 e:000033 METHOD /home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node_set.rb:237
c:0007 p:0037 s:0030 e:000029 METHOD /home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node.rb:261
c:0006 p:0036 s:0025 e:000023 BLOCK  seg.rb:13 [FINISH]
c:0005 p:---- s:0021 e:000020 CFUNC  :times
c:0004 p:0055 s:0017 e:000016 BLOCK  seg.rb:12 [FINISH]
c:0003 p:---- s:0010 e:000009 CFUNC  :times
c:0002 p:0014 s:0006 e:000005 EVAL   seg.rb:7 [FINISH]
c:0001 p:0000 s:0003 E:002200 (none) [FINISH]

-- Ruby level backtrace information ----------------------------------------
seg.rb:7:in `<main>'
seg.rb:7:in `times'
seg.rb:12:in `block in <main>'
seg.rb:12:in `times'
seg.rb:13:in `block (2 levels) in <main>'
/home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node.rb:261:in `children='
/home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node_set.rb:237:in `each'
/home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node_set.rb:237:in `upto'
/home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node_set.rb:238:in `block in each'
/home/ahorek/.rvm/gems/ruby-head/gems/nokogiri-1.10.5/lib/nokogiri/xml/node_set.rb:238:in `[]'
...

To Reproduce

require 'nokogiri'

def roundup(num, nearest = 10)
  num % nearest == 0 ? num : num + nearest - (num % nearest)
end

1000.times do |i|
  response = File.read('xml.xml')
  response2 = response.dup
  x = Nokogiri::XML.parse(response)
  x.root.content = ''
  140.times do
    x.root.children += Nokogiri::XML.parse(response2).root.children
  end
  puts i
end
puts 'ok!'

XML input:
https://github.com/ahorek/nokogiri/blob/bug/bug/xml.xml

Expected behavior
no segfaults

Environment
latest nokogiri-1.10.5

ruby 2.7.0dev (2019-11-28T14:49:28Z trunk b5fbefbf2c) [x86_64-linux]

Tested on multiple Ruby versions (2.5 through 2.7), on Linux and Windows.

The script doesn't fail on JRuby.

@flavorjones (Member) commented:

Hi, thank you for reporting this. I'll attempt to reproduce and diagnose as soon as I can (hopefully today).

@flavorjones (Member) commented:

OK, I've reproduced using your script and XML. Currently trying to reproduce that while running under Valgrind to get more precise information about the operation causing the segfault.

@flavorjones (Member) commented:

OK - I captured the segfault in valgrind and have a nice stack walkback. Will dig in.

@flavorjones (Member) commented Dec 3, 2019

Here's the top of the valgrind output, cleaned up a bit for clarity:

==27690== Invalid read of size 4
==27690==    at 0xB37CED0: Nokogiri_wrap_xml_node_set_node (xml_node_set.c:424)
==27690==    by 0xB37D584: index_at (xml_node_set.c:236)
==27690==    by 0xB37D584: slice (xml_node_set.c:312)
==27690==    by 0x50924E9: vm_call_cfunc_with_frame (vm_insnhelper.c:1908)
==27690==    by 0x50924E9: vm_call_cfunc (vm_insnhelper.c:1924)
==27690==    by 0x509BEF1: vm_exec_core (insns.def:765)
==27690==    by 0x50A290E: rb_vm_exec (vm.c:1885)
...
==27690==  Address 0xee0e5b8 is 8 bytes inside a block of size 120 free'd
==27690==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27690==    by 0x4F19EDC: objspace_xfree (gc.c:8309)
==27690==    by 0x4F19EDC: ruby_sized_xfree (gc.c:8405)
==27690==    by 0x4F19EDC: ruby_xfree (gc.c:8412)
==27690==    by 0xB3CE9EB: xmlFreeNodeList (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3CAA11: xmlFreeDoc (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0x4F10BF0: run_final (gc.c:2917)
==27690==    by 0x4F10BF0: finalize_list (gc.c:2936)
==27690==    by 0x4F10CE7: finalize_deferred (gc.c:2956)
==27690==    by 0x4F10CE7: gc_finalize_deferred (gc.c:2965)
...
==27690==  Block was alloc'd at
==27690==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27690==    by 0x4F1C8F3: objspace_xmalloc0 (gc.c:8134)
==27690==    by 0xB4C2144: xmlSAX2TextNode (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB4C3BBF: xmlSAX2Characters (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3AB1E4: xmlParseCharData (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3BB282: xmlParseContent (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3BB3D1: xmlParseElement (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3BDD86: xmlParseDocument (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3C78F9: xmlDoRead (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3C7A90: xmlReadMemory (in /home/flavorjones/.rvm/gems/ruby-2.6.2/gems/nokogiri-1.10.5/lib/nokogiri/nokogiri.so)
==27690==    by 0xB3785E4: read_memory (xml_document.c:294)
==27690==    by 0x50924E9: vm_call_cfunc_with_frame (vm_insnhelper.c:1908)
==27690==    by 0x50924E9: vm_call_cfunc (vm_insnhelper.c:1924)
==27690==    by 0x509BEF1: vm_exec_core (insns.def:765)
==27690==    by 0x50A290E: rb_vm_exec (vm.c:1885)
...

I need to investigate, but essentially what this is telling me is that we still have an active pointer to memory that was free()ed by libxml2. And that's bad.

flavorjones self-assigned this Dec 4, 2019

@flavorjones (Member) commented:

Sorry for the lack of updates - this is high priority, but I haven't been able to spend much time on it in the past week. (The difficulty in quickly/reliably reproducing it is an additional challenge.)

flavorjones added the topic/memory label Feb 2, 2020

@samsonnguyen commented:

I've encountered this exact same issue with nokogiri 1.10.9 and Ruby 2.6.3.

@flavorjones (Member) commented:

@samsonnguyen I'm sorry you're experiencing this issue!

Are you able to easily reproduce this? It would greatly help if I had a repro that I could use to get faster feedback loops.

@flavorjones (Member) commented:

@ahorek Thanks for the PR with the repro, I'll dig in.

@stayhero commented:

I have come across a segmentation fault which I think is related to this issue. It happens when I merge two nodesets like this:

require "nokogiri"

XML_PAYLOAD = <<-EOF 
    <?xml version="1.0" encoding="UTF-8"?>
    <container>
    </container>
EOF

class ParseAndMerge
    def merge_items
        nodeset1 = Nokogiri::XML(XML_PAYLOAD).xpath("//container")
        nodeset2 = Nokogiri::XML(XML_PAYLOAD).xpath("//container")
        nodeset1 + nodeset2
    end
end

GC.stress = true 
100.times do 
    p = ParseAndMerge.new
    merged_nodeset = p.merge_items
    puts merged_nodeset.to_s
end 

Happens with 1.10.9, 1.10.10, 1.11.0.rc3. nokogiri -v:

# Nokogiri (1.11.0.rc3)
    ---
    warnings: []
    nokogiri: 1.11.0.rc3
    ruby:
      version: 2.6.6
      platform: x86_64-darwin19
      gem_platform: x86_64-darwin-19
      description: ruby 2.6.6p146 (2020-03-31 revision 67876) [x86_64-darwin19]
      engine: ruby
    libxml:
      source: packaged
      patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      compiled: 2.9.10
      loaded: 2.9.10
    libxslt:
      source: packaged
      patches: []
      compiled: 1.1.34
      loaded: 1.1.34

It does not happen when I change the above code to prevent Ruby from garbage collecting nodeset2 (the right-hand side of nodeset1.+(nodeset2)), like this:

require "nokogiri"

XML_PAYLOAD = <<-EOF 
    <?xml version="1.0" encoding="UTF-8"?>
    <container>
    </container>
EOF

$nodeset_list = []
class ParseAndMerge
    def merge_items
        nodeset1 = Nokogiri::XML(XML_PAYLOAD).xpath("//container")
        nodeset2 = Nokogiri::XML(XML_PAYLOAD).xpath("//container")
        $nodeset_list << [nodeset2] # Fix: Prevent ruby from garbage collecting nodeset2
        nodeset1 + nodeset2
    end
end

GC.stress = true 
100.times do 
    p = ParseAndMerge.new
    merged_nodeset = p.merge_items
    puts merged_nodeset.to_s
end 

@flavorjones (Member) commented:

@stayhero Thanks for reporting your issue. I don't have enough information yet to determine if you're seeing the same problem or not. Hoping to be able to dedicate some time to this memory issue soon.

@flavorjones (Member) commented:

@stayhero Sorry for the delay in giving you a meaningful response. After digging into the example you've given, I'm confident that this is the same bug. Thanks so much for providing this second example which has helped illuminate what's going on.

@flavorjones (Member) commented:

I'm going to try to explain how GC works with libxml2's data structures, what's going wrong in these examples to cause a segfault, and some ideas for resolving the problems.

How GC works for libxml2 data structures

Garbage collection ("GC") of libxml2 data structures is controlled by the Document object's lifecycle -- the libxml2 function xmlFreeDoc frees all the node data structures owned by the document, and so we take some pains to make sure that Document and its related Node objects are all GCed at the same time.

The Document object keeps a reference to every Node object in a "node cache", even if the node has been unparented from the DOM tree. Every Node object keeps a reference back to its parent Document in the @document attribute. So if either the Document or any of its Nodes is still referenced, then their mutual references to each other keep them from being GCed.

Additionally, a NodeSet keeps a reference to its Document -- this is a libxml2 implementation detail that doesn't have any major implications for the Nokogiri user (primarily, this reference is used for the module decorators, which allow library users to extend the Node and NodeSet interfaces). But it's worth noting that if a NodeSet is still referenced, then its reference to the Document keeps that Document alive.

Notably, what NodeSet does not do is mark each of the nodes it contains during the GC "mark" phase. This may be surprising to you if you know how all the standard Ruby collection classes work. This behavior was adopted for NodeSet because we assume that it only contains Nodes belonging to the same Document as itself, and so marking is unnecessary and would only slow GC.
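
To make these relationships concrete, here is a minimal illustrative sketch (not code from this thread; it only uses public Nokogiri API):

require "nokogiri"

doc  = Nokogiri::XML("<root><a/></root>")
node = doc.at_xpath("//a")
node.unlink                                # remove the node from the DOM tree

node.document.equal?(doc)                  # => true: the Node still points back at its Document
doc.xpath("//root").document.equal?(doc)   # => true: a NodeSet also references its Document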

What's Going Wrong

In these examples, we have two sets of Nodes, each belonging to a different parent Document and contained by two separate NodeSets. The NodeSets are merged using #+, which invokes the set_union function in xml_node_set.c. This function creates a new xmlNodeSet associated with the Document of the lhs NodeSet, and adds the Nodes from the rhs into it. So now we have a NodeSet that references only Document A but contains Nodes from both Document A and Document B.

I mentioned above that NodeSet does not mark the Nodes it contains because it assumes they all belong to the same Document as itself, and so marking is unnecessary. In these examples, that's an invalid assumption.

In these examples, when GC kicks in, nothing belonging to Document B is marked -- there are no references to Document B or its NodeSet, and the only references to its Nodes are held by the NodeSet belonging to Document A, which does not mark them during GC. As a result, Document B, NodeSet B, and all of Document B's nodes are mistakenly considered "no longer referenced" and are GCed and freed.

Later, we try to print out the contents of the NodeSet and dereference invalid pointers, causing a segfault.
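
Here is a minimal sketch of that state, based on the explanation above (illustrative only, not code from this thread):

require "nokogiri"

doc_a = Nokogiri::XML("<container><item/></container>")
doc_b = Nokogiri::XML("<container><item/></container>")

merged = doc_a.xpath("//item") + doc_b.xpath("//item")

merged.document.equal?(doc_a)      # => true: the merged set references only Document A
merged.map(&:document).uniq.size   # => 2:    but its nodes come from two Documents
# If doc_b is not referenced anywhere else, nothing marks it or its nodes during GC,
# so they can be freed while `merged` still holds pointers to them.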

Thoughts on Resolution

Two solutions come to mind:

  1. ensure that a NodeSet only contains Nodes parented by its Document, duplicating a Node when necessary to ensure this property
  2. ensure that NodeSets mark their contents, similar to Array and other standard Ruby container classes.

The disadvantages of option (2) are:

  • in edge cases it has the potential to keep references to many Documents alive, preventing them -- and any other Nodes associated with them -- from being GCed (illustrated in the sketch below)
  • GC will be slower for NodeSet than it is currently

which leads me to conclude that option (1) is the better choice at this time.
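
As a hypothetical illustration of that first disadvantage (the scenario and numbers are invented; it only shows the shape of the problem):

require "nokogiri"

# A long-lived set accumulating nodes from many short-lived documents.
# If NodeSet marks its contents, every temporary Document below stays reachable
# (via each node's back-reference to its Document) for as long as `pinned` lives,
# even though nothing else needs those documents any more.
pinned = Nokogiri::XML::NodeSet.new(Nokogiri::XML("<root/>"))
1_000.times do
  doc = Nokogiri::XML("<records><r/></records>")
  pinned += doc.xpath("//r")
end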

And a Crazy Idea

However, there is a third option which I'd like to explore:

  3. rewrite the NodeSet class to be a subclass of Array, and to be independent of any specific Document (sketched below)

which would greatly simplify the implementation, but also bring with it the disadvantages of option (2) above.
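
A purely hypothetical sketch of what that could look like (the class name and design are illustrative only; this is not Nokogiri's API or the work eventually tracked in #2184):

require "nokogiri"

class ArrayBackedNodeSet < Array
  # A union is plain Array concatenation with de-duplication. Because an Array
  # marks its elements during GC, each node (and, through its back-reference,
  # its Document) stays reachable -- no per-Document bookkeeping is needed.
  def |(other)
    self.class.new.concat(to_a | other.to_a)
  end
end

set_a  = ArrayBackedNodeSet.new.concat(Nokogiri::XML("<r><a/></r>").xpath("//a").to_a)
set_b  = ArrayBackedNodeSet.new.concat(Nokogiri::XML("<r><a/></r>").xpath("//a").to_a)
merged = set_a | set_b   # holds nodes from two documents without per-document bookkeeping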

Next Steps

I'm going to implement (1), but also create a separate issue to drive exploration of (3).

@flavorjones (Member) commented:

I've opened an issue for exploration of the crazy idea at #2184. I'll work on a less risky fix for this bug this week.

@flavorjones (Member) commented:

Blurgh, option (1) above breaks the semantics of the NodeSet in that the duplicate we'd create would be unparented from the original document -- which is not going to make any sense for users who end up using NodeSets to collect the results of multiple queries across documents.

I think we need to implement option (2) and accept a slightly slower mark phase. On the plus side, this is exactly what NodeSet-as-subclass-of-Array would do if we decide to eventually reimplement the class.
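
For intuition, here is a rough Ruby-level analogy of what option (2) buys us (the actual fix is a C-level mark callback; this sketch just uses plain Arrays, which already mark their elements):

require "nokogiri"

nodes_a = Nokogiri::XML("<container><item/></container>").xpath("//item").to_a
nodes_b = Nokogiri::XML("<container><item/></container>").xpath("//item").to_a

merged = nodes_a + nodes_b   # a plain Array: GC marks its elements, so each node --
                             # and, via its back-reference, each parent Document --
                             # stays reachable for as long as `merged` does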

flavorjones added a commit that referenced this issue Feb 2, 2021
Specifically, this means we've added a mark callback for the NodeSet
which marks each of the contained objects.

Previously we skipped explicitly marking the contained objects due to
the assumption that all the nodes would be from the same document as
the NodeSet itself.

Fixes #1952.
@flavorjones (Member) commented:

PR created at #2186.

flavorjones added a commit that referenced this issue Feb 5, 2021

flavorjones added a commit that referenced this issue Feb 5, 2021

flavorjones added a commit that referenced this issue Mar 21, 2021
Since fixing #1952 I've wanted to revisit the valgrind suppressions to
see what's left. These suppressions represent what I saw in docker
images on my dev machine:

- on Ruby 2.7 and 3.0 startup (iseq_peephole_optimize)
- enumerators seem to confuse Valgrind (mark_locations_array/gc_mark_stacked_objects)

ci: add 2.7 valgrind suppressions to ignore enumerator warnings

flavorjones added a commit that referenced this issue Mar 21, 2021

flavorjones added a commit that referenced this issue Mar 21, 2021

flavorjones added a commit that referenced this issue Mar 21, 2021

flavorjones added a commit that referenced this issue Mar 22, 2021
Since fixing #1952 I've wanted to revisit the valgrind suppressions to
see what's left. I'm leaving only the suppressions I see for Ruby 2.7
and 3.0 startup (iseq_peephole_optimize).
flavorjones added a commit that referenced this issue Mar 22, 2021
…or-in-valgrind

ci: skip Nodeset enumerator test in valgrind

---

Since fixing #1952 I've wanted to revisit the valgrind suppressions to see what's left. I'm leaving only the suppressions I see for Ruby 2.7 and 3.0 startup (iseq_peephole_optimize), and I'm skipping the NodeSet enumerator test since I know it confuses valgrind.