
[bug] Uptick of seg faults in Nokogiri v1.14.0 #2785

Closed · stanhu opened this issue Feb 4, 2023 · 19 comments
Labels: needs/more-info, topic/memory (Segfaults, memory leaks, valgrind testing, etc.)

stanhu commented Feb 4, 2023

Please describe the bug

We have a CI job that runs a number of rspec jobs. Since upgrading to Nokogiri v1.14.0, we have noticed an uptick in seg faults. In this example, the seg fault seems to have happened at the end of the test run (https://gitlab.com/gitlab-org/gitlab/-/jobs/3697213765):

<snip>
Finished in 32 minutes 18 seconds (files took 1 minute 22.6 seconds to load)
330 examples, 0 failures, 3 pending

Randomized with seed 40485

[TEST PROF INFO] Time spent in factories: 03:25.902 (10.43% of total time)
Failed to write to log, write log/workhorse-test.log: file already closed
[BUG] Segmentation fault at 0x0000000000000440
ruby 3.0.5p211 (2022-11-24 revision ba5cf0f7c5) [x86_64-linux]

-- Machine register context ------------------------------------------------
 RIP: 0x00007f5f4f202b03 RBP: 0x00007f5ec252b410 RSP: 0x00007f5ec252b3f0
 RAX: 0x0000000000000000 RBX: 0x00007f5ed6053358 RCX: 0x0000000000000001
 RDX: 0x00007f5f45cb19e0 RDI: 0x00007f5f1593a6a0 RSI: 0x0000000000000000
  R8: 0x00007f5ec252bca8  R9: 0x00007f5eb126f000 R10: 0x00007f5ea93fdca8
 R11: 0x00007f5ec1229ca8 R12: 0x0000000000000000 R13: 0x00007f5f1593a6a0
 R14: 0x0000000000000004 R15: 0x00007f5ec252cc18 EFL: 0x0000000000010206

-- C level backtrace information -------------------------------------------
/usr/local/lib/libruby.so.3.0(rb_print_backtrace+0x11) [0x7f5f4f3df59e] vm_dump.c:758
/usr/local/lib/libruby.so.3.0(rb_vm_bugreport) vm_dump.c:998
/usr/local/lib/libruby.so.3.0(rb_bug_for_fatal_signal+0xf8) [0x7f5f4f1dfb38] error.c:787
/usr/local/lib/libruby.so.3.0(sigsegv+0x55) [0x7f5f4f332645] signal.c:963
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7f5f4f0db140]
/usr/local/lib/libruby.so.3.0(ruby_sized_xfree+0xa) [0x7f5f4f202b03] gc.c:10929
/usr/local/lib/libruby.so.3.0(ruby_sized_xfree) gc.c:10926
/builds/gitlab-org/gitlab/vendor/ruby/3.0.0/gems/nokogiri-1.14.1-x86_64-linux/lib/nokogiri/3.0/nokogiri.so(xmlResetError+0x26) [0x7f5f45c48ad6]
/builds/gitlab-org/gitlab/vendor/ruby/3.0.0/gems/nokogiri-1.14.1-x86_64-linux/lib/nokogiri/3.0/nokogiri.so(0x7f5f45cb19f0) [0x7f5f45cb19f0]
/lib/x86_64-linux-gnu/libpthread.so.0(0x7cd1) [0x7f5f4f0cfcd1]
/lib/x86_64-linux-gnu/libpthread.so.0(0x7eba) [0x7f5f4f0cfeba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f5f4eb68a2f]
<snip>

The backtrace suggests nokogiri or libxml2 is calling xmlResetError(): https://github.com/GNOME/libxml2/blob/f507d167f1755b7eaea09fb1a44d29aab828b6d1/error.c#L873-L891.

The Ruby interpreter (v3.0.5) is patched, but line gc.c:10929 corresponds to the objspace_xfree call in https://github.com/ruby/ruby/blob/ba5cf0f7c52d4d35cc6a173c89eda98ceffa2dcf/gc.c#L10909:

void
ruby_sized_xfree(void *x, size_t size)
{
    if (x) {
	objspace_xfree(&rb_objspace, x, size);
    }
}

This might relate to the changes in #2480. I have to wonder whether an error is being allocated with malloc instead of ruby_xmalloc (but then freed through Ruby's allocator).
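
For illustration, the kind of mismatch being hypothesized looks roughly like this (a contrived sketch, not Nokogiri or libxml2 code; the function name is made up):

```c
/* Contrived sketch of an allocator mismatch (not actual Nokogiri/libxml2 code). */
#include <string.h>
#include <ruby.h>

static void
allocator_mismatch(void)
{
    char *msg = strdup("parse error");  /* allocated by the system allocator */
    ruby_xfree(msg);                    /* released through Ruby's GC-accounted allocator:
                                           mismatched and undefined behavior, which can
                                           surface as a segfault */
}
```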

Help us reproduce what you're seeing

We're not yet sure how to reproduce the seg fault. We're discussing the issue in https://gitlab.com/gitlab-org/gitlab/-/issues/390313.

Expected behavior

No seg faults.

Environment

  • Ruby 3.0.5
  • Nokogiri v1.14.1 (x86_64-linux)
# nokogiri -v
# Nokogiri (1.14.1)
    ---
    warnings: []
    nokogiri:
      version: 1.14.1
      cppflags:
      - "-I/usr/local/lib/ruby/gems/3.0.0/gems/nokogiri-1.14.1-x86_64-linux/ext/nokogiri"
      - "-I/usr/local/lib/ruby/gems/3.0.0/gems/nokogiri-1.14.1-x86_64-linux/ext/nokogiri/include"
      - "-I/usr/local/lib/ruby/gems/3.0.0/gems/nokogiri-1.14.1-x86_64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.0.5
      platform: x86_64-linux
      gem_platform: x86_64-linux
      description: ruby 3.0.5p211 (2022-11-24 revision ba5cf0f7c5) [x86_64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/usr/local/lib/ruby/gems/3.0.0/gems/nokogiri-1.14.1-x86_64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.10.3
      loaded: 2.10.3
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.37
      loaded: 1.1.37
    other_libraries:
      zlib: 1.2.13
      libgumbo: 1.0.0-nokogiri
stanhu added the state/needs-triage (Inbox for non-installation-related bug reports or help requests) label on Feb 4, 2023
flavorjones (Member) commented:

Thanks for reporting this, @stanhu! I'll make time to take a look in the next day or two.

flavorjones (Member) commented Feb 4, 2023

OK, the likely call stack here, when the Ruby process is exiting, is ...

# lldb call stack
  * frame #0: 0x00007ffff35bb610 libxml2.so.2`xmlResetError(err=0x0000555555f29ac8) at error.c:874:1
    frame #1: 0x00007ffff3612962 libxml2.so.2`xmlXPathFreeContext(ctxt=0x0000555555f299e0) at xpath.c:6170:5
    frame #2: 0x00007ffff7c81a5b libruby.so.3.1`finalize_list [inlined] run_final(zombie=140737280155200, objspace=0x000055555555d140) at gc.c:4015:2
    frame #3: 0x00007ffff7c81a4a libruby.so.3.1`finalize_list(objspace=0x000055555555d140, zombie=140737280155200) at gc.c:4034
    frame #4: 0x00007ffff7c86906 libruby.so.3.1`rb_objspace_call_finalizer at gc.c:4198:2
    frame #5: 0x00007ffff7c655b0 libruby.so.3.1`rb_ec_cleanup [inlined] rb_ec_finalize(ec=<unavailable>) at eval.c:164:5
    frame #6: 0x00007ffff7c65586 libruby.so.3.1`rb_ec_cleanup(ec=<unavailable>, ex0=<unavailable>) at eval.c:256
    frame #7: 0x00007ffff7c658f4 libruby.so.3.1`ruby_run_node(n=0x00007ffff3a52710) at eval.c:321:12
    frame #8: 0x000055555555517f ruby`main(argc=<unavailable>, argv=<unavailable>) at main.c:47:9

# gdb call stack
#0  xmlResetError (err=err@entry=0x5555561d0428) at error.c:874
#1  0x00007ffff3612962 in xmlXPathFreeContext (ctxt=0x5555561d0340) at xpath.c:6170
#2  0x00007ffff7c81a5b in run_final (zombie=140737280153880, objspace=0x55555555d140) at gc.c:4015
#3  finalize_list (objspace=objspace@entry=0x55555555d140, zombie=140737280153880) at gc.c:4034
#4  0x00007ffff7c86906 in rb_objspace_call_finalizer (objspace=0x55555555d140) at gc.c:4198
#5  0x00007ffff7c655b0 in rb_ec_finalize (ec=0x55555555d8e0) at vm_core.h:1856
#6  rb_ec_cleanup (ec=ec@entry=0x55555555d8e0, ex0=<optimized out>) at eval.c:256
#7  0x00007ffff7c658f4 in ruby_run_node (n=0x7ffff3a52670) at eval.c:321
#8  0x000055555555517f in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:47

which is essentially a final major GC cycle in which the XPath query context objects are being swept.

I can't reproduce what you're seeing, but it seems like a reasonable guess that it might be the change in #2480. Since I can't reproduce it, I may need to ask you to do some work here. If you revert that commit (which won't revert cleanly), or alternatively use Nokogiri from, say, 9f080d0, do you still see these crashes?

(Try bundling with gem 'nokogiri', git: "https://github.com/sparklemotion/nokogiri", ref: "9f080d0d3aaddf3aae439baf259749501860f028")

I'll continue to try to reproduce what you're seeing.

mttkay commented Feb 6, 2023

Thanks, @flavorjones! Another suspect I had in mind was this commit, which added -O2: ece08bb

This can cause problems with GC when VALUE pointers are optimized away but need to remain on the stack for the GC to see that an object is still alive. These references may now have to be explicitly marked via RB_GC_GUARD: https://docs.ruby-lang.org/en/3.0/extension_rdoc.html#label-Appendix+E.+RB_GC_GUARD+to+protect+from+premature+GC
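
For context, the idiom looks like this (a minimal sketch with a hypothetical function, not pointing at any specific Nokogiri code path):

```c
/* Minimal sketch of the RB_GC_GUARD idiom (hypothetical function). */
#include <ruby.h>

static VALUE
guard_example(VALUE self)
{
    VALUE str = rb_str_new_cstr("some document");
    const char *ptr = RSTRING_PTR(str);   /* raw pointer into the String's buffer */

    /* With -O2, the compiler may decide `str` is dead after the line above and
     * reuse its stack slot; the conservative GC could then free the String
     * while `ptr` is still being read during the allocation below. */
    VALUE copy = rb_str_new_cstr(ptr);

    RB_GC_GUARD(str);                     /* forces `str` to stay live until this point */
    return copy;
}
```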

flavorjones (Member) commented:

@mttkay Thanks for the suggestion. That is another possibility. Still looking for a repro.

flavorjones (Member) commented:

@stanhu Can you speak to the nature of the patches you're applying to your version of Ruby? Can you reproduce this with an unpatched version?

stanhu (Author) commented Feb 7, 2023

@flavorjones I haven't been able to reproduce the problem myself yet. I'm trying to enable more debug symbols and recompile Nokogiri to see if we can get more data.

The patches pull in ruby/ruby#3978.

flavorjones (Member) commented:

@stanhu OK, going to build a Ruby with that patch applied and see if I have better luck reproducing.

flavorjones (Member) commented:

@stanhu Still unable to reproduce, even when running under valgrind. I could really use some help pinning this down if you're able to do that easily in your CI environment.

flavorjones added the needs/more-info and topic/memory labels and removed the state/needs-triage label on Feb 7, 2023
stanhu (Author) commented Feb 8, 2023

@flavorjones I'm having a hard time reproducing this as well. I've been adding -ggdb3 and -fno-omit-frame-pointer, and the problem hasn't surfaced for me, so I'll try dropping those flags to see if that makes a difference. As I recall, I've seen some Ruby crashes mysteriously go away when -fno-omit-frame-pointer is present, presumably because it can keep the compiler from optimizing away a value on the stack. We might want to consider adding that flag anyway, since it's generally a good one to have enabled for profiling (https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer).

flavorjones (Member) commented:

@stanhu If you can reliably find this in CI, can I ask you to bisect a bit by pinning to commits in your Gemfile? Also, can you confirm that the segfault always happens during process cleanup (and not during runtime)?

flavorjones (Member) commented:

@stanhu @mttkay were you able to narrow this down at all?

mttkay commented Feb 14, 2023

I have not. I time-boxed this investigation and was not able to come up with a repro either.

flavorjones (Member) commented Mar 1, 2023

I haven't had other reports of this, and to my knowledge we haven't been able to reproduce it. Is there more evidence (like valgrind output) that pinpoints the issue to Nokogiri?

stanhu (Author) commented Mar 1, 2023

We only saw one seg fault relating to Nokogiri. I'll close this for now. Thanks for your help.

stanhu closed this as completed on Mar 1, 2023
flavorjones reopened this on Apr 3, 2023
flavorjones (Member) commented:

Reopening this to investigate a bit more. I have a hunch about threading and atexit cleanup I want to chase.

flavorjones (Member) commented:

OK, my hunch was right. Here's the analysis of what's going on.

some context

We've seen "at exit" segfaults from libxml2 before. A good analysis of this class of bug is available at #2059 but the summary is:

  • if Nokogiri configured libxml2 to use Ruby's memory methods (ruby_xmalloc et al; see the sketch after this list)
  • and libxml2 invoked an atexit handler to free memory
  • and that atexit handler was called after Ruby's VM had been torn down
  • then ruby_xfree would access invalid memory and potentially segfault
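
As a rough sketch of what that first bullet means (illustrative only, not Nokogiri's exact code; the function names here are hypothetical), an extension can point libxml2 at Ruby's GC-aware allocator via xmlMemSetup, after which every later xmlFree -- including ones run from atexit or thread-exit callbacks -- ends up in ruby_xfree:

```c
/* Illustrative sketch: route libxml2 allocations through Ruby's allocator.
 * Function names (my_xstrdup, use_ruby_memory_for_libxml2) are hypothetical. */
#include <string.h>
#include <libxml/xmlmemory.h>
#include <ruby.h>

static char *
my_xstrdup(const char *str)
{
    size_t len = strlen(str) + 1;
    char *copy = ruby_xmalloc(len);     /* GC-accounted allocation */
    memcpy(copy, str, len);
    return copy;
}

static void
use_ruby_memory_for_libxml2(void)
{
    /* Must run before any other libxml2 call; from here on, xmlFree()
     * anywhere inside libxml2 resolves to ruby_xfree(). */
    xmlMemSetup(ruby_xfree, ruby_xmalloc, ruby_xrealloc, my_xstrdup);
}
```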

That earlier atexit issue had already been fixed when this issue was opened, so I didn't think too hard about the fact that this crash only seemed to happen "at exit".

the hunch

Over this past weekend, after we reopened this issue and https://gitlab.com/gitlab-org/gitlab/-/issues/390313, I noticed two things that I hadn't seen before:

  • the stack backtrace was really shallow
  • the top stack level was libc:clone, and the next few levels were from libpthread

Combined with the knowledge that this segfault only ever happens after the test suite has completed ("at exit"), this got me thinking about libxml2's thread support and whether libxml2 was similarly trying to clean up some per-thread memory "at exit".

the mechanics

I'll skip the mystery tour and get right to the diagnosis.

If you look for calls to xmlResetError in libxml2/threads.c, you'll see a likely candidate callsite:

https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/threads.c#L468-483

/**
 * xmlFreeGlobalState:
 * @state:  a thread global state
 *
 * xmlFreeGlobalState() is called when a thread terminates with a non-NULL
 * global state. It is is used here to reclaim memory resources.
 */
static void
xmlFreeGlobalState(void *state)
{
    xmlGlobalState *gs = (xmlGlobalState *) state;

    /* free any memory allocated in the thread's xmlLastError */
    xmlResetError(&(gs->xmlLastError));
    free(state);
}

What this function is doing is freeing up some "thread local" storage containing the most recent error that the parser encountered (see libxml2/globals.[ch] for a peek at how this is done).

This function is invoked on Linux at thread exit. It's a callback passed to pthread_key_create:

https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/threads.c#L776

    pthread_key_create(&globalkey, xmlFreeGlobalState);
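
For anyone unfamiliar with that mechanism, here's a minimal standalone demo (not libxml2 code) of how a pthread key destructor fires when a thread exits:

```c
/* Standalone demo of pthread TLS destructors (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_key_t key;

static void
destructor(void *value)
{
    /* Runs automatically when a thread that set the key exits --
     * the same hook libxml2 uses to call xmlFreeGlobalState. */
    printf("thread exiting, freeing %p\n", value);
    free(value);
}

static void *
worker(void *arg)
{
    (void)arg;
    pthread_setspecific(key, malloc(32));  /* stash some per-thread state */
    return NULL;                           /* destructor fires after this returns */
}

int
main(void)
{
    pthread_t t;
    pthread_key_create(&key, destructor);
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```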

After a little bit of playing around, I was able to construct a reproduction. Here's what it does:

  1. spawn a thread that calls Nokogiri to invoke libxml2's parser (either XML or HTML, doesn't matter which)
  2. ... and the parser must find at least one error in the document, which gets saved in that "thread local" storage
  3. wait 3 seconds (which is the amount of time an idle Ruby thread hangs around before exiting, see ruby/thread_pthread.c and look for THREAD_CACHE_TIME)
  4. call Kernel.exit

If you're lucky (or unlucky), the xmlFreeGlobalState callback will be invoked in the slim time interval between Ruby tearing down the VM and the process finally exiting. If this happens, you will see a segfault like this:

[BUG] Segmentation fault at 0x0000000000000440
ruby 3.0.5p211 (2022-11-24 revision 3769593990) [x86_64-linux]

-- Machine register context ------------------------------------------------
 RIP: 0x00007fcd3acf304e RBP: 0x00007fcc4c026a20 RSP: 0x00007fcb21b90de0
 RAX: 0x0000000000000000 RBX: 0x00007fcc4c026d78 RCX: 0x0000000000000031
 RDX: 0x00007fcd368d6220 RDI: 0x00007fcc4c025ee0 RSI: 0x0000000000000000
  R8: 0x00007fcb21b90de4  R9: 0x00000000000000ca R10: 0x0000000000000000
 R11: 0x0000000000000246 R12: 0x00007fcc4c025ee0 R13: 0x00007fcd3aa1bae8
 R14: 0x0000000000000004 R15: 0x00007fcb21b91b58 EFL: 0x0000000000010202

-- C level backtrace information -------------------------------------------
SEGV received in SEGV handler

the repro

Here it is:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

html = "<div foo='asdf>asdf</div>" # needs to have errors in it!

thread_spawn_window = 0.5 # seconds
wait_time = 3.0 - (thread_spawn_window / 2) # idle ruby threads exit after 3 seconds

start_time = Time.now

threads = []
while (Time.now - start_time) < thread_spawn_window do
  threads << Thread.new { Nokogiri::HTML4::Document.parse(html) }
end
threads.take(5).map(&:join)

sleep wait_time

On my dev machine I can usually reproduce this issue within 5-10 iterations.

You can increase your chances of triggering this by injecting the following patch into Ruby:

diff --git a/eval.c b/eval.c
index adacde9e..9965453b 100644
--- a/eval.c
+++ b/eval.c
@@ -271,6 +271,8 @@ rb_ec_cleanup(rb_execution_context_t *ec, enum ruby_tag_type ex)
 
     if (signaled) ruby_default_signal(signaled);
 
+    sleep(1);
+
     return sysex;
 }
 

which keeps the process alive for a second after the VM has been torn down.

recap

OK, to summarize again what's happening here:

  • Gitlab's test suite is spawning threads which parse XML or HTML documents that have errors in them
  • and then, sometimes, a thread cleans up its libxml2 thread-local storage by calling ruby_xfree after the Ruby VM has been torn down

This is exceedingly unlikely to happen, but GitLab's test suite and CI environment must have hit a sweet spot where the timing was just right to increase the chances of hitting it. Potentially, Nokogiri 1.14.0's changes to GC may even have contributed to these timing changes.

how to fix this

There are a couple of potential approaches that come to mind, and I'm not sure yet which one is best.

1. stop configuring libxml2 to use Ruby's memory management methods

We've actually been discussing making this change for performance reasons!

The tradeoff discussed in that issue is:

  • get a performance improvement by not informing Ruby about each individual memory allocation
  • which may lead to memory bloat due to not triggering the GC as often

@larskanis has suggested a potential path forward on this approach in this comment.

2. patch libxml2 to not invoke this callback after the Ruby VM has been torn down

I'm not actually sure how to do this, and the patch seems likely to be awkwardly complex. And this would still be a problem for users running Nokogiri with an unpatched libxml2 system library.

3. patch libxml2 to not save errors in thread-local memory

We don't actually need xmlLastError to implement Nokogiri's API -- I have been lazily ripping it out in favor of using the structured error handlers anyway -- so we could just rip the bandaid off and patch libxml2 to stop storing this error.
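
For reference, the structured-error-handler approach looks roughly like this (a sketch against plain libxml2, not Nokogiri's actual implementation; note that in libxml2 >= 2.12 the callback receives a const xmlError *):

```c
/* Sketch: collect parse errors via a structured error callback instead of
 * relying on the thread-local xmlLastError. Not Nokogiri's actual code. */
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/xmlerror.h>

static void
on_structured_error(void *userData, xmlErrorPtr error)
{
    (void)userData;
    fprintf(stderr, "libxml2 error at line %d: %s", error->line, error->message);
}

static void
parse_with_handler(const char *buf, int len)
{
    xmlSetStructuredErrorFunc(NULL, on_structured_error);  /* install handler */
    xmlDocPtr doc = xmlReadMemory(buf, len, "noname.xml", NULL, XML_PARSE_RECOVER);
    if (doc != NULL)
        xmlFreeDoc(doc);
    xmlSetStructuredErrorFunc(NULL, NULL);                  /* restore default */
}
```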

But again, this would still be a problem for users running Nokogiri with an unpatched libxml2 system library.

4. hybrid approach

We could do a combination:

  • for users of vendored libxml2: patch libxml2 to not save errors in thread-local memory
  • for users of system libxml2: stop configuring libxml2 to use Ruby's memory management methods

...

I think I'd like to try either 1 or 4, but I need to play with it a bit to be sure.

flavorjones (Member) commented:

I've submitted a patch upstream to Ruby to try to address this within the interpreter.

I'm also working on a change to Nokogiri to allow users to control the memory management model used by libxml2 via an environment variable.

flavorjones (Member) commented:

OK, I've merged #2843 -- that will be in v1.15 which I hope to ship in the next few weeks.

maxlazio pushed a commit to gitlabhq/omnibus-gitlab that referenced this issue Jun 16, 2023
As sparklemotion/nokogiri#2785 (comment)
explains, there is a bug in the Ruby interpreter
(https://bugs.ruby-lang.org/issues/19580) that has been fixed upstream
(ruby/ruby#7663) that causes a seg fault
during shutdown with libxml2/Nokogiri.

We patched the Ruby interpreter in CI to work around the problem
(https://gitlab.com/gitlab-org/gitlab-build-images/-/merge_requests/672)
in https://gitlab.com/gitlab-org/gitlab/-/issues/390313, but it
appears the seg faults have appeared in production now. On GitLab.com,
this week we have seen more than 20 cases with the error:

```
[BUG] Segmentation fault at 0x0000000000000440
```

We could also work around this problem by setting
`NOKOGIRI_LIBXML_MEMORY_MANAGEMENT=default`, but this may cause
unexpected memory growth since Ruby no longer manages the memory (see
https://github.com/sparklemotion/nokogiri/pull/2843/files).

Let's just fix the interpreter since we'd also need to make sure that
environment variable is set in any environment that uses Nokogiri
(including Rake tasks).

Changelog: fixed