[bug] Uptick of seg faults in Nokogiri v1.14.0 #2785
Thanks for reporting this, @stanhu! I'll make time to take a look in the next day or two. |
OK, the likely call stack here is, when the Ruby process is exiting ...
which is essentially a final major GC cycle, and XPath query context objects are being swept. I can't reproduce what you're seeing, but it seems like a reasonable guess that it might be the change in #2480. Since I can't reproduce it, I may need to ask you to do some work here: if you revert that commit (which won't revert cleanly), or alternatively use Nokogiri from, say, 9f080d0, are you able to still see these crashes happening? I'll continue to try to reproduce what you're seeing. |
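A sketch of what that Gemfile pin could look like, assuming the standard Nokogiri repository URL; the ref shown is the commit mentioned above, and you'd adjust it for each bisection step:

```ruby
# Gemfile -- sketch of pinning Nokogiri to a specific upstream commit
# so you can test whether a given change introduced the crashes.
source "https://rubygems.org"

gem "nokogiri", git: "https://github.com/sparklemotion/nokogiri", ref: "9f080d0"
```

Note that installing from a git source compiles the C extension locally, so the build environment needs the usual native toolchain.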
Thanks @flavorjones! Another suspicion I had was this commit, which added … This can cause problems with GC when … |
@mttkay Thanks for the suggestion. That is another possibility. Still looking for a repro. |
@stanhu Can you speak to the nature of the patches you're applying to your version of Ruby? Can you reproduce this with an unpatched version? |
@flavorjones I haven't been able to reproduce the problem myself yet. I'm trying to enable more debug symbols and recompiling Nokogiri to see if we can get more data. The patches pull in ruby/ruby#3978. |
@stanhu OK, going to build a Ruby with that patch applied and see if I have better luck reproducing. |
@stanhu Still unable to reproduce, even when running under valgrind. Could really use some help to pin this down if you're able to do it in your CI environment easily. |
@flavorjones I'm having a hard time reproducing this as well. I'm attempting to add |
@stanhu If you can reliably find this in CI, can I ask you to bisect a bit by pinning to commits in your Gemfile? Also, can you confirm that the segfault always happens during process cleanup (and not during runtime)? |
I have not. I time-boxed this investigation and was not able to come up with a repro either. |
I haven't had other reports on this and to my knowledge we haven't been able to reproduce this. Is there more evidence (like valgrind output) that pinpoints the issue to Nokogiri? |
We only saw one seg fault relating to Nokogiri. I'll close this for now. Thanks for your help. |
Reopening this to investigate a bit more. I have a hunch about threading and … |
Ok, my hunch was right. Here's the analysis of what's going on.

**some context**

We've seen "at exit" segfaults from libxml2 before. A good analysis of this class of bug is available at #2059, but the summary is:
**the hunch**

Over this past weekend, after we reopened this issue and https://gitlab.com/gitlab-org/gitlab/-/issues/390313, I noticed two things that I hadn't seen before:
Combined with the knowledge that this segfault only ever happens after the test suite has completed ("at exit"), this got me thinking about libxml2's thread support, and whether libxml2 was similarly trying to clean up some per-thread memory "at exit".

**the mechanics**

I'll skip the mystery tour and get right to the diagnosis. If you look for calls to `xmlFreeGlobalState`, you'll find it in https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/threads.c#L468-483:

```c
/**
 * xmlFreeGlobalState:
 * @state:  a thread global state
 *
 * xmlFreeGlobalState() is called when a thread terminates with a non-NULL
 * global state. It is is used here to reclaim memory resources.
 */
static void
xmlFreeGlobalState(void *state)
{
    xmlGlobalState *gs = (xmlGlobalState *) state;

    /* free any memory allocated in the thread's xmlLastError */
    xmlResetError(&(gs->xmlLastError));
    free(state);
}
```

What this function does is free up some "thread local" storage containing the most recent error that the parser encountered (see the `xmlResetError(&(gs->xmlLastError))` call).

This function is invoked on Linux at thread exit: it's the destructor callback passed to `pthread_key_create`:

```c
pthread_key_create(&globalkey, xmlFreeGlobalState);
```

After a little bit of playing around, I was able to construct a reproduction. Here's what it does:
If you're lucky (or unlucky), the process will crash "at exit" with output like this:

```
[BUG] Segmentation fault at 0x0000000000000440
ruby 3.0.5p211 (2022-11-24 revision 3769593990) [x86_64-linux]

-- Machine register context ------------------------------------------------
 RIP: 0x00007fcd3acf304e RBP: 0x00007fcc4c026a20 RSP: 0x00007fcb21b90de0
 RAX: 0x0000000000000000 RBX: 0x00007fcc4c026d78 RCX: 0x0000000000000031
 RDX: 0x00007fcd368d6220 RDI: 0x00007fcc4c025ee0 RSI: 0x0000000000000000
  R8: 0x00007fcb21b90de4  R9: 0x00000000000000ca R10: 0x0000000000000000
 R11: 0x0000000000000246 R12: 0x00007fcc4c025ee0 R13: 0x00007fcd3aa1bae8
 R14: 0x0000000000000004 R15: 0x00007fcb21b91b58 EFL: 0x0000000000010202

-- C level backtrace information -------------------------------------------
SEGV received in SEGV handler
```

**the repro**

Here it is:

```ruby
#! /usr/bin/env ruby
require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

html = "<div foo='asdf>asdf</div>" # needs to have errors in it!
thread_spawn_window = 0.5 # seconds
wait_time = 3.0 - (thread_spawn_window / 2) # idle ruby threads exit after 3 seconds

start_time = Time.now
threads = []
while (Time.now - start_time) < thread_spawn_window do
  threads << Thread.new { Nokogiri::HTML4::Document.parse(html) }
end
threads.take(5).map(&:join)

sleep wait_time
```

On my dev machine I can usually reproduce this issue within 5-10 iterations. You can increase your chances of triggering this by injecting the following patch into Ruby:

```diff
diff --git a/eval.c b/eval.c
index adacde9e..9965453b 100644
--- a/eval.c
+++ b/eval.c
@@ -271,6 +271,8 @@ rb_ec_cleanup(rb_execution_context_t *ec, enum ruby_tag_type ex)

     if (signaled) ruby_default_signal(signaled);

+    sleep(1);
+
     return sysex;
 }
```

which keeps the process alive for a second after the VM has been torn down.

**recap**

OK, to summarize again what's happening here:
This is exceedingly unlikely to happen, but GitLab's test suite and CI environment must have hit a sweet spot where the timing was just right to increase the chances of hitting this. Potentially, Nokogiri 1.14.0's changes to GC may even have contributed to these timing changes.

**how to fix this**

There are a couple of potential approaches that come to mind, and I'm not sure yet which one is best.

1. **stop configuring libxml2 to use Ruby's memory management methods**

   We've actually been discussing making this change for performance reasons! The tradeoff discussed in that issue is:
@larskanis has suggested a potential path forward on this approach in this comment.

2. **patch libxml2 to not invoke this callback after the Ruby VM has been torn down**

   I'm not actually sure how to do this, and the patch seems likely to be awkwardly complex. And this would still be a problem for users running Nokogiri with an unpatched libxml2 system library.

3. **patch libxml2 to not save errors in thread-local memory**

   We don't actually need the thread-local `xmlLastError` state, but again, this would still be a problem for users running Nokogiri with an unpatched libxml2 system library.

4. **hybrid approach**

   We could do a combination:
... I think I'd like to try either 1 or 4, but I need to play with it a bit to be sure. |
I've submitted a patch upstream to Ruby to try to address this within the interpreter: https://bugs.ruby-lang.org/issues/19580, with the fix in ruby/ruby#7663.
I'm also working on a change to Nokogiri to allow users to control the memory management model used by libxml2 via an environment variable. |
OK, I've merged #2843 -- that will be in v1.15 which I hope to ship in the next few weeks. |
As sparklemotion/nokogiri#2785 (comment) explains, there is a bug in the Ruby interpreter (https://bugs.ruby-lang.org/issues/19580), fixed upstream in ruby/ruby#7663, that causes a seg fault during shutdown with libxml2/Nokogiri.

We patched the Ruby interpreter in CI to work around the problem (https://gitlab.com/gitlab-org/gitlab-build-images/-/merge_requests/672) in https://gitlab.com/gitlab-org/gitlab/-/issues/390313, but it appears the seg faults have now appeared in production. On GitLab.com, this week we have seen more than 20 cases with the error:

```
[BUG] Segmentation fault at 0x0000000000000440
```

We could also work around this problem by setting `NOKOGIRI_LIBXML_MEMORY_MANAGEMENT=default`, but this may cause unexpected memory growth since Ruby would no longer manage the memory (see https://github.com/sparklemotion/nokogiri/pull/2843/files). Let's just fix the interpreter, since otherwise we'd need to make sure that environment variable is set in every environment that uses Nokogiri (including Rake tasks).

Changelog: fixed
Please describe the bug
We have a CI job that runs a number of rspec jobs. Since upgrading to Nokogiri v1.14.0, we noticed an uptick of seg faults. In this example, it seems that the seg fault happened at the end of the test run (https://gitlab.com/gitlab-org/gitlab/-/jobs/3697213765):
The backtrace suggests `nokogiri` or `libxml2` is calling `xmlResetError()`: https://github.com/GNOME/libxml2/blob/f507d167f1755b7eaea09fb1a44d29aab828b6d1/error.c#L873-L891.

The Ruby interpreter (v3.0.5) is patched, but the `gc.c:10929` line corresponds to the `objspace_xfree` call in https://github.com/ruby/ruby/blob/ba5cf0f7c52d4d35cc6a173c89eda98ceffa2dcf/gc.c#L10909.

This might relate to the changes in #2480. I have to wonder if an error is being allocated with `malloc` instead of `ruby_xmalloc`.

Help us reproduce what you're seeing
We're not yet sure how to reproduce the seg fault. We're discussing the issue in https://gitlab.com/gitlab-org/gitlab/-/issues/390313.
Expected behavior
No seg faults.
Environment