Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intermittent segmentation faults after upgrading to 2.5.3 #251

Closed
aogata-inst opened this issue Sep 15, 2020 · 15 comments
Closed

intermittent segmentation faults after upgrading to 2.5.3 #251

aogata-inst opened this issue Sep 15, 2020 · 15 comments
Assignees

Comments

@aogata-inst
Copy link

What operating system + architecture are you using?
Alpine 3.11
Ubuntu 18.04 (Bionic)

Ruby 2.6

What version of nio4r are you using?
v2.5.3

What did you do?
This error happens intermittently while running selenium tests through rspec. We’re importing nio4r through the “puma” gem. This incident began occurring immediately after updating nio4r from 2.5.2 to 2.5.3.

This version included this commit (507c199) which appears to be related.

What error did you see?
Segmentation Fault Examples

/.../activesupport-5.2.4.3/lib/active_support/notifications/instrumenter.rb:25: [BUG] object allocation during garbage collection phase
/.../nokogiri-1.10.9/lib/nokogiri/xml/node.rb:767: [BUG] Segmentation fault at 0x0000000000000000
/.../nokogiri-1.10.9/lib/nokogiri/xml/node.rb:767: [BUG] Segmentation fault at 0x0000000000000080
/.../nokogiri-1.10.9/lib/nokogiri/xml/node.rb:767: [BUG] Segmentation fault at 0x00007f42744d464b
/.../nokogiri-1.10.9/lib/nokogiri/html/document.rb:205: [BUG] Segmentation fault at 0x0000000000000000
/.../nokogiri-1.10.9/lib/nokogiri/html/document.rb:205: [BUG] Segmentation fault at 0x0000000000000080
/.../nokogiri-1.10.9/lib/nokogiri/html/document.rb:205: [BUG] Segmentation fault at 0x00000000000000a0
/.../nokogiri-1.10.9/lib/nokogiri/xml/document.rb:219: [BUG] Segmentation fault at 0x0000000000000000
/.../puma-4.3.3/lib/puma/reactor.rb:166: [BUG] Segmentation fault at 0x0000000000000000
/.../nokogiri-1.10.9/lib/nokogiri/xml/searchable.rb:190: [BUG] Segmentation fault at 0x0000000000000080

Example Stacktrace (from [BUG] object allocation during garbage collection phase)

[2020-09-09T18:23:27.748Z] -- Control frame information -----------------------------------------------
[2020-09-09T18:23:27.748Z] c:0067 p:---- s:0469 e:000468 CFUNC :name
[2020-09-09T18:23:27.748Z] c:0066 p:0029 s:0465 e:000462 RESCUE /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/notifications/instrumenter.rb:25
[2020-09-09T18:23:27.748Z] c:0065 p:0019 s:0459 e:000458 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/notifications/instrumenter.rb:24
[2020-09-09T18:23:27.748Z] c:0064 p:0027 s:0451 e:000450 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/notifications.rb:168
[2020-09-09T18:23:27.748Z] c:0063 p:0076 s:0445 e:000444 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/abstract_renderer.rb:43
[2020-09-09T18:23:27.748Z] c:0062 p:0013 s:0439 e:000438 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/partial_renderer.rb:319
[2020-09-09T18:23:27.748Z] c:0061 p:0083 s:0435 e:000434 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/partial_renderer.rb:310
[2020-09-09T18:23:27.748Z] c:0060 p:0023 s:0428 e:000427 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/renderer.rb:49
[2020-09-09T18:23:27.748Z] c:0059 p:0017 s:0421 e:000420 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/renderer.rb:23
[2020-09-09T18:23:27.748Z] c:0058 p:0082 s:0415 e:000414 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/helpers/rendering_helper.rb:33
[2020-09-09T18:23:27.749Z] c:0057 p:0010 s:0408 e:000407 BLOCK /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:46
[2020-09-09T18:23:27.749Z] c:0056 p:0033 s:0405 e:000404 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/logger_silence.rb:21
[2020-09-09T18:23:27.749Z] c:0055 p:0033 s:0399 e:000398 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:46
[2020-09-09T18:23:27.749Z] c:0054 p:0013 s:0393 e:000391 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/routing/inspector.rb:198
[2020-09-09T18:23:27.749Z] c:0053 p:0064 s:0387 e:000386 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/routing/inspector.rb:73
[2020-09-09T18:23:27.749Z] c:0052 p:0195 s:0379 e:000377 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/templates/rescues/routing_error.html.erb:28
[2020-09-09T18:23:27.749Z] c:0051 p:0025 s:0370 e:000369 BLOCK /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/template.rb:159
[2020-09-09T18:23:27.749Z] c:0050 p:0039 s:0367 e:000366 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/notifications.rb:170
[2020-09-09T18:23:27.749Z] c:0049 p:0024 s:0361 e:000360 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/template.rb:354
[2020-09-09T18:23:27.749Z] c:0048 p:0008 s:0356 e:000355 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/template.rb:157
[2020-09-09T18:23:27.749Z] c:0047 p:0051 s:0347 e:000346 METHOD /usr/src/app/config/initializers/i18n.rb:264
[2020-09-09T18:23:27.749Z] c:0046 p:0050 s:0340 e:000339 METHOD /usr/src/app/config/initializers/template_streaming.rb:114
[2020-09-09T18:23:27.749Z] c:0045 p:0014 s:0333 e:000332 BLOCK /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/template_renderer.rb:54
[2020-09-09T18:23:27.749Z] c:0044 p:0005 s:0330 e:000329 BLOCK /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/abstract_renderer.rb:44
[2020-09-09T18:23:27.749Z] c:0043 p:0011 s:0326 e:000325 BLOCK /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/notifications.rb:168
[2020-09-09T18:23:27.749Z] c:0042 p:0018 s:0323 e:000322 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/notifications/instrumenter.rb:23
[2020-09-09T18:23:27.749Z] c:0041 p:0027 s:0315 e:000314 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/notifications.rb:168
[2020-09-09T18:23:27.749Z] c:0040 p:0076 s:0309 e:000308 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/abstract_renderer.rb:43
[2020-09-09T18:23:27.749Z] c:0039 p:0020 s:0303 e:000302 BLOCK /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/template_renderer.rb:53
[2020-09-09T18:23:27.749Z] c:0038 p:0032 s:0299 e:000298 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/template_renderer.rb:61
[2020-09-09T18:23:27.749Z] c:0037 p:0030 s:0290 e:000289 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/template_renderer.rb:52
[2020-09-09T18:23:27.749Z] c:0036 p:0093 s:0282 e:000281 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/template_renderer.rb:16
[2020-09-09T18:23:27.749Z] c:0035 p:0020 s:0275 e:000274 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/renderer.rb:44
[2020-09-09T18:23:27.749Z] c:0034 p:0026 s:0269 e:000268 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/renderer/renderer.rb:25
[2020-09-09T18:23:27.749Z] c:0033 p:0082 s:0263 e:000262 METHOD /home/docker/.gem/2.6.6/gems/actionview-5.2.4.3/lib/action_view/helpers/rendering_helper.rb:33
[2020-09-09T18:23:27.749Z] c:0032 p:0010 s:0256 e:000255 BLOCK /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:46
[2020-09-09T18:23:27.749Z] c:0031 p:0033 s:0253 e:000252 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/logger_silence.rb:21
[2020-09-09T18:23:27.749Z] c:0030 p:0033 s:0247 e:000246 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:46
[2020-09-09T18:23:27.749Z] c:0029 p:0068 s:0241 e:000240 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:102
[2020-09-09T18:23:27.749Z] c:0028 p:0078 s:0231 e:000230 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:87
[2020-09-09T18:23:27.749Z] c:0027 p:0039 s:0222 e:000221 RESCUE /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:71
[2020-09-09T18:23:27.749Z] c:0026 p:0127 s:0218 e:000217 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/debug_exceptions.rb:59
[2020-09-09T18:23:27.749Z] c:0025 p:0024 s:0207 e:000206 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/show_exceptions.rb:33
[2020-09-09T18:23:27.749Z] c:0024 p:0042 s:0200 E:0007f0 METHOD /home/docker/.gem/2.6.6/gems/railties-5.2.4.3/lib/rails/rack/logger.rb:38
[2020-09-09T18:23:27.749Z] c:0023 p:0051 s:0190 e:000189 METHOD /home/docker/.gem/2.6.6/gems/railties-5.2.4.3/lib/rails/rack/logger.rb:28
[2020-09-09T18:23:27.749Z] c:0022 p:0053 s:0184 e:000183 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/remote_ip.rb:81
[2020-09-09T18:23:27.749Z] c:0021 p:0019 s:0178 e:000177 METHOD /home/docker/.gem/2.6.6/gems/request_store-1.5.0/lib/request_store/middleware.rb:19
[2020-09-09T18:23:27.749Z] c:0020 p:0091 s:0171 e:000170 METHOD /usr/src/app/app/middleware/request_context_generator.rb:49
[2020-09-09T18:23:27.749Z] c:0019 p:0091 s:0160 e:000159 METHOD /home/docker/.gem/2.6.6/gems/rack-2.2.3/lib/rack/method_override.rb:24
[2020-09-09T18:23:27.749Z] c:0018 p:0059 s:0154 e:000153 METHOD /usr/src/app/app/middleware/prevent_non_multipart_parse.rb:33
[2020-09-09T18:23:27.749Z] c:0017 p:0020 s:0149 e:000148 METHOD /home/docker/.gem/2.6.6/gems/rack-2.2.3/lib/rack/runtime.rb:22
[2020-09-09T18:23:27.749Z] c:0016 p:0033 s:0139 e:000138 METHOD /home/docker/.gem/2.6.6/gems/activesupport-5.2.4.3/lib/active_support/cache/strategy/local_cache_middleware.rb:29
[2020-09-09T18:23:27.749Z] c:0015 p:0016 s:0132 e:000131 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/executor.rb:14
[2020-09-09T18:23:27.749Z] c:0014 p:0085 s:0124 e:000123 METHOD /home/docker/.gem/2.6.6/gems/actionpack-5.2.4.3/lib/action_dispatch/middleware/static.rb:127
[2020-09-09T18:23:27.749Z] c:0013 p:0008 s:0116 e:000115 METHOD /home/docker/.gem/2.6.6/gems/rack-2.2.3/lib/rack/sendfile.rb:110
[2020-09-09T18:23:27.749Z] c:0012 p:0020 s:0104 e:000103 METHOD /home/docker/.gem/2.6.6/gems/railties-5.2.4.3/lib/rails/engine.rb:524 [FINISH]
[2020-09-09T18:23:27.749Z] c:0011 p:---- s:0098 e:000097 CFUNC :public_send
[2020-09-09T18:23:27.749Z] c:0010 p:0028 s:0092 e:000091 METHOD /home/docker/.gem/2.6.6/gems/railties-5.2.4.3/lib/rails/railtie.rb:190
[2020-09-09T18:23:27.749Z] c:0009 p:0009 s:0085 e:000084 BLOCK /usr/src/app/spec/selenium/test_setup/selenium_driver_setup.rb:445
[2020-09-09T18:23:27.749Z] c:0008 p:0022 s:0082 e:000081 METHOD /usr/local/lib/ruby/2.6.0/monitor.rb:235
[2020-09-09T18:23:27.749Z] c:0007 p:0054 s:0078 e:000077 BLOCK /usr/src/app/spec/selenium/test_setup/selenium_driver_setup.rb:445
[2020-09-09T18:23:27.749Z] c:0006 p:0144 s:0072 e:000071 BLOCK /usr/src/app/spec/selenium/test_setup/selenium_driver_setup.rb:470
[2020-09-09T18:23:27.749Z] c:0005 p:0263 s:0065 e:000064 METHOD /home/docker/.gem/2.6.6/gems/puma-4.3.3/lib/puma/server.rb:682
[2020-09-09T18:23:27.749Z] c:0004 p:0029 s:0040 e:000039 METHOD /home/docker/.gem/2.6.6/gems/puma-4.3.3/lib/puma/server.rb:472
[2020-09-09T18:23:27.749Z] c:0003 p:0065 s:0026 e:000025 BLOCK /home/docker/.gem/2.6.6/gems/puma-4.3.3/lib/puma/server.rb:328
[2020-09-09T18:23:27.749Z] c:0002 p:0107 s:0016 e:000015 BLOCK /home/docker/.gem/2.6.6/gems/puma-4.3.3/lib/puma/thread_pool.rb:134 [FINISH]
[2020-09-09T18:23:27.749Z] c:0001 p:---- s:0003 e:000002 (none) [FINISH]

Current workaround?
We were able to circumvent this error by pinning back to v2.5.2

@tarcieri
Copy link
Contributor

See #248

@ioquatix ioquatix self-assigned this Sep 15, 2020
@tenderlove
Copy link

tenderlove commented Sep 15, 2020

Do you have any other information you can provide? Maybe a repro or a core file? The message object allocation during garbage collection phase means that some GC callback is trying to allocate a Ruby object.

#248 seems like a fix that should be applied (it looks like self might be uninitialized in some cases) but rb_gc_mark shouldn't allocate anything even if the pointer is bad, so I'm not convinced this SEGV has anything to do with 507c199.

IOW 507c199 could cause SEGVs, but I'm not convinced it's causing this one.

@boazsegev
Copy link
Contributor

boazsegev commented Sep 15, 2020

@tenderlove - I solved the issue in nio4r by initializing the allocated memory to (mostly) zero before returning from the alloc function.

This means that the Ruby interpreter assumes that the Ruby Object is valid immediately after allocation, even before initialization... which might not be true for C extension objects.

Personally I believe only an initialized object should be marked as an "existing" object unless all malloc calls are rerouted to calloc (since malloc may return "junk data" and uninitialized objects are just pieces of recycled arbitrary data).

I didn't read the Ruby code base except in bits and pieces, but perhaps a C-data structure should have a bit set to mark that it was initialized before a rb_gc_mark callback is called for that object? Or, perhaps Objects should be added to the ObjectSpace after initialization rather than after allocation.

Just a meandering thought of post-incident review. I might be completely wrong and maybe my patch solved the issue only only specific machine.

Good luck!

@tenderlove
Copy link

tenderlove commented Sep 15, 2020

@boazsegev sorry, this doesn't make sense. The error clearly states object allocation during garbage collection phase. That error occurs when the GC is running (marking, sweeping, or compacting) and something tries to allocate an object while the GC is running. This typically happens in a C extension's GC callbacks (or if something is running in parallel like a thread or someone forgot to lock the gvl). Specifically the C extensions' mark, sweep, or compact callback.

rb_gc_mark only does some pointer math to look up a table and set a bit. It doesn't allocate any objects, so I don't see how it could cause this error.

If applying the patch fixes this error, great. But you didn't explain how introducing rb_gc_mark causes this error to happen, even if the address is bad.

@boazsegev
Copy link
Contributor

boazsegev commented Sep 15, 2020

@tenderlove ,

I don't think that nio4r exist the GVL for new object allocation. I think it only exists the GVL during the libev loop.

My working theory was (is?) that the issue occurs when (in multi-threaded environments):

  1. thread A allocates an object. The alloc function returns and the thread is interrupted and paused after adding the newly created object to the ObjectSpace but before the initialization function initializes the object.

  2. thread B tries to allocate an object. GC performs a memory sweep to clear unused memory and the newly allocated (but uninitialized object) is asked by the GC "what objects do you reference?"

  3. The object reviews its own junk data and reports that some random memory address (VALUE) is referenced... at which point anything could happen. Ruby crashing when GC tries to set a bit on a non-existing object that points nowhere is probably a best case scenario.

P.S.

If we're looking at Ruby C extensions, maybe a documentation note is all we need - something to better explain the new GC behavior and the desired use of xmalloc

@tenderlove
Copy link

Are you saying that nio4r is allocating objects inside a thread that isn't holding the GVL lock?

@boazsegev
Copy link
Contributor

Nope. From what I know of the nio4r code base, it keeps the GVL lock as much as possible. Allocations are performed within the lock. Only the polling is performed outside the lock.

@boazsegev
Copy link
Contributor

@tenderlove - this is the only place where the GVL is released, which is the polling system call inside libev.

The threads allocate objects safely, within the GVL.

@ioquatix
Copy link
Member

Okay, I have released v2.5.4 can you please check it.

@tarcieri
Copy link
Contributor

rb_gc_mark only does some pointer math to look up a table and set a bit. It doesn't allocate any objects, so I don't see how it could cause this error.

@tenderlove it looks like rb_gc_mark was being passed a pointer from uninitialized memory.

There was another error involving a SEGV inside of pure Ruby code:

#244 (comment)

I'm not sure by what mechanism object allocation in a GC pass might manifest, but I think it's safe to say there was some rather severe memory corruption at play here.

@boazsegev
Copy link
Contributor

Personally I think it's a documentation issue, where we might need to clarify this part of the documentation for C extension developers.

It should be explicitly stated that C object initialization MUST be performed during the allocation function rather than the Ruby object initialization function. i.e., that Data_Wrap_Struct expects initialized memory.

It would be nice (but perhaps redundant) if developers knew the reason was that the GC mark callback (this code, in our case) might be called immediately after memory allocation and before Ruby object initialization.

Then again, I find that the documentation for C extensions is generally scarce. Half the time I am digging in the Ruby source code to find what I want to know. I totally understand this, it's no big deal. Documentation for C extension development is a huge undertaking... but there's already a section here about TypedData_Wrap_Struct and also here... I think those sections could be clarified.

@ioquatix
Copy link
Member

We should probably migrate nio4r to use TypedData.

@ioquatix
Copy link
Member

@Willianvdv can you please try the latest release and report back?

@bwillis
Copy link

bwillis commented Sep 16, 2020

I work with @Willianvdv. I bumped nio4r to v2.5.4 this afternoon when I saw this fix up hoping it was related. With v2.5.3, it was roughly 50% failure rate on our full CI run and with v2.5.4 I've had 7 full runs without any failures. That gives me enough confidence to say that it's fix 😀

Thanks for the quick turn around! 🙏

@ioquatix
Copy link
Member

ioquatix commented Sep 16, 2020

@boazsegev should take all the credit for this, but thanks everyone for working so quickly to get this resolved. When I woke up and saw this, I had a moment of stress (unknown segv X millions of installs = bad day), but the team really came together to help out. What a fantastic example of collaboration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants