
Sidekiq worker is stuck, doesn't respond to TTIN #2796

Closed
DouweM opened this issue Jan 27, 2016 · 7 comments

DouweM (Contributor) commented Jan 27, 2016

Running GitLab, I'm seeing Sidekiq v4.0.1 get stuck periodically: it stops processing jobs and stops responding altogether.
I've never seen this kind of issue with any other GitLab instance, but on this particular one it happens about once a day, and I'm at a loss.

ps aux | grep sidekiq shows:

git      36646 20.0 10.6 2827564 1294940 ?     Tsl  14:17   2:33 sidekiq 4.0.1 gitlab-rails [25 of 25 busy]

The Processes list in Sidekiq Web is empty, the Busy counter reads 0, and the Jobs list is empty.

Nothing shows up in the Sidekiq log after kill -TTIN 36646, which should normally make Sidekiq dump a backtrace for each of its threads.

GDB output per the Troubleshooting doc: https://gist.github.com/DouweM/0f15e8f841a7d5643255

(gdb) call (void)rb_backtrace() output from the log: https://gist.github.com/DouweM/36a8e0bfbb230d876062
The last frame is incorrectly ascribed to gitlab_git; in reality it's Rugged::Diff#each_patch.

Any idea what could be going on?

Thanks a lot!

mperham (Collaborator) commented Jan 27, 2016

I don't support GitLab; please contact them for help.

mperham closed this as completed Jan 27, 2016

DouweM (Contributor, Author) commented Jan 27, 2016

@mperham Right, I don't expect you to; I work at GitLab, so that's actually my job :)

This issue was reported to us as Sidekiq getting stuck, and from the symptoms (25 of 25 busy, nothing showing up under Processes, not responding to TTIN) it sounds to me like an issue in Sidekiq rather than in GitLab code. I would love some help with debugging, as I'm not familiar enough with Sidekiq, Ruby threading internals, or GDB to get any further from here.

Does the GDB output or the rb_backtrace() output lead you to believe the issue lies with GitLab, specifically in the top frame of that stacktrace, or is that just the place where the Ruby thread happened to get stuck for some external reason?

mperham (Collaborator) commented Jan 27, 2016

Sidekiq getting stuck is always due to application code. I can't remember the last time Sidekiq had an actual bug causing a lockup. If Sidekiq is not responding to TTIN, that's typically due to a native gem that is erroneously holding the GVL: the TTIN handler runs Ruby code, so it can't execute while a native call is holding the GVL. According to your output, this thread is performing a rugged operation without releasing the GVL:

Thread 20 (Thread 0x7f2e9e10e700 (LWP 36692)):
#0  0x0000003d7b0f80ce in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x0000003d7b07d313 in _L_lock_10110 () from /lib64/libc.so.6
#2  0x0000003d7b07abbf in malloc () from /lib64/libc.so.6
#3  0x0000003d7b07c0f8 in realloc () from /lib64/libc.so.6
#4  0x00007f2ea0263e3e in git__realloc (ptr=0x7f2e6bb674f0, size=3538944) at /var/cache/omnibus/src/libgit2/src/util.h:211
#5  0x00007f2ea02640de in git_buf_try_grow (buf=0x7f2e9e10b8e0, target_size=3407872, mark_oom=true) at /var/cache/omnibus/src/libgit2/src/buffer.c:79
#6  0x00007f2ea026422a in git_buf_grow_by (buffer=0x7f2e9e10b8e0, additional_size=1048576) at /var/cache/omnibus/src/libgit2/src/buffer.c:115
#7  0x00007f2ea02b4b3e in git_zstream_deflatebuf (out=0x7f2e9e10b8e0, in=0x116faae0, in_len=6963796) at /var/cache/omnibus/src/libgit2/src/zstream.c:137
#8  0x00007f2ea02747a8 in create_binary (out_type=0x7f2e9e10b950, out_data=0x7f2e9e10b958, out_datalen=0x7f2e9e10b960, out_inflatedlen=0x7f2e9e10b968, a_data=0x7f2ea02ebed0 "", a_datalen=0, 
    b_data=0x116faae0 "\037\213\b\bf\205qV", b_datalen=6963796) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:263
#9  0x00007f2ea02749c6 in diff_binary (output=0x7f2e9e10ba40, patch=0x7f2e69907b10) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:320
#10 0x00007f2ea0274b69 in diff_patch_generate (patch=0x7f2e69907b10, output=0x7f2e9e10ba40) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:362
#11 0x00007f2ea0275cd8 in git_patch_from_diff (patch_ptr=0x7f2e9e10bae8, diff=0x7f2e684e01b0, idx=7) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:773
#12 0x00007f2ea07f2610 in rb_git_diff_each_patch (self=121977600) at rugged_diff.c:475

Other threads, including 9 and 17, are blocked, waiting for thread 20 to release the GVL. Make sure you are using the latest rugged and maybe open an issue with them. It's not safe to wait for an OS lock while also holding the GVL.
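
For anyone debugging a similar lockup: the usual fix on the extension side is to release the GVL around any C call that can block, using Ruby's C API. Below is a minimal sketch of that pattern, not rugged's actual code; slow_native_diff() is a hypothetical stand-in for the libgit2 work, and the real API used is rb_thread_call_without_gvl from ruby/thread.h.

#include <ruby.h>
#include <ruby/thread.h>

/* Hypothetical stand-in for the blocking libgit2 work. */
static void *
slow_native_diff(void *arg)
{
    /* ... long-running C code that may allocate, take locks, or wait ... */
    return NULL;
}

/* Run the blocking work without holding the GVL, so other Ruby threads
 * (including Sidekiq's TTIN handling) can keep running in the meantime. */
static VALUE
each_patch_without_gvl(VALUE self)
{
    rb_thread_call_without_gvl(slow_native_diff, NULL, RUBY_UBF_IO, NULL);
    return Qnil;
}

void
Init_gvl_sketch(void)
{
    VALUE mod = rb_define_module("GvlSketch");
    rb_define_module_function(mod, "each_patch_without_gvl", each_patch_without_gvl, 0);
}

The trade-off is that code running without the GVL must not touch Ruby objects or call back into the Ruby VM, which is why extensions sometimes skip this for operations they expect to be short.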

As a side note, this is the type of diagnosis I usually charge for. This one's free, but I'd encourage GitLab to purchase a license to get pro support.

DouweM (Contributor, Author) commented Jan 28, 2016

Thanks @mperham, that helps tremendously. I will continue debugging from here.

It's a testament to Sidekiq's stability and robustness that we've never needed support until now. If we ever need help again we will gladly reach out to you via http://sidekiq.org/support and get a support contract. If I came off a little "demanding" with this issue, I didn't mean to. I understand that your time is valuable.

Sidekiq Pro looks great, but it's currently not interesting to us, since we would need an Appliance license and it would only be available to our GitLab Enterprise Edition customers. Sidekiq "basic" is currently serving our users more than adequately, whether they're on the Community or the Enterprise Edition.

mperham (Collaborator) commented Jan 28, 2016

Glad you see my POV and thanks for the kind words.

As a suggestion, you can buy a license just for the support; you don't have to distribute it. Travis CI and Discourse are Sidekiq Pro customers for the support; they don't actually ship or use the Pro bits in their products. $950/yr is a lot cheaper than an appliance license.

DouweM (Contributor, Author) commented Jan 28, 2016

@mperham Fair enough, I'll keep that in mind.

DouweM (Contributor, Author) commented Apr 8, 2016

In case anyone else stumbles upon this issue, the clue is in this thread dump:

Thread 20 (Thread 0x7f2e9e10e700 (LWP 36692)):
#0  0x0000003d7b0f80ce in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x0000003d7b07d313 in _L_lock_10110 () from /lib64/libc.so.6
#2  0x0000003d7b07abbf in malloc () from /lib64/libc.so.6
#3  0x0000003d7b07c0f8 in realloc () from /lib64/libc.so.6
#4  0x00007f2ea0263e3e in git__realloc (ptr=0x7f2e6bb674f0, size=3538944) at /var/cache/omnibus/src/libgit2/src/util.h:211
#5  0x00007f2ea02640de in git_buf_try_grow (buf=0x7f2e9e10b8e0, target_size=3407872, mark_oom=true) at /var/cache/omnibus/src/libgit2/src/buffer.c:79
#6  0x00007f2ea026422a in git_buf_grow_by (buffer=0x7f2e9e10b8e0, additional_size=1048576) at /var/cache/omnibus/src/libgit2/src/buffer.c:115
#7  0x00007f2ea02b4b3e in git_zstream_deflatebuf (out=0x7f2e9e10b8e0, in=0x116faae0, in_len=6963796) at /var/cache/omnibus/src/libgit2/src/zstream.c:137
#8  0x00007f2ea02747a8 in create_binary (out_type=0x7f2e9e10b950, out_data=0x7f2e9e10b958, out_datalen=0x7f2e9e10b960, out_inflatedlen=0x7f2e9e10b968, a_data=0x7f2ea02ebed0 "", a_datalen=0, 
    b_data=0x116faae0 "\037\213\b\bf\205qV", b_datalen=6963796) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:263
#9  0x00007f2ea02749c6 in diff_binary (output=0x7f2e9e10ba40, patch=0x7f2e69907b10) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:320
#10 0x00007f2ea0274b69 in diff_patch_generate (patch=0x7f2e69907b10, output=0x7f2e9e10ba40) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:362
#11 0x00007f2ea0275cd8 in git_patch_from_diff (patch_ptr=0x7f2e9e10bae8, diff=0x7f2e684e01b0, idx=7) at /var/cache/omnibus/src/libgit2/src/diff_patch.c:773
#12 0x00007f2ea07f2610 in rb_git_diff_each_patch (self=121977600) at rugged_diff.c:475

As implausible as it sounds, malloc got into a deadlock.

After a lot of head scratching, someone on Twitter pointed out that they had hit the same issue and fixed it by updating glibc. Apparently, CentOS / Red Hat / Scientific Linux 6.7 ships with a broken glibc-2.12-1.166 that can cause a deadlock in malloc/free: https://bugzilla.redhat.com/show_bug.cgi?id=1244002, https://rhn.redhat.com/errata/RHBA-2015-1465.html.

Updating glibc resolved the issue.
