Threads getting saturated and locking up #1399

Closed · dnd opened this issue Aug 22, 2017 · 6 comments

dnd commented Aug 22, 2017

In an application that has functioned flawlessly for 20 months, last week out of nowhere we started finding Puma becoming unresponsive on all servers. This leads to nginx erroring with `upstream timed out (110: Connection timed out) while reading response header from upstream`, which then causes haproxy to take the machine out of rotation. The Puma servers do not recover from this without being restarted. The running count stays at the thread max, with some requests in the backlog, and doesn't come down.

The problem appears to be intermittent, with no immediately discernible cause. Right now it happens about once a day, sometimes twice. As mentioned, this application has run without problems for 20 months. The last release for this application before the problem started was about a month ago.

Load on the servers is very low, only about 0.1 to 0.3. Total memory usage has been consistently around 99%. Prior to the first lockup, swap usage was also regularly close to max. Since the first incident, swap usage has never gone above 5% before the problem recurred. I don't know whether that means anything or not. Network usage also shows nothing abnormal before or during the lockup, other than dropping to 0 when the machine is removed from rotation.

After it happened a couple of times we upgraded from v2.15.3 to v3.10.0, but nothing changed. This application is an internal API using Grape mounted inside of Rails, running on Ruby 2.1.2. It is only connected to by two internal applications using faraday and net-http-persistent. Neither of those applications has had a release in the two weeks leading up to this problem starting.

The application is currently hosted on 4 machines, each with 2 cores and 2 GB of memory. Before the problems they were configured with 2 workers and a maximum of 6 threads. Since the problem started I tried increasing the max threads to 12, and that caused the lockup to happen much more quickly. I have moved them down to 4 threads; that isn't worse, but it isn't better, and the problem still occurs.
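
For reference, a minimal sketch of the kind of Puma config described here (the file name and the thread minimum are assumptions; only the worker count and max threads come from the description above):

```ruby
# config/puma.rb -- sketch of the setup described above, not the actual config.
# 2 workers per 2-core machine, with the thread pool capped as discussed.
workers 2

# The minimum of 1 is an assumption; the max is the value being tuned (6, then 12, then 4).
threads 1, 6
```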

I'm at a loss as to where to go next with this. Like I said, it's been running without issue, and nothing changed close to when this started happening.

Is there any way to see what request each thread is handling, to see if maybe there's some kind of correlation there?

Any help at all would be much appreciated.

@nateberkopec (Member)

If the behavior hasn't changed from 2.15.3 to 3.10, it's probably a locked mutex and not a problem with Puma.

In Rails, we have DebugLocks. I would give that a shot.

If that doesn't turn up anything, you need to dig in with dtrace. You could also try #1320. Please reopen if you've got anything reproducible.
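
For anyone following along, DebugLocks here is the ActionDispatch::DebugLocks middleware (Rails 5+, as noted further down). A minimal sketch of wiring it up, assuming a standard Rails app and following the middleware's documented usage:

```ruby
# config/environments/development.rb (or wherever you scope debugging middleware)
# DebugLocks reports which threads currently hold or are waiting on the Rails
# load interlock; once inserted, GET /rails/locks returns a summary of that state.
Rails.application.configure do
  config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks
end
```

Note that this only covers Rails' own load interlock, not arbitrary mutexes inside the application or its gems.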


dnd commented Aug 22, 2017

@nateberkopec is there anything like DebugLocks for Rails v4, or is it only v5? Also, #1320 notes that SIGINFO is BSD-only. Is there a way to get that information from Puma on Linux?

@nateberkopec (Member)

@dnd yeah, that's part of the reason I didn't merge it yet. We're kind of "out of signals". It's a pretty simple patch; you can change it to respond to any signal we don't already handle (see docs/signals.md). DebugLocks is Rails 5+ only.
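
Pending that patch, a workaround sketch that doesn't depend on Puma at all: trap a signal inside the app and dump every thread's backtrace. SIGALRM here is an arbitrary assumption; pick any signal that neither Puma (see docs/signals.md) nor anything else in the app already uses.

```ruby
# config/initializers/thread_dump.rb (hypothetical file name)
# On the chosen signal, print the status and backtrace of every live thread,
# which is usually enough to see where saturated threads are blocked.
Signal.trap('ALRM') do
  Thread.list.each do |thread|
    $stderr.puts "Thread #{thread.object_id} #{thread.status.inspect}"
    $stderr.puts (thread.backtrace || ['<no backtrace>']).join("\n")
    $stderr.puts '-' * 60
  end
end
```

Trigger it with `kill -ALRM <worker pid>` during a lockup and check Puma's stderr log.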


dnd commented Aug 22, 2017

@nateberkopec thanks. Instead of using a signal, since pickings are slim, is this something the control server could have an endpoint for and return? I'm not familiar enough with it to know if that's possible.

@nateberkopec (Member)

> Instead of using a signal, since pickings are slim, is this something the control server could have an endpoint for and return?

Perhaps? Haven't investigated.
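
For what it's worth, the control app already serves the running/backlog numbers mentioned in the original report, so a backtrace endpoint would presumably sit next to its existing stats route. A sketch of enabling it, with the port and token as placeholders:

```ruby
# config/puma.rb -- enable Puma's control app on localhost.
activate_control_app 'tcp://127.0.0.1:9293', auth_token: 'CHANGE_ME'
```

With that in place, requesting the control app's /stats path (with the token, e.g. via curl or pumactl) should return per-worker backlog and running counts.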

@nateberkopec (Member)

The reason I was using signals is that the original feature request asked for them.
