Threads getting saturated and locking up #1399
If the behavior hasn't changed from 2.15.3 to 3.10, it's probably a locked mutex in your application rather than a problem with Puma. In Rails, we have DebugLocks; I would give that a shot. If that doesn't turn up anything, you'll need to dig in with dtrace. You could also try #1320. Please reopen if you've got anything reproducible.
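For reference, enabling DebugLocks is a one-line middleware addition (Rails 5+ only); a sketch along the lines of the Rails guides:

```ruby
# config/environments/development.rb (Rails 5+)
Rails.application.configure do
  # Insert the DebugLocks middleware; then request /rails/locks while
  # the app appears hung to see which threads hold or wait on the
  # Rails executor/reloader interlock.
  config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks
end
```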
@nateberkopec Is there anything like DebugLocks for Rails v4, or is it v5 only? Also, #1320 notes that SIGINFO is BSD-only; is there a way to get that information from Puma on Linux?
@dnd Yeah, that's part of the reason I didn't merge it yet; we're kind of "out of signals". It's a pretty simple patch, though: you can change it to respond to any signal we don't already handle (see docs/signals.md). DebugLocks is Rails 5+ only.
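The same idea works without the patch: trap a spare signal yourself and dump every thread's backtrace. A hedged sketch (not Puma's own API; SIGWINCH is only an example of a signal Puma doesn't already use per docs/signals.md):

```ruby
# Sketch: on SIGWINCH, print a backtrace for every live thread so you
# can see where stuck threads are blocked. Place this in an initializer
# or the Puma config; the signal choice is an assumption, adjust as needed.
Signal.trap("WINCH") do
  Thread.list.each do |thread|
    $stderr.puts "=== Thread #{thread.object_id} (#{thread.status}) ==="
    $stderr.puts((thread.backtrace || ["<no backtrace>"]).join("\n"))
  end
end
```

Trigger it with `kill -WINCH <puma_pid>` while the server is hung.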
@nateberkopec Thanks. Since signal pickings are slim, is this something the control server could expose via an endpoint instead? I'm not familiar enough with it to know whether that's possible.
Perhaps? Haven't investigated. |
The reason I was using signals was because that was the original feature request. |
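For what it's worth, Puma already ships an optional control app with a `/stats` endpoint, which could be a natural home for such data. A hedged sketch of enabling it (the bind URL and token here are illustrative, not defaults):

```ruby
# config/puma.rb: enable Puma's built-in control app.
activate_control_app "tcp://127.0.0.1:9293", auth_token: "s3cret"
```

Then `curl "http://127.0.0.1:9293/stats?token=s3cret"` returns the running/backlog counts; a per-thread backtrace dump would still need a new endpoint.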
In an application that has functioned flawlessly for 20 months, last week we started finding Puma becoming unresponsive on all servers, seemingly out of nowhere. This led to nginx erroring with `upstream timed out (110: Connection timed out) while reading response header from upstream`, which then causes haproxy to take the machine out of rotation. The Puma servers do not recover from this without being restarted. The `running` count stays at the thread max, with some requests in the backlog, and doesn't come down.

The problem appears to be intermittent, with no immediately discernible cause. Right now it happens once every day, sometimes twice. As mentioned, this application has run without problem for 20 months; the last release before the problem started was about a month ago.
Load on the servers is very low, at only about 0.1 to 0.3. Total memory usage has been consistently around 99%. Prior to the first lockup, swap usage was also regularly close to max; since the first incident, swap usage has never gone above 5% before the problem recurred. I don't know if that means anything or not. Network usage also shows nothing abnormal before or during the lockup, other than dropping to 0 when the machine is taken out of rotation.

After it happened a couple of times we upgraded from v2.15.3 to v3.10.0, but nothing changed. This application is an internal API using Grape mounted inside Rails, running on Ruby 2.1.2. It is only connected to by two internal applications, which use faraday and net-http-persistent. Neither of those applications had a release within the two weeks leading up to this problem starting.
The application is currently hosted on 4 machines, each with 2 cores and 2G of memory. Before the problems they were configured with 2 workers and 6 threads max. Since the problem started I tried increasing the max threads to 12, which made the lockup happen much quicker. I have moved them down to 4 threads; that isn't worse, but it isn't better, and the problem still occurs.
I'm at a loss as to where to go next with this. Like I said, it had been running without issue, and nothing changed close to when this started happening.
Is there any way to see what request each thread is handling, to see if maybe there's some kind of correlation there?
Any help at all would be much appreciated.
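On the question of seeing what request each thread is handling: one low-tech approach (a hedged sketch, not an existing Puma feature; the middleware name and thread-local key are made up) is a tiny Rack middleware that tags each worker thread with its in-flight request, which a signal handler or log line can then read:

```ruby
# Sketch of a Rack middleware that records the current request in a
# thread-local so stuck threads can be correlated with requests.
class RequestTagger
  def initialize(app)
    @app = app
  end

  def call(env)
    # Tag this thread with the request it is serving.
    Thread.current[:current_request] = "#{env['REQUEST_METHOD']} #{env['PATH_INFO']}"
    @app.call(env)
  ensure
    # Clear the tag even if the app raises.
    Thread.current[:current_request] = nil
  end
end
```

A debugging signal handler can then print `Thread.current[:current_request]` for each thread alongside its backtrace.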