Child workers becoming zombies/unreachable after exception #2552
Comments
Btw, we just upgraded puma.
@feliperaul I'm curious, what wicked_pdf version are you using? (I'm doubtful that Puma can do anything about the problem you are experiencing.)
Hm. I'm confused as to why an exception in one child worker would lead to all of the child processes stopping. It's a weird part of the Ruby code; I'm not sure if that's really an exception or a segfault, and it appears to be neither: https://github.com/ruby/ruby/blob/98e27016c93455d4e9e208d0666d85929cb62857/vm_insnhelper.c#L91
I was on an older version. Looking again at the exception in Honeybadger, I had 24 occurrences of it yesterday, spread over 10 hours; I have exactly 24 workers, so this can't be a coincidence. Analyzing the logs alongside, I can confirm that after the LAST exception we stopped processing all requests until I woke up this morning and restarted puma. So what probably happened is that on each of these exceptions, that specific child worker halted and stopped processing requests, so capacity shrank during the day, but we only noticed when the last child worker was hit. I know puma is not causing this, but it would be great if this exception/segfault could somehow be rescued instead of freezing the worker, or at least trigger a signal to the master process to re-spawn that worker.
Cool. Thanks. Well, the child can't rescue its own segfault. It can rescue its own exceptions, but this type of thing can't be rescued. That worker is dead and must be killed and restarted. I'll have to look at how our child process heartbeat code works again. I don't think it's unreasonable for a master process to try to kill and restart an unresponsive child.
@nateberkopec that would be awesome and bring even greater resiliency to Puma. |
This is the mechanism that is supposed to cull and then restart a hung worker. It should work in this case, so I am marking this as a bug.
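For reference, the knob that governs that heartbeat check in a cluster config looks roughly like this (a minimal sketch; 60 seconds is just Puma's default, not a recommendation for this app):

```ruby
# config/puma.rb - illustrative sketch only
workers 4

# If a worker hasn't checked in with the master within this many seconds,
# the master is supposed to terminate it and fork a replacement.
worker_timeout 60
```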
Part of the problem with fixing this is going to be that I have no idea how to artificially trigger this exception.
This report makes me wonder if the problem is that the child process has no (live) threads in the threadpool.
@nateberkopec Nate, that can be the case, because our setup is single-threaded. The report link you posted is a translation of this github issue here: mileszs/wicked_pdf#891
I am not able to reproduce the machine stack overflow in a worker, but the following code can cause child workers to freeze for longer than the worker timeout:

```ruby
require('pty')

puts "slowing worker #{Process.pid}"
PTY.spawn("ruby -e 'sleep 100000'") do |r, w, pid|
  Process.wait pid
end
```

When spawning a new process to do heavy work, it's better to set a timeout. I fixed a similar process-hanging problem when using ImageMagick to resize images for a CDN service. With a timeout it looks like this:

```ruby
require('pty')
require('timeout')

puts "slowing worker #{Process.pid}"
PTY.spawn("ruby -e 'sleep 100000'") do |r, w, pid|
  begin
    Timeout::timeout(1) do
      Process.wait pid
    end
  rescue Timeout::Error
    puts "child process timeout"
    Process.kill "KILL", pid
  end
end
```

I put the timeout around the call that spawns the external process. Hope it's helpful.
@nateberkopec JSON.dump and a cyclic hash can cause this error. Being able to reproduce the exception, I looked at what happens inside the worker when it is raised. Here is the log:

I think the fix is to check whether the worker's thread pool still has live threads.
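Boiled down, the reproduction looks something like this (a sketch, assuming CRuby with the bundled json gem; the exact script may differ):

```ruby
require 'json'

# JSON.dump disables the nesting limit, so a hash that references itself
# recurses until the machine stack is exhausted.
a = {}
a[:a] = a

JSON.dump(a) # blows the stack; raised from inside a Puma worker thread this is
             # where the fatal "machine stack overflow in critical region" shows up
```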
Thanks. To clarify, this is a fatal error that can't be rescued at the Ruby level. EDIT: Found a place in the Ruby docs that states as much.
To close this issue, we need to find a way to detect this condition in a child worker, kill it, and start a new one.
@nateberkopec @calvinxiao Thanks a lot for this, guys; we'll update puma and report back. I didn't comment further because I currently don't know puma internals well enough to be helpful here.
Reopening, as I think @calvinxiao's PR was incorrectly marked as closing this issue. Printing a backtrace accurately is helpful, but I'm not sure it actually fixes the timeout / child processes not accepting traffic. Calvin, thoughts?
Near the top of the first post, it shows a Honeybadger log with the machine stack overflow error.
Unfortunately #2607 didn't fix it :( I upgraded puma in production, and today I got two of these exceptions again.

I decided to take a look at the puma stats. All other workers show a non-zero running count, while the workers that were hit report `running: 0` in their last_status.

The affected workers are still booted, but with zero threads they never pick up another request, so a booted worker reporting zero running threads is effectively dead.

Would it be a silly idea to have a periodic check in the master that replaces workers stuck at zero running threads?

BTW, here's the honeybadger stack trace in full:
@nateberkopec Could you please reopen?
(puma source reference: line 522 in 01ffec5)
Hi @feliperaul, it looks like you have the minimum thread count set to 0. Can you share your puma config?
Sure @calvinxiao: I pasted my entire puma config.
'running' may not be the best name. It is actually the number of threads created in the worker. So, if you have the minimum thread count set to zero, a worker can legitimately report `running: 0`.

EDIT: I've always wondered about the benefit of setting the minimum thread count to zero. With it set to zero, if the worker is needed, it has to create a thread...
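For illustration, a minimal puma.rb sketch (values are placeholders, not this app's actual config) where the minimum thread count is above zero, so a healthy worker should always report at least `running: 1`:

```ruby
# config/puma.rb - illustrative values only
workers 2          # forked worker processes
threads 1, 5       # min 1 / max 5 threads per worker; with a non-zero minimum,
                   # a healthy worker never reports running: 0
preload_app!

# control server, so stats can be queried externally with pumactl
activate_control_app 'tcp://127.0.0.1:9293', auth_token: 'myControlToken'
```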
@MSP-Greg got it. But then, what if puma checked whether a booted worker is stuck at zero running threads? Is there already a mechanism for this (workers with 0 threads being recovered) that is failing on this exception, or does this mechanism not exist yet?
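For checking this by hand from outside the workers, the control app can be queried with pumactl (a sketch; the control URL and token are the same placeholder values used in the restart command later in this thread, and it assumes the control app is enabled):

```ruby
require 'json'

# Ask the control app for stats via pumactl and pull out each worker's
# "running" thread count from its last_status.
output = `pumactl -C 'tcp://127.0.0.1:9293' -T myControlToken stats`
stats  = JSON.parse(output[output.index('{')..-1])

(stats['worker_status'] || []).each do |worker|
  puts "worker #{worker['pid']} running=#{worker.dig('last_status', 'running')}"
end
```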
I can reproduce the same logs from Honeybadger using a small app.rb. When I use wrk to stress test the local server, here are the suspicious logs I got:
I would suggest you give it a try. I can reproduce the hanging workers on WSL2 by these steps:

- Reopen the terminal

Not sure if Honeybadger will raise any exception. There is a Honeybadger plugin for Passenger; it's better to start/stop Honeybadger in puma's worker hooks.
Now that the issue is reproducible, what do you all think about the idea of checking periodically whether any booted worker is reporting zero running threads, and recycling it if so?
Woke up yesterday morning with the entire application down due to the same exception.

Took some time to go through Puma's source code. First time I did this, so bear with me :)

Puma seems to already have a bunch of checks for the health of workers and their corresponding thread pool, but I didn't find any check for the situation we're facing here (that is, booted workers with 0 threads even though requests keep arriving). So it seems that the existing checks don't cover this case.

Anyways, I coded a quick fix to at least achieve the desired effect for now. Maybe this can help any future googlers as well.

Here's my new before_fork block in puma.rb:

```ruby
before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
  require_relative '../../application/app/services/puma_stale_workers_monitor' # adjust this to match your folder structure
  PumaStaleWorkersMonitor.start
end
```

And in /app/services/puma_stale_workers_monitor.rb:

```ruby
require 'json'
require 'time'

class PumaStaleWorkersMonitor
  def self.start
    Monitor.new.start
  end

  class Monitor
    def start
      Thread.new do
        puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Starting"
        loop do
          puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Monitoring"
          stats = Puma.stats_hash
          if stats[:booted_workers].to_i > 0 &&
             stats.fetch(:worker_status).any? { |worker| worker.fetch(:last_status).has_key?(:running) && worker.fetch(:last_status).fetch(:running) == 0 }
            puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Restarting puma"
            puts JSON.pretty_generate(stats)
            `pumactl -C 'tcp://127.0.0.1:9293' -T myControlToken -F /var/www/app/shared/config/puma.rb restart`
            puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Ending"
            break
          end
          sleep 10
        end
      end
    end
  end
end
```
@feliperaul, I propose this temporary fix to your problem; see if it helps. It seems not all of the worker status fields are always present.
@calvinxiao Calvin, thanks, your solution is much simpler and doesn't require spawning a new Thread for constant monitoring. Today I checked my logs and my monitor. Regardless, I will change to your proposed solution and report back any news.

Thanks to all of you for helping me with this; I think future Googlers will find this issue helpful as well. Cheers.
Describe the bug
Every once in a while (approximately monthly), we are experiencing downtime due to puma freezing completely.
When I ssh into the server, I see all 25 workers are there, sitting idle with 0 CPU usage.
Investigating the logs, puma.stderr and puma.stdout give me nothing; but in `production.log` I can always trace back the last request served, and this is what I get in Honeybadger:

It seems that this exception might be caused by WickedPDF (and remotipart together?), see mileszs/wicked_pdf#891 and mileszs/wicked_pdf#810. However, despite being a long shot, I'm opening this issue here in Puma because maybe something could be done by Puma to prevent the entire worker pool grinding to a halt.

A simple `sudo systemctl restart puma.service` puts us back online.

After the incident, I can never reproduce the problem manually, not even by hitting the same URL that triggered the "machine stack overflow in critical region".
Puma config:
Puma version 5.0.4
To Reproduce
Unfortunately, it happens completely at random.
Desktop (please complete the following information):
Ubuntu 18.04.LTS
Puma version 5.0.4