Child workers becoming zombies/unreachable after exception #2552

Closed

feliperaul opened this issue Feb 9, 2021 · 27 comments · Fixed by #2607

@feliperaul

feliperaul commented Feb 9, 2021

Describe the bug
Every once in a while (it's happening approximately monthly), we are experiencing downtime due to Puma freezing completely.

When I ssh into the server, I see all 25 workers are there, sitting idle with 0 CPU usage.

Investigating the logs, puma.stderr and puma.stdout give me nothing; but in production.log, I can always trace back the last request served, and this is what I get in Honeybadger:

fatal: machine stack overflow in critical region

Backtrace: middleware_tracing.rb > 99 > call(...)

 97 | events.notify(:before_call, env) if first_middleware
 98 |
 99 | result = (target == self) ? traced_call(env) : target.call(env)
100 |
101 | if first_middleware

It seems that this exception might be caused by WickedPDF (and remotipart together?) - see mileszs/wicked_pdf#891 and mileszs/wicked_pdf#810.

However, despite being a long shot, I'm opening this issue here in Puma because maybe something could be done by Puma to prevent the entire worker pool from grinding to a halt.

A simple sudo systemctl restart puma.service puts us back online.

After the incident, I can never reproduce the problem manually, not even hitting the same URL that triggered the "machine stack overflow in critical region".

Puma config:

Puma version 5.0.4

systemd service unit:

[Unit]
Description=Puma {{ param_app_name }} application server
After=network.target

# Socket activation
Requires=puma-{{ param_app_name }}.socket

[Service]
Type=simple

User={{ param_user_name }}

WorkingDirectory=/var/www/{{ param_app_name }}/application

ExecStart=/home/{{ param_user_name }}/.rbenv/shims/bundle exec --keep-file-descriptors puma -C /var/www/{{ param_app_name }}/shared/config/puma.rb

Restart=always

[Install]
WantedBy=multi-user.target

systemd socket unit:

[Unit]
Description=Puma HTTP Server Accept Sockets for {{ param_app_name }}

[Socket]
SocketUser={{ param_user_name }}
SocketGroup={{ param_user_name }}
ListenStream=/var/www/{{ param_app_name }}/shared/tmp/sockets/puma-{{ param_app_name }}.sock

# Socket options matching Puma defaults
NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target

puma.rb:

shared_dir = "/var/www/{{ param_app_name }}/shared"

FileUtils.mkdir_p("#{shared_dir}/tmp/pids")

workers 24
threads 1, 1

bind "unix://#{shared_dir}/tmp/sockets/puma-{{ param_app_name }}.sock"

environment "production"

stdout_redirect "#{shared_dir}/log/puma.stdout.log", "#{shared_dir}/log/puma.stderr.log", true

pidfile "#{shared_dir}/tmp/pids/puma.pid"

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do

  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)

  if defined?(MessageBus)
    MessageBus.after_fork
  end

end

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end

To Reproduce
Unfortunately, it happens completely at random.

Desktop (please complete the following information):
Ubuntu 18.04 LTS
Puma version 5.0.4

@feliperaul
Author

Btw, we just upgraded puma to 5.2.1. I will report back if/when it happens again.

@dentarg
Member

dentarg commented Feb 9, 2021

@feliperaul I'm curious, what wicked_pdf version are you using?

(I'm doubtful that Puma can do something about the problem you are experiencing)

@nateberkopec
Member

Hm. I'm confused as to why an exception in one child worker would lead to all of the child processes stopping. It's kind of a weird part of the Ruby code; I'm not sure if that's really an exception or a segfault, and it appears to be neither: https://github.com/ruby/ruby/blob/98e27016c93455d4e9e208d0666d85929cb62857/vm_insnhelper.c#L91

@feliperaul
Author

> @feliperaul I'm curious, what wicked_pdf version are you using?
>
> (I'm doubtful that Puma can do something about the problem you are experiencing)

I was on version 2.1.0.

@nateberkopec

Looking again at the exception in Honeybadger, I had 24 occurrences of that exception yesterday, spread over 10 hours; I have exactly 24 workers, so this can't be a coincidence.

Analyzing the logs alongside, I can confirm that after the LAST exception we stopped processing all requests until I woke up this morning and restarted puma.

So what probably happened is that on every one of these exceptions, that specific child worker halted and stopped processing requests, so capacity kept shrinking during the day, but we only noticed when the last child worker was hit.

I know Puma is not causing this, but it would be great if this exception/segfault could somehow be rescued instead of freezing the worker, or at least if a signal could be sent to the master process to re-spawn that worker.

@nateberkopec
Member

Cool. Thanks.

Well, the child can't rescue its own segfault. It can rescue its own exceptions, but this type of thing can't be rescued. That worker is dead and must be killed and restarted.

I'll have to look at how our child process heartbeat code works again. I don't think it's unreasonable for a master process to try to kill and restart an unresponsive child.

@feliperaul
Author

@nateberkopec that would be awesome and bring even greater resiliency to Puma.

@nateberkopec
Member

This is the mechanism that is supposed to cull and then restart a hung worker.

It should work in this case, so I am marking this as a bug.
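
For anyone unfamiliar with that mechanism, here is a rough conceptual sketch of the cull-and-restart idea being discussed (this is not Puma's actual implementation; the Worker struct, the method name, and the 60-second timeout are illustrative only): each worker periodically checks in with the master, and the master terms any worker whose last checkin is older than the worker timeout so a replacement can be booted.

# Conceptual sketch only -- not Puma's source. Names and timeout value are hypothetical.
WORKER_TIMEOUT = 60 # seconds (illustrative)

Worker = Struct.new(:pid, :last_checkin)

def cull_unresponsive_workers(workers)
  now = Time.now
  workers.select { |w| now - w.last_checkin > WORKER_TIMEOUT }.each do |w|
    Process.kill(:TERM, w.pid) # a reap/respawn step would then boot a new worker
  end
end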

@nateberkopec
Member

Part of the problem with fixing this is going to be that I have no idea how to artificially trigger this exception.

@nateberkopec
Member

This report makes me wonder if the problem is that the child process has no (live) threads in the threadpool.

@feliperaul
Author

@nateberkopec Nate, that can be the case, because our setup is single-threaded (threads 1, 1).

The report link you posted is a translation of this github issue here: mileszs/wicked_pdf#891

@calvinxiao
Contributor

@feliperaul

I am not able to reproduce the machine stack overflow in a worker, but the following code can cause child workers to freeze for longer than the worker timeout:

require('pty')

puts "slowing worker #{Process.pid}"
PTY.spawn("ruby -e 'sleep 100000'") do |r, w, pid|
    Process.wait pid
end

When spawning a new process to do heavy work, it's better to set a timeout. I fixed a similar process hanging problem when using ImageMagick to resize images for a CDN service.

If wkhtmltopdf doesn't have a timeout option, you can do it with Timeout::timeout.

require('pty')
require('timeout')

puts "slowing worker #{Process.pid}"
PTY.spawn("ruby -e 'sleep 100000'") do |r, w, pid|
    begin
        Timeout::timeout(1) do 
            Process.wait pid
        end
    rescue Timeout::Error
        puts "child process timeout"
        Process.kill "KILL", pid
    end
end

I put the code just before @app.call(env) in request.rb to verify it.

Hope it's helpful.

@calvinxiao
Contributor

@nateberkopec JSON.dump on a cyclic hash can cause fatal: machine stack overflow in critical region.

Being able to reproduce the exception, I found that when e.class == "fatal" there is no #backtrace, and lowlevel_error assumes e.backtrace is an array.

Here is the log when fatal: machine stack overflow in critical region happens

2021-04-21 09:48:15 +0800 Rack app ("GET /" - (127.0.0.1)): #<fatal: machine stack overflow in critical region>
before client_error NoMethodError
Error reached top of thread-pool: undefined method `join' for nil:NilClass (NoMethodError)
["/home/eagle/github/my/puma/lib/puma/server.rb:530:in `lowlevel_error'", 
"/home/eagle/github/my/puma/lib/puma/server.rb:503:in `client_error'", 
"/home/eagle/github/my/puma/lib/puma/server.rb:471:in `rescue in process_client'", 
"/home/eagle/github/my/puma/lib/puma/server.rb:468:in `process_client'", 
"/home/eagle/github/my/puma/lib/puma/thread_pool.rb:145:in `block in spawn_thread'"]

I think the fix is to check if e.backtrace is nil.
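
For illustration, here is a minimal sketch of that kind of guard in plain Ruby (not the exact Puma code; the message variable is just a name for the example):

# Sketch of the proposed nil check: `fatal` errors raised by the VM can carry no
# backtrace, so fall back to an empty array before joining it into the log line.
backtrace = e.backtrace || []
message   = (["#{e.class}: #{e.message}"] + backtrace).join("\n")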

@MSP-Greg
Member

MSP-Greg commented Apr 21, 2021

@calvinxiao

Thanks. To clarify, e.backtrace is nil, as opposed to undefined?

EDIT: Found a place in the Ruby docs that states "In the case no backtrace has been set, nil is returned"...
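
A quick way to see that behaviour in plain Ruby (an exception object that was never raised has no backtrace set):

e = RuntimeError.new("boom") # constructed but never raised
e.backtrace                  # => nil, matching the docs quoted above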

nateberkopec changed the title from "All workers freeze after a fatal: machine stack overflow in critical region" to "Child workers becoming zombies/unreachable after exception" on Apr 23, 2021
@nateberkopec
Member

To close this issue, we need to find a way to detect this condition in a child worker, kill it, and start a new one.

@feliperaul
Author

@nateberkopec @calvinxiao Thanks a lot for this, guys; we'll update Puma and report back. I didn't comment further because I currently don't know Puma internals well enough to be helpful here.

nateberkopec reopened this Apr 23, 2021
@nateberkopec
Member

Reopening, as I think @calvinxiao incorrectly marked this PR as closing this issue. Printing a backtrace accurately is helpful, but I'm not sure it actually fixes the timeout/child processes not accepting traffic. Calvin, thoughts?

@MSP-Greg
Member

Near the top of the first post, it shows a Honeybadger log with

fatal: machine stack overflow in critical region

Server#lowlevel_error calls the handler (Honeybadger) specified by @options[:lowlevel_error_handler], then logs the backtrace. But with that error, backtrace is nil, so Puma was throwing an error. I think.

@feliperaul
Author

Unfortunately #2607 didn't fix it :(

I upgraded puma in production to gem 'puma', git: 'https://github.com/puma/puma', ref: 'a02f9241aef8a2e8814c745e84457ca0eabe5e8d', to have the #2607 fix in it.

Today I got two fatal: machine stack overflow in critical region exceptions in Honeybadger.

I decided to take a look at pumactl (something I hadn't done before) and, not coincidentally, there are exactly 2 workers whose running status is 0, like this:

{
      "started_at": "2021-04-25T16:54:11Z",
      "pid": 30222,
      "index": 22,
      "phase": 2,
      "booted": true,
      "last_checkin": "2021-04-27T03:26:31Z",
      "last_status": {
        "backlog": 0,
        "running": 0,
        "pool_capacity": 1,
        "max_threads": 1,
        "requests_count": 35443
      }
    },
{
      "started_at": "2021-04-25T16:54:02Z",
      "pid": 30093,
      "index": 19,
      "phase": 2,
      "booted": true,
      "last_checkin": "2021-04-27T03:26:32Z",
      "last_status": {
        "backlog": 0,
        "running": 0,
        "pool_capacity": 1,
        "max_threads": 1,
        "requests_count": 33101
      }
    },

All others show running: 1.

The pid indicated in the Honeybadger environment matches the pids of the workers running: 0.

So, if pumactl actually knows that these two workers are not running, I think we still have a bug if it's not recovering those two workers, right?

Would it be a silly idea to have a while true; (check all workers whose running is 0 and respawn them); sleep 1 sort of thing, even if only acting as a backup?

BTW, here's the honeybadger stack trace in full:

/var/www/app/shared/config/puma.rb:48 :in `block in _load_from`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/server.rb:522 :in `lowlevel_error`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/request.rb:98 :in `rescue in handle_request`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/request.rb:74 :in `handle_request`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/server.rb:433 :in `process_client`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/thread_pool.rb:145 :in `block in spawn_thread`
More (1 line)
Nested Exceptions
fatal: machine stack overflow in critical region
/var/www/app/shared/config/puma.rb:48 :in `block in _load_from`

@nateberkopec Could you please reopen?

nateberkopec reopened this Apr 27, 2021
@calvinxiao
Contributor

[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/server.rb:522 :in lowlevel_error

return handler.call(e)

519    def lowlevel_error(e, env, status=500)
520      if handler = @options[:lowlevel_error_handler]
521        if handler.arity == 1
522          return handler.call(e)

Hi @feliperaul, it looks like you have a lowlevel_error_handler option; can you let me know how you define this option block?

@feliperaul
Author

Sure, @calvinxiao:

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end

I pasted my entire puma.rb config file in the inaugural issue comment; you can check it in its entirety here: #2552 (comment)

@MSP-Greg
Member

MSP-Greg commented Apr 27, 2021

@feliperaul

'running' may not be the best name. It is actually the number of threads created in the worker. So, if you have min_threads set to zero, zero is a valid value for 'running'.

EDIT: I've always wondered about the benefit of setting min_threads to zero instead of one, especially with a high number of workers and max_threads set to one.

With it set to zero, if the worker is needed, it has to create a thread...

@feliperaul
Author

@MSP-Greg got it.

But then, what if Puma checked whether the running value is less than the min_threads defined in the config, and then tried to respawn that worker, so it doesn't sit there idling?

Is there already a mechanism for ensuring this (workers with 0 threads being recovered) that is failing on this exception, or does this mechanism not yet exist?

@calvinxiao
Contributor

I can reproduce the same logs from Honeybadger using the app.rb below, run with the command bundle exec bin/puma -w 24 -t 1:1 -C test/config/app.rb --fork-worker

require 'honeybadger'

port ENV['PORT'] if ENV['PORT']

app do |env|
  require 'json'
  c = {}
  c[:c] = c
  ::JSON.dump(c)
  [200, {}, ["embedded app"]]
end


lowlevel_error_handler do |err|
  Honeybadger.notify(err)
  [500, {}, ["Error\n"]]
end

When I use wrk to stress test the local server, here are the suspicious logs I got:

** [Honeybadger] Error report failed: an unknown error occurred. code=error error="HTTP Error: Net::OpenTimeout" level=2 pid=20171

[20090] - Stopping 20123 for phased upgrade...
[20090] - Worker 6 (PID: 20796) booted in 0.0s, phase: 1
[20090] - Stopping 20132 for phased upgrade...
[20090] - TERM sent to 20132...

I would suggest you try preload_app!; fork_worker is still experimental according to the docs. If worker 0 is unreachable, forking from it won't make sense.

I can reproduce the hanging workers on WSL2 with these steps:

  1. run bundle exec bin/puma -w 24 -t 1:1 -C test/config/app.rb --fork-worker
  2. close the terminal

After reopening the terminal, ps -ef shows only the puma cluster workers [0-23] and no main process; the server will not accept any requests.

Not sure if Honeybadger will raise any exceptions; try Honeybadger.notify(e) if defined?(Honeybadger) rescue nil if possible.

There is a Honeybadger plugin for Passenger; it may be better to start/stop Honeybadger in before_fork, on_worker_boot, and on_worker_shutdown.
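
As a sketch of the defensive handler suggested above (this assumes only Honeybadger.notify from the Honeybadger gem; the explicit begin/rescue is there so a failing notification can never raise inside the handler itself):

lowlevel_error_handler do |e|
  begin
    Honeybadger.notify(e) if defined?(Honeybadger)
  rescue StandardError
    # swallow reporter failures; never let error reporting take the worker down
  end
  [500, {}, ["Error\n"]]
end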

@feliperaul
Author

@calvinxiao

  • Calvin, my mistake: I just reviewed my puma.rb config and we haven't been using fork_worker for well over a year. We use Ansible to template that puma.rb file, and the fork_worker line was inside an if clause that we currently have set to false, so it was my mistake to paste it here. I just edited the inaugural comment to remove the mention of fork_worker so it doesn't mislead anyone looking at this.

  • Regarding preload_app!, we can't use it because it's incompatible with phased restarts, which we now rely on heavily (a hot restart was stopping all our workers for 10+ seconds during the restart; requests were sitting there waiting due to socket activation). On the other hand, phased restart is flawless and doesn't disturb our users in any way. Amazing feature.

  • I updated puma.rb to rescue nil the Honeybadger line, as you suggested; the Honeybadger gem doesn't need to be started or stopped, it just works. To be honest, we could even remove the lowlevel_error_handler from our Puma config.

Now that the issue is reproducible, what do you all think about the idea of periodically checking whether the number of running threads is less than the min_threads config and, if so, respawning that worker?

@feliperaul
Author

Woke up this morning with the entire application down due to the same mysterious fatal exception again. As you can see, the exceptions don't happen all at once, but within a few minutes there were enough of them to freeze our 24-worker pool:

[screenshot: Honeybadger occurrences of the fatal exception over a few minutes]

Took some time to go through Puma's source code. First time I've done this, so bear with me :)

Puma seems to already have a bunch of checks for the health of workers and their corresponding thread pools, but I didn't find any check for the situation we're facing here (booted workers with 0 threads even though threads 1, 1 was set, which means at least one thread should exist for each worker).

So it seems that the check_workers method or the reap method could be two suitable places to check for the present situation. (Maybe the alive? method being called on the worker in the reap method already tries to do this, but I really couldn't find the alive? method definition.)

Anyway, I coded a quick fix to at least achieve the desired effect for now. Maybe this can help future Googlers as well. Here's the relevant part of my new puma.rb file:

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)

  require_relative '../../application/app/services/puma_stale_workers_monitor' # adjust this to match your folder structure

  PumaStaleWorkersMonitor.start

end

And in /app/services/puma_stale_workers_monitor.rb:

require 'json'
require 'time'

class PumaStaleWorkersMonitor
  def self.start
    Monitor.new.start
  end

  class Monitor
    def start
      Thread.new do
        puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Starting"

        loop do
          puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Monitoring"

          stats = Puma.stats_hash

          if stats[:booted_workers].to_i > 0 &&
             stats.fetch(:worker_status).any? { |worker| worker.fetch(:last_status).has_key?(:running) && worker.fetch(:last_status).fetch(:running) == 0 }

            puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Restarting puma"
            puts JSON.pretty_generate(stats)

            `pumactl -C 'tcp://127.0.0.1:9293' -T myControlToken -F /var/www/app/shared/config/puma.rb restart`

            puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Ending"
            break
          end

          sleep 10
        end
      end
    end
  end
end

@calvinxiao
Contributor

calvinxiao commented May 6, 2021

@feliperaul, I propose this temporary fix for your problem; see if it helps.

It seems not all fatal errors are recoverable; it's better to kill the worker in your case.

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  if e.class.to_s == 'fatal'
    STDERR.puts "Worker[#{Process.pid}] encounters fatal error <#{e}>, sending SIGTERM to itself" 
    Process.kill('TERM', Process.pid)
  end

  [500, {}, ["Error\n"]]
end

@feliperaul
Author

@calvinxiao Calvin, thanks, your solution is much simpler and doesn't require spawning a new Thread for constant monitoring.

Today I checked my logs and my PumaStaleWorkersMonitor has saved us at least once from downtime. So, it's working!

Regardless, I will change to your proposed solution and report back any news.

Thanks to all you guys for helping me with this. I think future Googlers will find this issue helpful as well. Cheers.
