Child workers becoming zombies/unreachable after exception #2552

Closed

feliperaul opened this issue Feb 9, 2021 · 27 comments · Fixed by #2607

@feliperaul

feliperaul commented Feb 9, 2021

Describe the bug
Every once in a while (it's happening approximately monthly), we are experiencing downtime due to Puma freezing completely.

When I ssh into the server, I see all 25 workers are there, sitting idle with 0 CPU usage.

Investigating the logs, puma.stderr and puma.stdout give me nothing; but in production.log, I can always trace back the last request served, and this is what I get in Honeybadger:

fatal: machine stack overflow in critical region

Backtrace: middleware_tracing.rb > 99 > call(...)

 97 | events.notify(:before_call, env) if first_middleware
 98 |
 99 | result = (target == self) ? traced_call(env) : target.call(env)
100 |
101 | if first_middleware

It seems that this exception might be caused by WickedPDF (and remotipart together?) - see mileszs/wicked_pdf#891 and mileszs/wicked_pdf#810.

However, despite being a long shot, I'm opening this issue here in Puma because maybe something could be done by Puma to prevent the entire worker pool from grinding to a halt.

A simple sudo systemctl restart puma.service puts us back online.

After the incident, I can never reproduce the problem manually, not even hitting the same URL that triggered the "machine stack overflow in critical region".

Puma config:

Puma version 5.0.4

systemd service unit:

[Unit]
Description=Puma {{ param_app_name }} application server
After=network.target

# Socket activation
Requires=puma-{{ param_app_name }}.socket

[Service]
Type=simple

User={{ param_user_name }}

WorkingDirectory=/var/www/{{ param_app_name }}/application

ExecStart=/home/{{ param_user_name }}/.rbenv/shims/bundle exec --keep-file-descriptors puma -C /var/www/{{ param_app_name }}/shared/config/puma.rb

Restart=always

[Install]
WantedBy=multi-user.target

systemd socket unit:

[Unit]
Description=Puma HTTP Server Accept Sockets for {{ param_app_name }}

[Socket]
SocketUser={{ param_user_name }}
SocketGroup={{ param_user_name }}
ListenStream=/var/www/{{ param_app_name }}/shared/tmp/sockets/puma-{{ param_app_name }}.sock

# Socket options matching Puma defaults
NoDelay=true
ReusePort=true
Backlog=1024

[Install]
WantedBy=sockets.target

puma.rb:

shared_dir = "/var/www/{{ param_app_name }}/shared"

FileUtils.mkdir_p("#{shared_dir}/tmp/pids")

workers 24
threads 1, 1

bind "unix://#{shared_dir}/tmp/sockets/puma-{{ param_app_name }}.sock"

environment "production"

stdout_redirect "#{shared_dir}/log/puma.stdout.log", "#{shared_dir}/log/puma.stderr.log", true

pidfile "#{shared_dir}/tmp/pids/puma.pid"

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do

  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)

  if defined?(MessageBus)
    MessageBus.after_fork
  end

end

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end

To Reproduce
Unfortunately, it happens completely at random.

Desktop (please complete the following information):
Ubuntu 18.04 LTS
Puma version 5.0.4

@feliperaul
Author

Btw, we just upgraded puma to 5.2.1. I will report back if/when it happens again.

@dentarg
Member

dentarg commented Feb 9, 2021

@feliperaul I'm curious, what wicked_pdf version are you using?

(I'm doubtful that Puma can do something about the problem you are experiencing)

@nateberkopec
Member

Hm. I'm confused as to why an exception in one child worker would lead to all of the child processes stopping. It's kind of a weird part of the Ruby code; I'm not sure if that's really an exception or a segfault, and it appears to be neither: https://github.com/ruby/ruby/blob/98e27016c93455d4e9e208d0666d85929cb62857/vm_insnhelper.c#L91

@feliperaul
Author

> @feliperaul I'm curious, what wicked_pdf version are you using?
>
> (I'm doubtful that Puma can do something about the problem you are experiencing)

I was on version 2.1.0.

@nateberkopec

Looking again at the exception in Honeybadger, I had 24 occurrences of that exception yesterday, spread over 10 hours; I have exactly 24 workers, so this can't be a coincidence.

Analyzing the logs alongside, I can confirm that after the LAST exception we stopped processing all requests until I woke up this morning and restarted puma.

So what probably happened is that on every one of these exceptions, that specific child worker halted and stopped processing requests, so capacity kept shrinking during the day, but we only noticed when the last child worker was hit.

I know Puma is not causing this, but it would be great if this exception/segfault could somehow be rescued instead of freezing the worker, or at least if a signal could be sent to the master process to re-spawn that worker.

@nateberkopec
Member

Cool. Thanks.

Well, the child can't rescue its own segfault. It can rescue its own exceptions, but this type of thing can't be rescued. That worker is dead and must be killed and restarted.

I'll have to look at how our child process heartbeat code works again. I don't think it's unreasonable for a master process to try to kill and restart an unresponsive child.

@feliperaul
Author

@nateberkopec that would be awesome and bring even greater resiliency to Puma.

@nateberkopec
Member

This is the mechanism that is supposed to cull and then restart a hung worker.

It should work in this case, so I am marking this as a bug.
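
For anyone unfamiliar with that mechanism, here is a rough conceptual sketch of the cull-and-restart idea being discussed (this is not Puma's actual implementation; the Worker struct, the method name, and the 60-second timeout are illustrative only): each worker periodically checks in with the master, and the master terms any worker whose last checkin is older than the worker timeout so a replacement can be booted.

# Conceptual sketch only -- not Puma's source. Names and timeout value are hypothetical.
WORKER_TIMEOUT = 60 # seconds (illustrative)

Worker = Struct.new(:pid, :last_checkin)

def cull_unresponsive_workers(workers)
  now = Time.now
  workers.select { |w| now - w.last_checkin > WORKER_TIMEOUT }.each do |w|
    Process.kill(:TERM, w.pid) # a reap/respawn step would then boot a new worker
  end
end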

@nateberkopec
Member

Part of the problem with fixing this is going to be that I have no idea how to artificially trigger this exception.

@nateberkopec
Member

This report makes me wonder if the problem is that the child process has no (live) threads in the threadpool.

@feliperaul
Author

@nateberkopec Nate, that can be the case, because our setup is single-threaded (threads 1, 1).

The report link you posted is a translation of this github issue here: mileszs/wicked_pdf#891

@calvinxiao
Contributor

@feliperaul

I am not able to reproduce the machine stack overflow in a worker, but the following code can cause child workers to freeze for longer than the worker timeout:

require('pty')

puts "slowing worker #{Process.pid}"
PTY.spawn("ruby -e 'sleep 100000'") do |r, w, pid|
    Process.wait pid
end

When spawning a new process to do heavy work, it's better to set a timeout. I fixed a similar process hanging problem when using ImageMagick to resize images for a CDN service.

If wkhtmltopdf doesn't have a timeout option, you can do it with Timeout::timeout.

require('pty')
require('timeout')

puts "slowing worker #{Process.pid}"
PTY.spawn("ruby -e 'sleep 100000'") do |r, w, pid|
    begin
        Timeout::timeout(1) do 
            Process.wait pid
        end
    rescue Timeout::Error
        puts "child process timeout"
        Process.kill "KILL", pid
    end
end

I put the code just before @app.call(env) in request.rb to verify it.

Hope it's helpful.

@calvinxiao
Contributor

@nateberkopec JSON.dump on a cyclic hash can cause fatal: machine stack overflow in critical region.

Being able to reproduce the exception, I found that when e.class == "fatal" there is no #backtrace, and lowlevel_error assumes e.backtrace is an array.

Here is the log when fatal: machine stack overflow in critical region happens

2021-04-21 09:48:15 +0800 Rack app ("GET /" - (127.0.0.1)): #<fatal: machine stack overflow in critical region>
before client_error NoMethodError
Error reached top of thread-pool: undefined method `join' for nil:NilClass (NoMethodError)
["/home/eagle/github/my/puma/lib/puma/server.rb:530:in `lowlevel_error'", 
"/home/eagle/github/my/puma/lib/puma/server.rb:503:in `client_error'", 
"/home/eagle/github/my/puma/lib/puma/server.rb:471:in `rescue in process_client'", 
"/home/eagle/github/my/puma/lib/puma/server.rb:468:in `process_client'", 
"/home/eagle/github/my/puma/lib/puma/thread_pool.rb:145:in `block in spawn_thread'"]

I think the fix is to check if e.backtrace is nil.
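
For illustration, here is a minimal sketch of that kind of guard in plain Ruby (not the exact Puma code; the message variable is just a name for the example):

# Sketch of the proposed nil check: `fatal` errors raised by the VM can carry no
# backtrace, so fall back to an empty array before joining it into the log line.
backtrace = e.backtrace || []
message   = (["#{e.class}: #{e.message}"] + backtrace).join("\n")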

@MSP-Greg
Member

MSP-Greg commented Apr 21, 2021

@calvinxiao

Thanks. To clarify, e.backtrace is nil, as opposed to undefined?

EDIT: Found a place in the Ruby docs that states "In the case no backtrace has been set, nil is returned"...
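
A quick way to see that behaviour in plain Ruby (an exception object that was never raised has no backtrace set):

e = RuntimeError.new("boom") # constructed but never raised
e.backtrace                  # => nil, matching the docs quoted above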

nateberkopec changed the title from "All workers freeze after a fatal: machine stack overflow in critical region" to "Child workers becoming zombies/unreachable after exception" on Apr 23, 2021
@nateberkopec
Member

To close this issue, we need to find a way to detect this condition in a child worker, kill it, and start a new one.

@feliperaul
Author

@nateberkopec @calvinxiao Thanks a lot for this, guys; we'll update Puma and report back. I didn't comment further because I currently don't know Puma internals well enough to be helpful here.

nateberkopec reopened this Apr 23, 2021
@nateberkopec
Member

Reopening, as I think @calvinxiao incorrectly marked this PR as closing this issue. Printing a backtrace accurately is helpful, but I'm not sure it actually fixes the timeout/child processes not accepting traffic. Calvin, thoughts?

@MSP-Greg
Member

Near the top of the first post, it shows a Honeybadger log with

fatal: machine stack overflow in critical region

Server#lowlevel_error calls the handler (Honeybadger) specified by @options[:lowlevel_error_handler], then logs the backtrace. But with that error, backtrace is nil, so Puma was throwing an error. I think.

@feliperaul
Author

Unfortunately #2607 didn't fix it :(

I upgraded puma in production to gem 'puma', git: 'https://github.com/puma/puma', ref: 'a02f9241aef8a2e8814c745e84457ca0eabe5e8d', to have the #2607 fix in it.

Today I got two fatal: machine stack overflow in critical region exceptions in Honeybadger.

I decided to take a look at pumactl (something I hadn't done before) and, not coincidentally, there are exactly 2 workers whose running status is 0, like this:

{
      "started_at": "2021-04-25T16:54:11Z",
      "pid": 30222,
      "index": 22,
      "phase": 2,
      "booted": true,
      "last_checkin": "2021-04-27T03:26:31Z",
      "last_status": {
        "backlog": 0,
        "running": 0,
        "pool_capacity": 1,
        "max_threads": 1,
        "requests_count": 35443
      }
    },
{
      "started_at": "2021-04-25T16:54:02Z",
      "pid": 30093,
      "index": 19,
      "phase": 2,
      "booted": true,
      "last_checkin": "2021-04-27T03:26:32Z",
      "last_status": {
        "backlog": 0,
        "running": 0,
        "pool_capacity": 1,
        "max_threads": 1,
        "requests_count": 33101
      }
    },

All others show running: 1.

The pid indicated in the Honeybadger environment matches the pids of the workers running: 0.

So, if pumactl actually knows that these two workers are not running, I think we still have a bug if it's not recovering those two workers, right?

Would it be a silly idea to have a while true; (check all workers whose running is 0 and respawn them); sleep 1 sort of thing, even if only acting as a backup?

BTW, here's the honeybadger stack trace in full:

/var/www/app/shared/config/puma.rb:48 :in `block in _load_from`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/server.rb:522 :in `lowlevel_error`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/request.rb:98 :in `rescue in handle_request`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/request.rb:74 :in `handle_request`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/server.rb:433 :in `process_client`
[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/thread_pool.rb:145 :in `block in spawn_thread`
More (1 line)
Nested Exceptions
fatal: machine stack overflow in critical region
/var/www/app/shared/config/puma.rb:48 :in `block in _load_from`

@nateberkopec Could you please reopen?

nateberkopec reopened this Apr 27, 2021
@calvinxiao
Contributor

[GEM_ROOT]/bundler/gems/puma-a02f9241aef8/lib/puma/server.rb:522 :in lowlevel_error

return handler.call(e)

519    def lowlevel_error(e, env, status=500)
520      if handler = @options[:lowlevel_error_handler]
521        if handler.arity == 1
522          return handler.call(e)

Hi @feliperaul, it looks like you have a lowlevel_error_handler option; can you let me know how you define this option block?

@feliperaul
Author

Sure, @calvinxiao:

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  [500, {}, ["Error\n"]]
end

I pasted my entire puma.rb config file in the inaugural issue comment; you can check it in its entirety here: #2552 (comment)

@MSP-Greg
Member

MSP-Greg commented Apr 27, 2021

@feliperaul

'running' may not be the best name. It is actually the number of threads created in the worker. So, if you have min_threads set to zero, zero is a valid value for 'running'.

EDIT: I've always wondered about the benefit of setting min_threads to zero instead of one, especially with a high number of workers and max_threads set to one.

With it set to zero, if the worker is needed, it has to create a thread...

@feliperaul
Author

@MSP-Greg got it.

But then, what if Puma checked whether the running value is less than the min_threads defined in the config, and then tried to respawn that worker, so it doesn't sit there idling?

Is there already a mechanism for ensuring this (workers with 0 threads being recovered) that is failing on this exception, or does this mechanism not yet exist?

@calvinxiao
Contributor

I can reproduce the same logs from Honeybadger using the app.rb below, run with the command bundle exec bin/puma -w 24 -t 1:1 -C test/config/app.rb --fork-worker

require 'honeybadger'

port ENV['PORT'] if ENV['PORT']

app do |env|
  require 'json'
  c = {}
  c[:c] = c
  ::JSON.dump(c)
  [200, {}, ["embedded app"]]
end


lowlevel_error_handler do |err|
  Honeybadger.notify(err)
  [500, {}, ["Error\n"]]
end

When I use wrk to stress test the local server, here are the suspicious logs I got:

** [Honeybadger] Error report failed: an unknown error occurred. code=error error="HTTP Error: Net::OpenTimeout" level=2 pid=20171

[20090] - Stopping 20123 for phased upgrade...
[20090] - Worker 6 (PID: 20796) booted in 0.0s, phase: 1
[20090] - Stopping 20132 for phased upgrade...
[20090] - TERM sent to 20132...

I would suggest you try preload_app!; fork_worker is still experimental according to the docs. If worker 0 is unreachable, forking from it won't make sense.

I can reproduce the hanging workers on WSL2 with these steps:

  1. run bundle exec bin/puma -w 24 -t 1:1 -C test/config/app.rb --fork-worker
  2. close the terminal

After reopening the terminal, ps -ef shows only the puma cluster workers [0-23] and no main process; the server will not accept any requests.

Not sure if Honeybadger will raise any exceptions; try Honeybadger.notify(e) if defined?(Honeybadger) rescue nil if possible.

There is a Honeybadger plugin for Passenger; it may be better to start/stop Honeybadger in before_fork, on_worker_boot, and on_worker_shutdown.
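
As a sketch of the defensive handler suggested above (this assumes only Honeybadger.notify from the Honeybadger gem; the explicit begin/rescue is there so a failing notification can never raise inside the handler itself):

lowlevel_error_handler do |e|
  begin
    Honeybadger.notify(e) if defined?(Honeybadger)
  rescue StandardError
    # swallow reporter failures; never let error reporting take the worker down
  end
  [500, {}, ["Error\n"]]
end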

@feliperaul
Author

@calvinxiao

  • Calvin, my mistake: I just reviewed my puma.rb config and we haven't been using fork_worker for well over a year. We use Ansible to template that puma.rb file, and the fork_worker line was inside an if clause that we currently have set to false, so it was my mistake to paste it here. I just edited the inaugural comment to remove the mention of fork_worker so it doesn't mislead anyone looking at this.

  • Regarding preload_app!, we can't use it because it's incompatible with phased restarts, which we now rely on heavily (a hot restart was stopping all our workers for 10+ seconds during the restart; requests were sitting there waiting due to socket activation). On the other hand, phased restart is flawless and doesn't disturb our users in any way. Amazing feature.

  • I updated puma.rb to rescue nil the Honeybadger line, as you suggested; the Honeybadger gem doesn't need to be started or stopped, it just works. To be honest, we could even remove the lowlevel_error_handler from our Puma config.

Now that the issue is reproducible, what do you all think about the idea of periodically checking whether the number of running threads is less than the min_threads config and, if so, respawning that worker?

@feliperaul
Author

Woke up this morning with the entire application down due to the same mysterious fatal exception again. As you can see, the exceptions don't happen all at once, but within a few minutes there were enough of them to freeze our 24-worker pool:

[screenshot: Honeybadger occurrences of the fatal exception over a few minutes]

Took some time to go through Puma's source code. First time I've done this, so bear with me :)

Puma seems to already have a bunch of checks for the health of workers and their corresponding thread pools, but I didn't find any check for the situation we're facing here (booted workers with 0 threads even though threads 1, 1 was set, which means at least one thread should exist for each worker).

So it seems that the check_workers method or the reap method could be two suitable places to check for the present situation. (Maybe the alive? method being called on the worker in the reap method already tries to do this, but I really couldn't find the alive? method definition.)

Anyway, I coded a quick fix to at least achieve the desired effect for now. Maybe this can help future Googlers as well. Here's the relevant part of my new puma.rb file:

before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)

  require_relative '../../application/app/services/puma_stale_workers_monitor' # adjust this to match your folder structure

  PumaStaleWorkersMonitor.start

end

And in /app/services/puma_stale_workers_monitor.rb:

require 'json'
require 'time'

class PumaStaleWorkersMonitor
  def self.start
    Monitor.new.start
  end

  class Monitor
    def start
      Thread.new do
        puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Starting"

        loop do
          puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Monitoring"

          stats = Puma.stats_hash

          if stats[:booted_workers].to_i > 0 &&
             stats.fetch(:worker_status).any? { |worker| worker.fetch(:last_status).has_key?(:running) && worker.fetch(:last_status).fetch(:running) == 0 }

            puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Restarting puma"
            puts JSON.pretty_generate(stats)

            `pumactl -C 'tcp://127.0.0.1:9293' -T myControlToken -F /var/www/app/shared/config/puma.rb restart`

            puts "[PumaStaleWorkersMonitor::#{Time.now.to_s}] Ending"
            break
          end

          sleep 10
        end
      end
    end
  end
end

@calvinxiao
Contributor

calvinxiao commented May 6, 2021

@feliperaul, I propose this temporary fix for your problem; see if it helps.

It seems not all fatal errors are recoverable; it's better to kill the worker in your case.

lowlevel_error_handler do |e|
  Honeybadger.notify(e) if defined?(Honeybadger)
  if e.class.to_s == 'fatal'
    STDERR.puts "Worker[#{Process.pid}] encounters fatal error <#{e}>, sending SIGTERM to itself" 
    Process.kill('TERM', Process.pid)
  end

  [500, {}, ["Error\n"]]
end

@feliperaul
Author

@calvinxiao Calvin, thanks, your solution is much simpler and doesn't require spawning a new Thread for constant monitoring.

Today I checked my logs and my PumaStaleWorkersMonitor has saved us at least once from downtime. So, it's working!

Regardless, I will change to your proposed solution and report back any news.

Thanks to all you guys for helping me with this. I think future Googlers will find this issue helpful as well. Cheers.
