Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In Ruby on Rails v5.2 and Puma v4, after restarting puma at the time of logrotate, unnecessary workers continue to remain #1891

Closed
dounokouno opened this issue Aug 7, 2019 · 10 comments

Comments

@dounokouno
Copy link

We are uses Ruby on Rails v5.2 and Puma v4, rotating the Rails application log with logrotate, and restarting puma when it is rotated.

We uses puma restart with pumactl phased-restart, but we used to send SIGUSR1 signal to restart it.

Specifically, it is the following code.

test -s $pid && kill -USR1 "$(cat /path/to/app/tmp/pids/puma.pid)"

Refs: https://github.com/puma/puma/blob/master/docs/restart.md

Although puma restarts without any problem for a while, after several restarts, unnecessary workers will be left as shown below. In the following, there are two worker 7s. Perhaps pid = 15241 seems to be an unnecessary worker.

$ ps aux | grep [p]uma
username     10096 22.8  3.0 3357088 1012940 ?     Sl   13:01   1:12 puma: cluster worker 0: 14749 [app]
username     10135 26.6  3.0 3562476 1016716 ?     Sl   13:01   1:22 puma: cluster worker 1: 14749 [app]
username     10196 17.9  3.4 3688000 1145328 ?     Sl   13:01   0:54 puma: cluster worker 2: 14749 [app]
username     10243 15.6  3.9 3753452 1287424 ?     Sl   13:01   0:46 puma: cluster worker 3: 14749 [app]
username     10325 16.8  3.6 3758280 1189004 ?     Sl   13:01   0:48 puma: cluster worker 4: 14749 [app]
username     10381 15.9  3.1 3423564 1034904 ?     Sl   13:01   0:45 puma: cluster worker 5: 14749 [app]
username     10420 16.6  3.2 3689844 1070112 ?     Sl   13:01   0:46 puma: cluster worker 6: 14749 [app]
username     10464 21.9  2.6 3155296 888148 ?      Sl   13:01   0:59 puma: cluster worker 7: 14749 [app]
username     14749  0.0  0.0 216884 20340 ?        Sl    7月19   0:56 puma 4.0.0 (tcp://0.0.0.0:9292) [app]
username     15241 60.7  9.2 5273704 3049272 ?     Sl   11:02  75:20 puma: cluster worker 7: 14749 [app]

My question is, I would like to know how to restart puma without leaving unnecessary workers.

Or I would like to know if there is a way to find out what each worker is doing to find out why this is the case.

Or, please tell me if there is a way to put in logs etc and check the situation after the fact.

The main usage version is as follows.

  • Ruby: 2.4.2
  • Ruby on Rails: 5.2.2.1
  • Puma: 4.0.0

As mentioned above, puma restarts are performed by pumactl phased-restart, but the situation was the same by sending SIGUSR1 signal.

@nateberkopec
Copy link
Member

Not a lot for us to go on or reproduce here. All I can say based on what you've provided is that we're working very actively on restart behavior lately, and to try 4.1 when it releases (very soon).

We need some sort of script or simpler scenario that reproduces the issue you're seeing.

Is the duplicated worker(s) always the last one (i.e., is it always worker #7, or sometimes #1 or #0?)

@dounokouno
Copy link
Author

@nateberkopec

Thank you for your reply. I'm very much looking forward to the release of v4.1.

Is the duplicated worker(s) always the last one (i.e., is it always worker #7, or sometimes #1 or #0?)

No, not always the last one worker. Sometimes it may be #0 or #1.

@nateberkopec
Copy link
Member

Tagging @MSP-Greg here as well because he's been looking so much at restarts lately, wondering if this triggers any thoughts

@MSP-Greg
Copy link
Member

MSP-Greg commented Aug 8, 2019

he's been looking so much at restarts lately

Too much.

'phased-restart' relies on the logic in Cluster#check_workers to call wait on all workers. restart uses the logic in Cluster#stop_workers With the current code in Cluster#check_workers, if a worker is SIGTERM'd, I beleive it will be respawned, but wait (or some of the logic so force it to exit) may not be called. Hence, what you're seeing may occur.

@dounokouno
Copy link
Author

I have confirmed that v4.1 has been released 🎉 I would like to use it with our service.

https://github.com/puma/puma/releases/tag/v4.1.0

@dounokouno
Copy link
Author

@MSP-Greg

Thank you for your explanation.

Is there a way to properly stop unnecessary puma worker(s) generated using phased-restart? Or is there any way to do a phased restart other than phased-restart? By the way, the unnecessary worker(s) generated by this issue remained for hours unless it was deleted with kill -9.

@MSP-Greg
Copy link
Member

MSP-Greg commented Aug 9, 2019

Is there a way to properly stop unnecessary puma worker(s) generated using phased-restart?

That's certainly how Puma should work.

remained for hours

Obviously, no one wants that happening.

Speaking only for myself, am I 100% certain that this issue has been resolved in 4.1.0? No. Something similar to #1892 may be needed.

In some respects, phased-restart and externally stopping workers are similar, in that they both involve 'cleaning up' and starting a new worker/workers.

But, the code paths are not identical, and there may be cases where 'misbehaving' workers are not handled correctly. There's also the issue of when should a new worker be created. Should it be created when another worker is 'told to stop', or when the other worker has actually exited?

Also, should the timers involved for 'external stop' versus 'phased-restart' be the same, or different?

Sorry for the brief rant, but it is messy, and testing for it is also messy. If there are still issues remaining, they will get quickly fixed...

@dounokouno
Copy link
Author

@MSP-Greg

Thank you for many replies.

First, I report that the symptoms did not improve after updating to v4.1.

Does "externally stopping" mean "send SIGUSR1 signal"?

Both "phased-restart" and "send SIGUSR1 signal" are executed with logrotate, and the rotation interval is 1 hour.

Is it possible to output a log of what each worker does when phased-restart is executed? I would like to be able to investigate the timing at which the worker continues without stopping by outputting logs such as stop command received, stop started, etc.

@nateberkopec
Copy link
Member

Closed by #1908.

@dounokouno
Copy link
Author

I will report.

After puma v4.1.1 was released, I immediately updated to v4.1.1. After that, there was no situation where unnecessary workers remained for about a month.

Thank you for your quick response and daily maintenance work. We will continue to use puma in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants