Request for help with 502 errors from Nginx - Puma worker rebooting #3193

ndbroadbent · 2023-07-11T05:16:30Z

ndbroadbent
Jul 11, 2023

Hello, I'm using puma 6.3.0. (Was previously on 5.6.4 but just upgraded, issue is still happening.)

One of my customers has let me know about some 502 error responses that they are receiving from one of our API endpoints. This endpoint handles the batch creation of records and can process significant amounts of JSON data and instantiates lots of records, so it's not too surprising that this is the one that's causing problems. (I don't think it's running out of memory though, since the pod memory utilization is hovering around 36%.)

Here is my puma config: https://gist.github.com/ndbroadbent/7cddae9176fb30e7b30380299306773b

I've dug into the CloudWatch logs and tried to piece together what is happening. Here's the timeline of logs for one of these 502 errors (sanitized):

2023-07-07T17:54:16.532Z - Rails - 201 response
- 17:54:16 web.1 | I, [2023-07-07T17:54:16.531827 #2102] INFO -- : [8cc3c77f579dc14843daf850d0026e9e] method=POST path=/api/v1/.../batch format=json controller=... action=batch_create status=201 duration=2090.14 view=2.12 db=201.16 request_id=8cc3c77f579dc14843daf850d0026e9e ... user_agent=axios/0.24.0 content_type=application/json accept=application/json
2023-07-07T17:54:16.537Z - Nginx
- [**error**] 309#309: *4020098 upstream prematurely closed connection while reading response header from upstream, client: 52.88.213.39, server: [...](http://...), request: "POST /api/v1/.../batch HTTP/1.1", upstream: "http://*.*.*.*:4001/api/v1/.../batch", host: "[...](http://.../)"
2023-07-07T17:54:16.576Z - Rails - Booted new Puma worker
- 17:54:16 web.1 | [19] - Worker 0 (PID: 2151) booted in 0.01s, phase: 0
2023-07-07T17:54:16.607Z - Nginx - Returns 502 error
- 52.88.213.39 - api_... [07/Jul/2023:17:54:16 +0000] "POST /api/v1/.../batch HTTP/1.1" 502 150 "-" "axios/0.24.0" 220981 2.107 [production-...-web-4001] [] 10.1.233.221:4001 0 1.720 502 3445b2c521f46c11c9705eccdd8d5d00

Here's what I think happened, starting from 17:54:16.532:

0ms - server returns a successful 201 response
+5ms - Nginx logs the error "upstream prematurely closed connection while reading response header from upstream". I think this is when the Puma worker crashed
+44ms - New puma worker is booted
+75ms - Nginx returns 502 response to the client

I don't see any other logs that might be helpful. I also looked at my metric dashboards and didn't see any unusual CPU or memory usage, and I can see that k8s didn't restart the pod or launch any new pods during this time.

I also wasn't aware of the lowlevel_error_handler option, so I've just set that up to report errors to Sentry. I've also just set up puma-cloudwatch to see if I can get more information about what puma is doing.

I'm just wondering why I don't see any errors or stack traces in the logs from puma (or even Ruby.) The only line I can see is:

17:54:16 web.1     | [19] - Worker 0 (PID: 2151) booted in 0.01s, phase: 0

Does puma log everything to stdout by default, including errors? (I checked the log directory in my web containers and I don't see any files.) Should I be seeing some kind of error or stack trace in here, or is it possible for puma to reboot the worker silently?
Does puma ever reboot workers on purpose? (e.g. if they start using too much memory?)
Do you have any other suggestions?

Thanks in advance for your help!

Answered by ndbroadbent

Jul 17, 2023

Thanks @MSP-Greg, your note about "phase: 0" was really helpful! I was able to figure out that the OS was killing the puma worker process since it was out of memory. I've increased the memory for my pods and have set up alarms and dashboards (and learned a lot about AWS CloudWatch logs and metrics!) Once I figured out the metric I should be looking at (pod_memory_utilization_over_pod_limit), it was clear that I wasn't giving it enough memory.

It must have crept up over time with new libraries, upgraded dependencies, etc.

I'm still a bit frustrated that I didn't see anything in the logs about this, and it looks like I have to set this up manually if I want more visibility. (Something spec…

View full answer

dentarg · 2023-07-11T07:59:45Z

dentarg
Jul 11, 2023
Maintainer

Can you also share your Puma config?

1 reply

ndbroadbent Jul 12, 2023
Author

Sure here is my puma config: https://gist.github.com/ndbroadbent/7cddae9176fb30e7b30380299306773b

Note: I'm calling the DSL a little bit differently to work around a problem with https://www.rubyencoder.com. (It crashes with the standard way of configuring puma.)

dentarg · 2023-07-11T08:00:39Z

dentarg
Jul 11, 2023
Maintainer

... action=batch_create status=201 duration=2090.14 ...

What unit is that duration value?

1 reply

ndbroadbent Jul 12, 2023
Author

That's in milliseconds

dentarg · 2023-07-11T08:02:37Z

dentarg
Jul 11, 2023
Maintainer

I think it would be useful to ensure you have request logging enabled in Puma while troubleshooting this https://github.com/puma/puma/blob/v5.6.4/lib/puma/dsl.rb#L364-L368

1 reply

ndbroadbent Jul 12, 2023
Author

Thanks for the tip! I will enable that option

ndbroadbent · 2023-07-13T12:46:05Z

ndbroadbent
Jul 13, 2023
Author

iI've upgraded to the latest puma version (6.3.0). I enabled request logging and set up lowlevel_error_handler to send errors to Sentry, but these haven't helped yet. Latest error from Nginx:

2023/07/13 12:30:10 [error] 31#31: *13655606 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: ...

This is the only corresponding line in my Rails app logs:

12:30:10 web.1     | [38] - Worker 0 (PID: 298) booted in 0.01s, phase: 0

There's no other error or stack trace telling me why the puma worker booted again.

10 minutes later I did see this error reported to Sentry:

Puma::HttpParserError - Invalid HTTP format, parsing fails. Are you trying to open an SSL connection to a non-SSL Puma?

I don't think that's related though, they're too far apart.

@dentarg Are you aware of any way that a puma worker can be rebooted silently without logging anything?

0 replies

ndbroadbent · 2023-07-13T13:16:45Z

ndbroadbent
Jul 13, 2023
Author

I found this post on StackOverflow where people are reporting the same issue: https://stackoverflow.com/questions/70337289/how-to-find-the-root-cause-of-spontaneous-restarts-of-puma-worker

They are also running puma in a k8s cluster, so maybe this has something to do with it. Something to do with memory/CPU utilization or k8s doing something to manage the processes running in the pod? I'm not too sure if this could be related

1 reply

dentarg Jul 13, 2023
Maintainer

Is it something that kills the Puma worker? Can you check your system logs? Maybe OOM

MSP-Greg · 2023-07-13T18:14:27Z

MSP-Greg
Jul 13, 2023
Maintainer

@ndbroadbent

From above:

12:30:10 web.1     | [38] - Worker 0 (PID: 298) booted in 0.01s, phase: 0

There's no other error or stack trace telling me why the puma worker booted again.

What's confusing is that the log is showing phase: 0, which normally means it's not 'booted again', but the first boot (except if a hard restart has been done).

I'm not sure if this might mean the app isn't yet properly initialized, and the worker has started its listen loop? Not.sure.

Can you confirm how many workers are used? Also, do the other workers show any log entries?

0 replies

ndbroadbent · 2023-07-17T15:27:12Z

ndbroadbent
Jul 17, 2023
Author

Thanks @MSP-Greg, your note about "phase: 0" was really helpful! I was able to figure out that the OS was killing the puma worker process since it was out of memory. I've increased the memory for my pods and have set up alarms and dashboards (and learned a lot about AWS CloudWatch logs and metrics!) Once I figured out the metric I should be looking at (pod_memory_utilization_over_pod_limit), it was clear that I wasn't giving it enough memory.

It must have crept up over time with new libraries, upgraded dependencies, etc.

I'm still a bit frustrated that I didn't see anything in the logs about this, and it looks like I have to set this up manually if I want more visibility. (Something special I need to do to get syslog running and reporting OOM errors in the logs.) But that's not a problem with puma. Thanks for your help!

2 replies

MSP-Greg Jul 17, 2023
Maintainer

Glad that helped.

I'm still a bit frustrated that I didn't see anything in the logs about this

We can probably have a look at better logging when a worker is 'externally' shutdown. We may be able to log it from the master process. Not sure, as OOM may also affect the master process...

iox Mar 14, 2024

Thanks for this! We had the exact same issue

ndbroadbent · 2023-07-26T00:39:54Z

ndbroadbent
Jul 26, 2023
Author

Hi @MSP-Greg, I've been spending some more time on performance and have been trying to figure out why memory usage is steadily increasing for my puma workers. I've been able to solve my 502 error problem for now by increasing the available memory and fixing a few performance issues in the app, and haven't had any workers killed due to OOM errors. But I still want to get my memory usage stable and not increase over time. So I thought I would just reopen this discussion to share some more of my findings.

I should also mention that I'm on Ruby 2.7, and we are currently working on a Ruby 3.x upgrade, so these memory issues might be improved or solved on Ruby 3.x.

I've been tracking ObjectSpace.count_objects as CloudWatch metrics, and found something suprising:

ObjectSpace.count_objects[:TOTAL] stays steady at 2,502,209 total objects, but my memory usage keeps creeping up slowly. These jumps are caused by a series of ~50 API requests that a customer is making every 30 minutes.

So now I'm thinking that the problem is actually memory fragmentation, and not a memory leak.

I've read a few really interesting articles along the way:

It looks like I might need to regularly call GC.compact to keep my memory usage stable. I saw that puma calls this once before forking, but I would like to periodically call it after requests have completed. A few questions:

Does puma call GC.compact anywhere else?
Does puma or Rails have a callback where I can call GC.compact after a request has finished being sent to the client, but before accepting the next request?
What do you think about the idea of (ab)using my health check endpoint (/health) to call GC.compact after every X requests (maybe 100?) I was thinking it could be good to put it in here since it's only used internally and wouldn't slow down any customer requests.

Do you have any other advice or suggestions for calling GC.compact regularly from puma workers, or generally from a Rails app? Thanks!

0 replies

nateberkopec · 2023-07-26T06:48:23Z

nateberkopec
Jul 26, 2023
Maintainer

Try switching to jemalloc instead. GC.compact hasn't really shown a lot of real world results for me.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request for help with 502 errors from Nginx - Puma worker rebooting #3193

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 9 comments 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Request for help with 502 errors from Nginx - Puma worker rebooting #3193

ndbroadbent Jul 11, 2023

Replies: 9 comments · 6 replies

dentarg Jul 11, 2023 Maintainer

ndbroadbent Jul 12, 2023 Author

dentarg Jul 11, 2023 Maintainer

ndbroadbent Jul 12, 2023 Author

dentarg Jul 11, 2023 Maintainer

ndbroadbent Jul 12, 2023 Author

ndbroadbent Jul 13, 2023 Author

ndbroadbent Jul 13, 2023 Author

dentarg Jul 13, 2023 Maintainer

MSP-Greg Jul 13, 2023 Maintainer

ndbroadbent Jul 17, 2023 Author

MSP-Greg Jul 17, 2023 Maintainer

iox Mar 14, 2024

ndbroadbent Jul 26, 2023 Author

nateberkopec Jul 26, 2023 Maintainer

ndbroadbent
Jul 11, 2023

Replies: 9 comments 6 replies

dentarg
Jul 11, 2023
Maintainer

ndbroadbent Jul 12, 2023
Author

dentarg
Jul 11, 2023
Maintainer

ndbroadbent Jul 12, 2023
Author

dentarg
Jul 11, 2023
Maintainer

ndbroadbent Jul 12, 2023
Author

ndbroadbent
Jul 13, 2023
Author

ndbroadbent
Jul 13, 2023
Author

dentarg Jul 13, 2023
Maintainer

MSP-Greg
Jul 13, 2023
Maintainer

ndbroadbent
Jul 17, 2023
Author

MSP-Greg Jul 17, 2023
Maintainer

ndbroadbent
Jul 26, 2023
Author

nateberkopec
Jul 26, 2023
Maintainer