Request for help with 502 errors from Nginx - Puma worker rebooting #3193
-
Hello, I'm using puma 6.3.0. (I was previously on 5.6.4 but just upgraded; the issue is still happening.) One of my customers has let me know about some 502 error responses that they are receiving from one of our API endpoints. This endpoint handles the batch creation of records: it processes significant amounts of JSON data and instantiates lots of records, so it's not too surprising that this is the one causing problems. (I don't think it's running out of memory though, since the pod memory utilization is hovering around 36%.)

Here is my puma config: https://gist.github.com/ndbroadbent/7cddae9176fb30e7b30380299306773b

I've dug into the CloudWatch logs and tried to piece together what is happening. Here's the timeline of logs for one of these 502 errors (sanitized):
Here's what I think happened, starting from 17:54:16.532:
I don't see any other logs that might be helpful. I also looked at my metric dashboards and didn't see any unusual CPU or memory usage, and I can see that k8s didn't restart the pod or launch any new pods during this time. I'm just wondering why I don't see any errors or stack traces in the logs from puma (or even Ruby). The only line I can see is:
Thanks in advance for your help!
-
Can you also share your Puma config?
-
What unit is that?
-
I think it would be useful to ensure you have request logging enabled in Puma while troubleshooting this: https://github.com/puma/puma/blob/v5.6.4/lib/puma/dsl.rb#L364-L368
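For reference, enabling it is a one-line setting in the Puma config file (a minimal sketch; the comment is mine, not something from your gist):

```ruby
# config/puma.rb
# Log each request as Puma handles it. While chasing 502s this tells you
# whether the failing request ever reached this Puma process at all.
log_requests true
```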
-
I've upgraded to the latest puma version (6.3.0). I enabled request logging and set up
This is the only corresponding line in my Rails app logs:
There's no other error or stack trace telling me why the puma worker booted again. 10 minutes later I did see this error reported to Sentry:
I don't think that's related though, they're too far apart. @dentarg Are you aware of any way that a puma worker can be rebooted silently without logging anything?
-
I found this post on StackOverflow where people are reporting the same issue: https://stackoverflow.com/questions/70337289/how-to-find-the-root-cause-of-spontaneous-restarts-of-puma-worker

They are also running puma in a k8s cluster, so maybe that has something to do with it. Perhaps it's related to memory/CPU utilization, or to k8s doing something to manage the processes running in the pod? I'm not too sure if this is related.
-
From above:
What's confusing is that the log is showing "phase: 0". I'm not sure if this might mean the app isn't yet properly initialized and the worker has started its listen loop? Not sure. Can you confirm how many workers are used? Also, do the other workers show any log entries?
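If it helps, one way to answer the worker-count question from the running process (a sketch; the socket path and token below are placeholders, not something from your config) is to enable Puma's control app and query its stats, which in cluster mode report each worker's pid, phase, booted flag, and last checkin:

```ruby
# config/puma.rb
# Expose Puma's control/status app on a local unix socket so worker state
# can be inspected while the 502s are happening, e.g.:
#   bundle exec pumactl --control-url unix:///tmp/puma_ctl.sock \
#                       --control-token some-token stats
activate_control_app "unix:///tmp/puma_ctl.sock", auth_token: "some-token"
```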
-
Thanks @MSP-Greg, your note about "phase: 0" was really helpful! I was able to figure out that the OS was killing the puma worker process because it was out of memory. I've increased the memory for my pods and have set up alarms and dashboards (and learned a lot about AWS CloudWatch logs and metrics!) Once I figured out the metric I should be looking at (pod_memory_utilization_over_pod_limit), it was clear that I wasn't giving it enough memory. It must have crept up over time with new libraries, upgraded dependencies, etc.

I'm still a bit frustrated that I didn't see anything in the logs about this, and it looks like I have to set this up manually if I want more visibility. (There seems to be something special I need to do to get syslog running and reporting OOM errors in the logs.) But that's not a problem with puma. Thanks for your help!
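In case it helps anyone else who ends up here: the extra visibility I've been adding is roughly the following per-worker memory logging in my Puma config (just a sketch; the interval and log format are arbitrary, and it assumes Linux since it reads /proc):

```ruby
# config/puma.rb
# In each worker, log the resident set size (RSS) once a minute. A worker
# whose RSS climbs steadily and then vanishes with no Puma error message
# is a strong hint that the kernel OOM killer took it.
on_worker_boot do
  Thread.new do
    loop do
      # /proc/self/status is Linux-specific; VmRSS is reported in kB.
      rss_kb = File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_i
      puts "[puma worker #{Process.pid}] RSS: #{rss_kb / 1024} MB"
      sleep 60
    end
  end
end
```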
-
Hi @MSP-Greg, I've been spending some more time on performance and have been trying to figure out why memory usage is steadily increasing for my puma workers. I've been able to solve my 502 error problem for now by increasing the available memory and fixing a few performance issues in the app, and I haven't had any workers killed by OOM errors since. But I still want my memory usage to stay stable rather than increase over time, so I thought I would reopen this discussion to share some more of my findings.

I should also mention that I'm on Ruby 2.7, and we are currently working on a Ruby 3.x upgrade, so these memory issues might be improved or solved on Ruby 3.x. I've been tracking
So now I'm thinking that the problem is actually memory fragmentation, and not a memory leak. I've read a few really interesting articles along the way:
It looks like I might need to regularly call GC.compact.
Do you have any other advice or suggestions for calling GC.compact?
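For context, here's roughly what I've been experimenting with in my Puma config (just a sketch; the hook choice and the every-500-requests interval are my own guesses, not something recommended in this thread):

```ruby
# config/puma.rb
# Puma's out_of_band hook runs when a worker has finished a request and has
# nothing else queued, so the compaction pause doesn't land inside a request.
# Compacting on every idle moment would be wasteful, so only do it
# occasionally.
compaction_interval = 500
oob_count = 0
out_of_band do
  oob_count += 1
  GC.compact if (oob_count % compaction_interval).zero?
end

# Separately from anything Puma does, many of the fragmentation writeups
# also suggest setting MALLOC_ARENA_MAX=2 in the container environment
# (a glibc malloc tuning knob) to limit per-thread arena growth.
```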
-
Try switching to jemalloc instead. GC.compact hasn't really shown a lot of real-world results for me.
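If you do try jemalloc (typically enabled either by building Ruby with `--with-jemalloc` or by `LD_PRELOAD`-ing the library in the container image), a quick sanity check that the shared library is actually loaded into the Puma process on Linux is something like this (it checks for the library by name, so it assumes the LD_PRELOAD approach):

```ruby
# Run inside the app, e.g. from a Rails console or an on_worker_boot hook.
# If the jemalloc shared object is mapped into the process, its path shows
# up in /proc/self/maps.
loaded = File.read("/proc/self/maps").include?("jemalloc")
puts loaded ? "jemalloc is loaded" : "jemalloc is NOT loaded"
```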