
Puma 4.3.6 - Rails 6.0.3.4 - TLS brings in memory leak #2425

Closed
Melchizedech opened this issue Oct 13, 2020 · 17 comments

@Melchizedech

Melchizedech commented Oct 13, 2020

Describe the bug
After creating a "hello world" Rack application and configuring Puma to use TLS, I see the application use excessive memory.

Puma config:

threads 1, 1
workers 1
ssl_bind '0.0.0.0', 3000, {
    key: "certificates/certificate.key",
    cert: "certificates/certificate.crt",
    verify_mode: 'none',
    no_tlsv1: true,
    no_tlsv1_1: true
  }

bind 'tcp://0.0.0.0:3500'

# Specifies the `pidfile` that Puma will use.
pidfile ENV.fetch("PIDFILE") { "server.pid" }

CLI: docker run --rm -p 3000:3000 -p 3500:3500 puma_tls_memory_leak
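
The Rack app behind this config is just a hello-world endpoint; something equivalent to the following sketch (not the exact file from the repo linked below):

# config.ru — a minimal hello-world Rack app of the kind used in this repro
run ->(env) { [200, { "Content-Type" => "text/plain" }, ["Hello world\n"]] }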

To Reproduce

I've created a repo with a Dockerfile to make this reproducible, using your hello world Rack app:
https://github.com/Melchizedech/rails_puma_tls_memory_leak

Running while :; do curl -k --location 'https://localhost:3000' -v ; done will increase the RAM used by the application from 25MB (RSS) to 60MB (RSS). Then every minute or so, another extra MB (RSS) will be consumed.

If we instead run while :; do curl --location 'http://localhost:3000/' -v ; done, the RAM used by the app increases from 25 to 32 MB (RSS), and eventually the app starts consuming extra MB of RAM as well.
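
In case it helps quantify the numbers above, here is a rough sketch of how the worker's RSS can be sampled once a minute while the curl loop runs (a hypothetical helper, assuming Linux and the server.pid from the config above):

# sample_rss.rb — hypothetical monitoring helper, not part of the repro repo
require "time"

pid = Integer(ARGV[0] || File.read("server.pid").strip)
loop do
  rss_kb = File.read("/proc/#{pid}/status")[/^VmRSS:\s+(\d+)/, 1].to_i
  puts "#{Time.now.utc.iso8601} RSS=#{rss_kb / 1024} MB"
  sleep 60
end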

I definitely think there is an issue in Puma's TLS handling.

Here is the graph of one application we use; it is a small Rails app that receives close to no user traffic. The only "real" traffic it receives comes from the liveness probe.

Cluster A :
App in a production cluster

Cluster B :
App in another production cluster

Here are some screenshots of the RAM consumption of a brand new Rails app running in API-only mode, behind a similar configuration, while we hit a non-existent route.

Just after boot:
App after having booted in production mode

After 6 min of constant cURL
App after 6min of stress test on a non-existing route using HTTPS

After 28 min of constant cURL
App after 28min of stress test on a non-existing route using HTTPS
While I was testing the brand new Rails API app to reproduce the issue (before switching to the Rack hello world app), I tried jemalloc. I saw a strong improvement, though RAM usage was still growing, at a much, much slower rate.
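
One caveat for anyone repeating the jemalloc comparison: it is easy to think jemalloc is in use when it isn't. A quick hypothetical check from inside the Ruby process (Linux only):

# hypothetical check: is jemalloc actually mapped into this process?
loaded = File.readlines("/proc/self/maps").any? { |line| line.include?("jemalloc") }
puts(loaded ? "jemalloc is loaded" : "jemalloc is NOT loaded")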

Expected behavior
I would expect RAM consumption not to grow indefinitely when I use TLS with Puma.

Desktop (please complete the following information):

  • OS: Arch Linux (and also Debian via Docker)
  • Puma version: 4.3.6 (also reproduced on 5.0.2)

Any help debugging/investigating this is welcome.

@nateberkopec
Member

Is there any way you could post a graph of what your cluster looks like without TLS enabled? Maybe put an SSL terminator in front of Puma, like nginx? Of course, we'll also be interested in whether this reproduces on 5.x.

Your production graphs are compelling. I'll take a look at the repro.

@Melchizedech
Author

I can try deploying that app without TLS in one of our staging environments and give you the results in several days.

Using such a solution won't work. (TL;DR: this app requires encrypted traffic within our internal network; traffic between [outside, internal] <=> nginx would be encrypted, but not nginx <=> Puma.) We will most likely use a service mesh or a similar technology to remove the need for TLS within Puma.

Here are three screenshots for Puma 4.3.6 with HTTP.
At boot time
puma_http_beg

After 7min
puma_http_7min

After 27min
puma_http_27min

Here are three screenshots of Puma 5.0.2 with HTTPS.

At boot time
puma5_beg

After 6 minutes
puma5_6min

After 33 minutes
puma5_33min

These screenshots use the same hello world Rack app; the only difference was a change in the Gemfile:

diff --git a/Gemfile b/Gemfile
index 8418cc4..781766f 100644
--- a/Gemfile
+++ b/Gemfile
@@ -3,4 +3,4 @@ git_source(:github) { |repo| "https://github.com/#{repo}.git" }
 
 ruby '2.7.2'
 
-gem 'puma', '~> 4.1'
+gem 'puma', '~> 5.0'
diff --git a/Gemfile.lock b/Gemfile.lock
index 50eda8b..b55ce66 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -2,14 +2,14 @@ GEM
   remote: https://rubygems.org/
   specs:
     nio4r (2.5.4)
-    puma (4.3.6)
+    puma (5.0.2)
       nio4r (~> 2.0)
 
 PLATFORMS
   ruby
 
 DEPENDENCIES
-  puma (~> 4.1)
+  puma (~> 5.0)
 
 RUBY VERSION
    ruby 2.7.2p137

@nateberkopec
Member

Using such a solution won't work.

No problem, it was a long shot. Just figured I'd ask.

I'm less interested in memory usage after 20 to 30 minutes than I am in rate of growth after 6 to 12 hours of uptime.

If you have the same rate of growth in memory T+0 to T+1h as you do T+5h to T+6h, you have a leak. If it slows down over time, you don't.
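
To make that concrete, a toy sketch of the comparison (the numbers are invented, just to illustrate the rule):

# hour => RSS in MB, as sampled from monitoring (made-up values)
samples = { 0 => 25.0, 1 => 40.0, 5 => 58.0, 6 => 59.0 }
early_growth = samples[1] - samples[0] # T+0h .. T+1h
late_growth  = samples[6] - samples[5] # T+5h .. T+6h
puts(late_growth >= early_growth ? "same rate: looks like a leak" : "growth is slowing: probably not a leak")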

@Melchizedech
Author

Okay, I'll deploy this hello world rack application with TLS, and I'll let it live for a day or two. I'll come back with the monitoring results.

@MSP-Greg
Member

@Melchizedech

Thanks for all the info. Any info on which TLS protocol versions are most likely being used?

In #1735, several people had issues with AWS not supporting TLSv1.3, which forced downgrade negotiation to TLSv1.2.
There wasn't any mention of memory leaks. Hence, I'm wondering if this could be TLSv1.3 related.

I'll try to send several thousand connections to a server locally and see if I can repro a rise in memory.
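
Something along these lines; a sketch rather than the exact script, assuming the repro server from this issue on localhost:3000 with its self-signed certificate:

require "net/http"
require "openssl"

# one fresh TLS connection (and handshake) per request, like the curl loop above
1_000.times do
  Net::HTTP.start("localhost", 3000, use_ssl: true,
                  verify_mode: OpenSSL::SSL::VERIFY_NONE) do |http|
    http.get("/")
  end
end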

@Melchizedech
Author

Hi @MSP-Greg, the TLS conf disables TLS 1.0 and TLS 1.1, leaving us with TLS 1.2 and TLS 1.3.

I don't know which protocol the liveness probe uses, but my guess would be 1.3.

The tests I performed used TLS 1.3.
I have run another test, similar to the previous one, and got the same increase using only TLS 1.2
(while :; do curl --tls-max 1.2 -k --location 'https://localhost:3000' ; done) over a period of more than 30 minutes.

@Melchizedech
Author

FYI: using curl with TLS 1.2 instead of TLS 1.3 (so most likely involving one downgrade phase).

at boot
puma5_tls12_beg

at 9 min
puma5_tls12_9min

at 29 min
puma5_tls12_29min


Regarding the HTTP-only test: it is ongoing.

@Melchizedech
Author

Here is the plot for the same app as described in "Cluster A", but using HTTP. We can see there is no change in RAM consumption compared to TLS.
Screenshot_20201015_181228

@nateberkopec
Member

@Melchizedech Was that app receiving any traffic? I assume the TLS-on app behind your memory graphs was constantly receiving traffic.

@Melchizedech
Author

Melchizedech commented Oct 15, 2020 via email

@MSP-Greg
Member

@Melchizedech

I tested whether there were any Ruby objects that GC was having trouble with, and that doesn't seem to be the issue: running GC released all of them.

This weekend I'll run a few thousand TLS connections at a server and check the memory stats.
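
For anyone who wants to repeat that kind of check, one way to do it (a sketch, not necessarily the exact test used here):

require "objspace"

GC.start(full_mark: true, immediate_sweep: true)
puts "heap_live_slots: #{GC.stat(:heap_live_slots)}"
puts "ruby object bytes: #{ObjectSpace.memsize_of_all}"
# if these stay flat while RSS keeps climbing, the growth is likely in native
# (OpenSSL / malloc) memory rather than in Ruby-visible objects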

@dentarg
Member

dentarg commented Feb 6, 2021

@Melchizedech do you want to try again with the newly released Puma 5.2 series?

@l33z3r

l33z3r commented Oct 23, 2021

Hi guys, has anyone seen this issue resolved with the Puma 5.2 series?
I am having the same issue described here. Thanks!

@dentarg
Member

dentarg commented Oct 31, 2021

@l33z3r maybe you can find out for us?

@l33z3r

l33z3r commented Nov 1, 2021

Hi guys, just to chime in here. The issue I was having with leaking memory was not related to what is described in this issue.

@DaryMerckensGT

Hi guys, just to chime in here. The issue I was having with leaking memory was not related to what is described in this issue.

What was it related to?

@l33z3r

l33z3r commented Feb 16, 2022

Hi guys, just to chime in here. The issue I was having with leaking memory was not related to what is described in this issue.

What was it related to?

It was actually related to a memory leak in this gem: https://github.com/typhoeus/ethon
