Lack of SSL max_version can cause excessive slow negotiation, 502 on AWS w/ELB #1735

Closed
amichal opened this issue Mar 6, 2019 · 17 comments · Fixed by #2675

@amichal

amichal commented Mar 6, 2019

Steps to reproduce

  1. Rails app running in Puma (version 3.12.0, Ruby 2.5.3-p105) serving a simple text file out of /public

  2. puma running as a systemd service

[Service]
Type=forking
WorkingDirectory=/var/www/test-app/current
EnvironmentFile=/var/www/test-app/shared/.env.production
ExecStart=/bin/bash -lc 'cd /var/www/test-app/current && /usr/share/rvm/bin/rvm default do bundle exec puma --daemon --pidfile /var/www/test-app/shared/tmp/pids/puma.pid --bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem" --config /var/www/test-app/current/config/puma.rb --environment production --redirect-stdout /var/www/test-app/shared/log/puma.log --redirect-stderr /var/www/test-app/shared/log/puma.error.log'
PIDFile=/var/www/test-app/shared/tmp/pids/puma.pid

  3. Configure an AWS Application Load Balancer to accept HTTPS traffic on port 443 and forward it to the Puma instance above via HTTPS (port 3000)

  4. Make lots of requests via a browser. This can also be automated, e.g.:
    siege -b --no-parser -t1m -c30 https://lblb......eu-central-1.elb.amazonaws.com/robots.txt | grep 502

Expected behavior

100% 200 OK

Actual behavior

A small percentage (5-10 of the roughly 4000 requests completed in 1 minute) of requests in the above test fail with a 502 Bad Gateway.

System configuration

ruby 2.5.3-p105
Rails 5.2.2
puma version 3.12.0

AWS ELB logs for the failed requests show the request arriving, taking <1ms to be read by AWS and <0.5s to be processed in total. They show the Puma connection closed without a reason, consistently after 216 bytes had been received from Puma. Puma's stdout shows nothing, and neither does stderr. I've added x-aws-trace-id headers to my Rails request logs to confirm that the Rails app never gets a chance to log the failing requests.
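
For reference, a minimal sketch (an assumption, not the configuration from this report) of how such a trace id can be tagged onto Rails request logs; X-Amzn-Trace-Id is the header the ALB adds:

# config/environments/production.rb (sketch)
# Tag every Rails log line with the ALB trace id so failed requests in the
# ELB access logs can be matched against application log entries.
config.log_tags = [ ->(request) { request.headers['X-Amzn-Trace-Id'] } ]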

ENV

RAILS_ENV = production
RAILS_MAX_THREADS = 8
WEB_CONCURRENCY = 6
on a t2.xlarge with nothing else running on it

puma.rb

# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count


persistent_timeout = 75 # ALB keeps connections alive for 60s so we need to be longer than that

# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
#
port        ENV.fetch("PORT") { 3000 }

# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch("RAILS_ENV") { "development" }

# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch("WEB_CONCURRENCY") { 2 }

# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!

# If you are preloading your application and using Active Record, it's
# recommended that you close any connections to the database before workers
# are forked to prevent connection leakage.
#
# before_fork do
#   ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
# end

# The code in the `on_worker_boot` will be called if you are using
# clustered mode by specifying a number of `workers`. After each worker
# process is booted, this block will be run. If you are using the `preload_app!`
# option, you will want to use this block to reconnect to any threads
# or connections that may have been created at application boot, as Ruby
# cannot share connections between processes.
#
# on_worker_boot do
#   ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
# end
#

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart
@amichal
Author

amichal commented Mar 6, 2019

If I change only the --bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem" to --bind "tcp://0.0.0.0:3000", AWS never reports 502 errors (where "never" means not once in 10k sustained requests with the above siege line).

@amichal
Author

amichal commented Mar 7, 2019

Similarly, if I use siege to hit Puma on port 3000 directly, I get no 502s. The problem is the interaction between the AWS ALB and Puma with SSL.

@wkhatch

wkhatch commented Mar 11, 2019

Is there a reason you're using SSL in the container for Puma, as opposed to forwarding from HTTPS to HTTP on the ALB? How are your listeners configured for the ALB? I came across this while trying to determine the best strategy for configuring Puma for a Rails app on ECS, so I'm curious as to what you're doing. I'm planning on terminating SSL at the ALB and relying on the X-Forwarded-Proto header to determine whether the request was over HTTPS.
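
For what it's worth, a minimal sketch of that approach (an assumption, not configuration taken from this thread): with TLS terminated at the ALB, Rails can still detect and enforce HTTPS via the X-Forwarded-Proto header the ALB sets.

# config/environments/production.rb (sketch)
Rails.application.configure do
  # ActionDispatch::SSL treats a request as HTTPS when the proxy sends
  # X-Forwarded-Proto: https, so redirects and HSTS still work behind the ALB.
  config.force_ssl = true
end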

@mt-kelvintaywl

@amichal
I wonder if it's too late to chime in, but I noticed that the setting for persistent_timeout should perhaps be:

- persistent_timeout = 75
+ persistent_timeout 75

#1017

I was facing a similar issue and was thinking that setting persistent_timeout correctly should narrow down or solve the problem, based on the docs from AWS:

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-502-issues

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout
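
To spell out why the diff above matters, a short illustration (not from the original config): persistent_timeout is a method on Puma's configuration DSL, so with an equals sign Ruby only creates a local variable inside puma.rb and the setting never reaches Puma.

# puma.rb
persistent_timeout 75    # DSL method call: Puma's keep-alive timeout becomes 75s
persistent_timeout = 75  # local variable assignment: has no effect on Puma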

@mt-kelvintaywl

mt-kelvintaywl commented Mar 19, 2020

[Screenshot: CloudWatch graphs showing the HTTP 502 count and the target TLS negotiation error count over time]

This is still a relevant issue for us after setting the right `persistent_timeout` value (greater than the ALB idle timeout).

As you can see above, the patterns of the HTTP 502 errors and the target TLS negotiation error counts are similar and likely correlated.

[updated]
Please see #1735 (comment) below

@MSP-Greg
Member

What version of OpenSSL is being used?

@mt-kelvintaywl

Hi @MSP-Greg , thanks for responding :)

for context, in our dockerized containers:

$ bundle exec rails runner "require 'puma/minissl'; require 'puma/puma_http11'; Puma::Server.class; Puma::MiniSSL.check; puts Puma::MiniSSL::OPENSSL_LIBRARY_VERSION"

OpenSSL 1.1.1c  28 May 2019

$ bundle info puma

  * puma (4.3.3)

@MSP-Greg
Member

MSP-Greg commented Mar 19, 2020

I don't know...

I tried to find info about the AWS load balancers, and the docs I found seemed to indicate that they (at present) only support TLSv1.2?

Apps vary as to the level at which they participate in OpenSSL's negotiation. Puma (and many other apps) leave it to OpenSSL. But with any version of 1.1.1, the negotiation starts at TLSv1.3 and is negotiated down. There is a way to force OpenSSL to start at TLSv1.2, but it isn't available in Puma, somewhat based on 'why would you want to do that?', but now there might be a reason for doing so...

Anyway, I'm wondering if somehow OpenSSL is choking on all the (downward) negotiation, but I've never heard of that kind of issue. If so, one option would be building with OpenSSL 1.0.2?

I'm basing this on:

  1. People don't seem to have issues when using Puma with NGINX, and it supports TLSv1.3.

  2. People don't have issues when testing against Puma directly?

  3. Assuming I'm correct that AWS load balancers don't support TLSv1.3.
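
For illustration, a sketch in plain Ruby OpenSSL (not Puma's MiniSSL, and not something Puma exposed at the time) of the kind of cap being discussed: with the openssl gem 2.0+, a server context can refuse TLSv1.3 so the handshake starts and ends at TLSv1.2. The file paths are placeholders.

require 'socket'
require 'openssl'

ctx = OpenSSL::SSL::SSLContext.new
ctx.min_version = OpenSSL::SSL::TLS1_2_VERSION   # refuse anything older than TLSv1.2
ctx.max_version = OpenSSL::SSL::TLS1_2_VERSION   # refuse TLSv1.3, so no downward negotiation
ctx.cert = OpenSSL::X509::Certificate.new(File.read('server.pem'))  # placeholder cert
ctx.key  = OpenSSL::PKey::RSA.new(File.read('server.key'))          # placeholder key

server = OpenSSL::SSL::SSLServer.new(TCPServer.new('0.0.0.0', 3000), ctx)
# server.accept now only ever completes a TLSv1.2 handshake.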

@mt-kelvintaywl

mt-kelvintaywl commented Mar 24, 2020

@MSP-Greg
thanks so much for the correspondence.

I tried to find info about the AWS load balancers, and the docs I found seemed to indicate that they (at present) only support TLSv1.2?

Indeed, it is only up to TLSv1.2 for now, based on the documentation and their underlying library's roadmap.

I am totally not a TLS/SSL expert, but to my understanding it would indeed be a downward negotiation (of versions).
In this case, the AWS ELB is the client and requests TLS v1.2, which Puma (at v4.3.3 with OpenSSL 1.1.1) can negotiate.

# I have a local puma server with SSL bindings at localhost:3001
$ sslscan --show-ciphers https://localhost:3001

...
  TLS Fallback SCSV:
Server supports TLS Fallback SCSV

  TLS renegotiation:
Session renegotiation not supported

  TLS Compression:
Compression disabled

  Heartbleed:
TLS 1.2 not vulnerable to heartbleed
TLS 1.1 not vulnerable to heartbleed
TLS 1.0 not vulnerable to heartbleed

  Supported Server Cipher(s):
Preferred TLSv1.2  256 bits  ECDHE-RSA-AES256-GCM-SHA384   Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  ECDHE-RSA-AES256-SHA384       Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  ECDHE-RSA-AES256-SHA          Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  AES256-GCM-SHA384            
Accepted  TLSv1.2  256 bits  AES256-SHA256                
Accepted  TLSv1.2  256 bits  AES256-SHA                   
Accepted  TLSv1.2  256 bits  CAMELLIA256-SHA              
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-GCM-SHA256   Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-SHA256       Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-SHA          Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  AES128-GCM-SHA256            
Accepted  TLSv1.2  128 bits  AES128-SHA256                
Accepted  TLSv1.2  128 bits  AES128-SHA                   
Accepted  TLSv1.2  128 bits  CAMELLIA128-SHA
...          

So it would seem the negotiation should be fine (at least most of the time).

But yes, the downward negotiation may take time, and AWS ELBs can be strict about that.

The fact that the HTTP 502 is intermittent makes it hard to debug / reproduce.

I could verify with our CloudWatch metrics on the AWS Load Balancer that prior to the HTTPS setup, no HTTP 502s occurred (with consistent traffic before and after).

I'll try forcing TLS v1.2 with Puma to see if it solves the issue 🙇

@mt-kelvintaywl

mt-kelvintaywl commented Mar 26, 2020

updates:

We took the possible issue up with AWS tech support to see if they could shed some light, and their answer does hint at extending Puma's keep-alive timeout to be greater than the AWS ELB's idle connection timeout setting.

Particularly:

The load balancer encountered an SSL handshake error or SSL handshake timeout (10 seconds) when connecting to a target.

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-502-issues

Previously, we set Puma's persistent_timeout to exactly 10 seconds more than the AWS ALB's idle timeout and still encountered the intermittent 502s. As mentioned in the thread above, the downward TLS negotiation (from TLS v1.3 to TLS v1.2) likely took some time and can cause TLS handshake errors based on those thresholds.

I've modified this to a larger buffer beyond 10 seconds and set up alerts on our CloudWatch metrics for any presence of HTTP 502. So far so good for a day of traffic (of course, an absence of HTTP 502 doesn't necessarily mean this fixed it, but I think we can consider it so) :)

Thank you @MSP-Greg and @nateberkopec for the look at this 🙇

@MSP-Greg
Member

@mt-kelvintaywl

Thanks for the update. As mentioned, I previously thought the need for the equivalent of a max_version= protocol method was unlikely. But you've shown a good reason for one.

If one has an app with fast response times, the time for downward negotiation is a more significant part of the total response time.

Adding the method is relatively easy assuming OpenSSL 1.1.x, but gets a bit more involved when compatibility with 1.0.2 is desired...

@nateberkopec nateberkopec changed the title from "Intermittent 502 Bad Gateway from AWS Application Load Balancer to puma via SSL" to "Lack of SSL max_version can cause excessive slow negotiation, 502 on AWS w/ELB" on Mar 30, 2020
@nateberkopec nateberkopec added feature and removed bug labels Mar 30, 2020
@amichal
Author

amichal commented Oct 6, 2020

I've ignored this issue for a long time (it's very intermittent for us), but I just wanted to drop back in and say it still happens in 4.3.5. I'm trying persistent_timeout 74 and have tried unsuccessfully playing around with workers/threads to see if it's statistically related to the connection pool. We have recently learned that it seems to happen more after the site has been idle for 30s or more. I'm working on some more concrete tests, but if anyone has a deeper understanding of what's going on and wants some logs to track this down, let me know.

@amichal
Author

amichal commented Oct 6, 2020

Correctly setting persistent_timeout 75 appears to have resolved this for me.

@amichal
Author

amichal commented Oct 13, 2020

And now I think I understand the mechanism. The default PERSISTENT_TIMEOUT = 20 seconds in Puma is less than the default idle_timeout.timeout_seconds = 60 seconds in ALBs. When we switched to TLS from the ALB to Puma, we effectively caused the ALB and Puma to start using keep-alives. Puma closes the connection first, the ALB considers the backend connection dead, and it serves a 502 for one more request. My attempt to fix this ages ago had a typo but was the right solution for me (and, it turns out, for others on my team in other contexts). It was hard for us to find because the connection needed to idle for 20-60s: any shorter and Puma would still have a connection, any longer and the ALB would have dropped it.
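
Put as configuration (values as described above; the only change needed is the single corrected line):

# puma.rb
# Puma's default keep-alive timeout (PERSISTENT_TIMEOUT) is 20s, while the ALB's
# default idle timeout is 60s. A connection idle for 20-60s is closed by Puma but
# still considered usable by the ALB, so the next request on it becomes a 502.
# Raising Puma's timeout above the ALB's removes that window:
persistent_timeout 75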

@MSP-Greg
Member

I've created PR #2426, which allows a no_tlsv1_3 hash key in ssl_bind. Setting it to true will disable TLSv1.3.

If anyone can try it and see if it helps with the AWS issue, it would be appreciated.
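
A sketch of what that could look like in a config file, reusing the bind options from the original report (the hash key is taken from the PR description, so treat this as untested):

# puma.rb (sketch, against the PR #2426 branch)
ssl_bind '0.0.0.0', '3000',
  key:        '/var/www/test-app/shared/ssl-cert-snakeoil.key',
  cert:       '/var/www/test-app/shared/ssl-cert-snakeoil.pem',
  no_tlsv1_3: true   # disable TLSv1.3 so the ALB handshake starts at TLSv1.2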

@siddhant-mohan

I am getting a lot of 502s, and after a 502 the container somehow gets restarted. I am using ECS with an AWS ELB and ACM for the SSL certificate. I am also running the Rails server directly on port 8080 instead of running it on 3000 and using nginx to bind port 80 to 3000; will that affect performance in terms of 502s?

@misdoro
Contributor

misdoro commented Aug 30, 2021

The 502s on AWS ELB seem to be caused by SSL connections being closed by Puma without sending the shutdown alert, as described in #2675. The shutdown alerts were accidentally removed by the changes in #1334 a while ago.

Increasing persistent_timeout above 60 seconds half-solves the problem by ensuring Puma does not close connections before the ELB does. However, if a worker is stopped for some reason, for example when puma-worker-killer is used, it still causes some 502s for the clients.
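
A rough sketch in plain Ruby OpenSSL (not Puma's MiniSSL, shown only to illustrate the mechanism) of the difference between an orderly TLS close, which sends the close_notify alert, and simply dropping the raw socket:

require 'socket'
require 'openssl'

tcp = TCPSocket.new('example.org', 443)   # placeholder peer
ssl = OpenSSL::SSL::SSLSocket.new(tcp)
ssl.sync_close = true   # close the TCP socket together with the TLS layer
ssl.connect

ssl.sysclose            # orderly: sends close_notify, then closes the socket
# Closing only the underlying TCP socket would drop the connection without the
# alert, which is roughly what the ALB interprets as a backend failure (502).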

Our production case:
Number of 502 responses per 3-hour interval (Kibana chart):

  • Before 2021-08-25: Standard puma configuration, about 100 HTTP 502 per hour
  • Between 2021-08-25 and 2021-08-27: persistent_timeout 75, about 10 HTTP 502 per hour due to worker restarts by puma worker killer
  • Since 2021-08-27: default persistent timeout + the code from #2675 (Fix performance of server-side SSL connection close): less than 10 HTTP 502 per day, still correlated with worker restarts, probably when the ALB receives the SSL shutdown alert right after, or at the same time as, it sends the request to Puma.
