Lack of SSL max_version can cause excessive slow negotiation, 502 on AWS w/ELB #1735

Closed
amichal opened this issue Mar 6, 2019 · 17 comments · Fixed by #2675

@amichal

amichal commented Mar 6, 2019

Steps to reproduce

  1. Rails app running in Puma (version 3.12.0, Ruby 2.5.3-p105) serving a simple text file out of /public

  2. puma running as a systemd service

[Service]
Type=forking
WorkingDirectory=/var/www/test-app/current
EnvironmentFile=/var/www/test-app/shared/.env.production
ExecStart=/bin/bash -lc 'cd /var/www/test-app/current && /usr/share/rvm/bin/rvm default do bundle exec puma --daemon --pidfile /var/www/test-app/shared/tmp/pids/puma.pid --bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem" --config /var/www/test-app/current/config/puma.rb --environment production --redirect-stdout /var/www/test-app/shared/log/puma.log --redirect-stderr /var/www/test-app/shared/log/puma.error.log'
PIDFile=/var/www/test-app/shared/tmp/pids/puma.pid

  3. Configure an AWS Application Load Balancer to accept HTTPS traffic on port 443 and forward it to the Puma instance above via HTTPS (port 3000)

  4. Make lots of requests via a browser. This can also be automated, e.g.:
    siege -b --no-parser -t1m -c30 https://lblb......eu-central-1.elb.amazonaws.com/robots.txt | grep 502

Expected behavior

100% 200 OK

Actual behavior

A small percentage (5-10 of the roughly 4000 requests completed in 1 minute) of requests in the above test fail with a 502 Bad Gateway.

System configuration

ruby 2.5.3-p105
Rails 5.2.2
puma version 3.12.0

AWS ELB logs for the failed requests show the request arriving, taking <1ms to be read by AWS and <0.5s to be processed in total. They show the Puma connection closed without a reason, consistently after 216 bytes had been received from Puma. Puma's stdout shows nothing, and neither does stderr. I've added x-aws-trace-id headers to my Rails request logs to confirm that the Rails app never gets a chance to log the failing requests.
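
For reference, a minimal sketch (an assumption, not the configuration from this report) of how such a trace id can be tagged onto Rails request logs; X-Amzn-Trace-Id is the header the ALB adds:

# config/environments/production.rb (sketch)
# Tag every Rails log line with the ALB trace id so failed requests in the
# ELB access logs can be matched against application log entries.
config.log_tags = [ ->(request) { request.headers['X-Amzn-Trace-Id'] } ]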

ENV

RAILS_ENV = production
RAILS_MAX_THREADS = 8
WEB_CONCURRENCY = 6
on a t2.xlarge with nothing else running on it

puma.rb

# Puma can serve each request in a thread from an internal thread pool.
# The `threads` method setting takes two numbers: a minimum and maximum.
# Any libraries that use thread pools should be configured to match
# the maximum value specified for Puma. Default is set to 5 threads for minimum
# and maximum; this matches the default thread size of Active Record.
#
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count


persistent_timeout = 75 # ALB keeps connections alive for 60s so we need to be longer than that

# Specifies the `port` that Puma will listen on to receive requests; default is 3000.
#
port        ENV.fetch("PORT") { 3000 }

# Specifies the `environment` that Puma will run in.
#
environment ENV.fetch("RAILS_ENV") { "development" }

# Specifies the number of `workers` to boot in clustered mode.
# Workers are forked webserver processes. If using threads and workers together
# the concurrency of the application would be max `threads` * `workers`.
# Workers do not work on JRuby or Windows (both of which do not support
# processes).
#
workers ENV.fetch("WEB_CONCURRENCY") { 2 }

# Use the `preload_app!` method when specifying a `workers` number.
# This directive tells Puma to first boot the application and load code
# before forking the application. This takes advantage of Copy On Write
# process behavior so workers use less memory. If you use this option
# you need to make sure to reconnect any threads in the `on_worker_boot`
# block.
#
# preload_app!

# If you are preloading your application and using Active Record, it's
# recommended that you close any connections to the database before workers
# are forked to prevent connection leakage.
#
# before_fork do
#   ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
# end

# The code in the `on_worker_boot` will be called if you are using
# clustered mode by specifying a number of `workers`. After each worker
# process is booted, this block will be run. If you are using the `preload_app!`
# option, you will want to use this block to reconnect to any threads
# or connections that may have been created at application boot, as Ruby
# cannot share connections between processes.
#
# on_worker_boot do
#   ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
# end
#

# Allow puma to be restarted by `rails restart` command.
plugin :tmp_restart
@amichal
Author

amichal commented Mar 6, 2019

If I change only the --bind "ssl://0.0.0.0:3000?key=/var/www/test-app/shared/ssl-cert-snakeoil.key&cert=/var/www/test-app/shared/ssl-cert-snakeoil.pem" to --bind "tcp://0.0.0.0:3000", AWS never reports 502 errors (where "never" means not once in 10k sustained requests with the above siege line).

@amichal
Author

amichal commented Mar 7, 2019

Similarly, if I use siege to hit Puma on port 3000 directly, I get no 502s. The problem is the interaction between the AWS ALB and Puma with SSL.

@wkhatch

wkhatch commented Mar 11, 2019

Is there a reason you're using SSL in the container for Puma, as opposed to forwarding from HTTPS to HTTP on the ALB? How are your listeners configured for the ALB? I came across this while trying to determine the best strategy for configuring Puma for a Rails app on ECS, so I'm curious as to what you're doing. I'm planning on terminating SSL at the ALB and relying on the X-Forwarded-Proto header to determine whether the request was over HTTPS.
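
For what it's worth, a minimal sketch of that approach (an assumption, not configuration taken from this thread): with TLS terminated at the ALB, Rails can still detect and enforce HTTPS via the X-Forwarded-Proto header the ALB sets.

# config/environments/production.rb (sketch)
Rails.application.configure do
  # ActionDispatch::SSL treats a request as HTTPS when the proxy sends
  # X-Forwarded-Proto: https, so redirects and HSTS still work behind the ALB.
  config.force_ssl = true
end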

@mt-kelvintaywl

@amichal
I wonder if it's too late to chime in, but I noticed that the setting for persistent_timeout should perhaps be:

- persistent_timeout = 75
+ persistent_timeout 75

#1017

I was facing a similar issue and was thinking that setting persistent_timeout correctly should narrow down or solve the problem, based on the docs from AWS:

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-502-issues

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout
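
To spell out why the diff above matters, a short illustration (not from the original config): persistent_timeout is a method on Puma's configuration DSL, so with an equals sign Ruby only creates a local variable inside puma.rb and the setting never reaches Puma.

# puma.rb
persistent_timeout 75    # DSL method call: Puma's keep-alive timeout becomes 75s
persistent_timeout = 75  # local variable assignment: has no effect on Puma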

@mt-kelvintaywl

mt-kelvintaywl commented Mar 19, 2020

[Screenshot: CloudWatch graphs showing the HTTP 502 count and the target TLS negotiation error count over time]

This is still a relevant issue for us after setting the right `persistent_timeout` value (greater than the ALB idle timeout).

As you can see above, the patterns of the HTTP 502 errors and the target TLS negotiation error counts are similar and likely correlated.

[updated]
Please see #1735 (comment) below

@MSP-Greg
Member

What version of OpenSSL is being used?

@mt-kelvintaywl

Hi @MSP-Greg , thanks for responding :)

for context, in our dockerized containers:

$ bundle exec rails runner "require 'puma/minissl'; require 'puma/puma_http11'; Puma::Server.class; Puma::MiniSSL.check; puts Puma::MiniSSL::OPENSSL_LIBRARY_VERSION"

OpenSSL 1.1.1c  28 May 2019

$ bundle info puma

  * puma (4.3.3)

@MSP-Greg
Member

MSP-Greg commented Mar 19, 2020

I don't know...

I tried to find info about the AWS load balancers, and the docs I found seemed to indicate that they (at present) only support TLSv1.2?

Apps vary as to the level at which they participate in OpenSSL's negotiation. Puma (and many other apps) leave it to OpenSSL. But with any version of 1.1.1, the negotiation starts at TLSv1.3 and is negotiated down. There is a way to force OpenSSL to start at TLSv1.2, but it isn't available in Puma, somewhat based on 'why would you want to do that?', but now there might be a reason for doing so...

Anyway, I'm wondering if somehow OpenSSL is choking on all the (downward) negotiation, but I've never heard of that kind of issue. If so, one option would be building with OpenSSL 1.0.2?

I'm basing this on:

  1. People don't seem to have issues when using Puma with NGINX, and it supports TLSv1.3.

  2. People don't have issues when testing against Puma directly?

  3. Assuming I'm correct that AWS load balancers don't support TLSv1.3.
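
For illustration, a sketch in plain Ruby OpenSSL (not Puma's MiniSSL, and not something Puma exposed at the time) of the kind of cap being discussed: with the openssl gem 2.0+, a server context can refuse TLSv1.3 so the handshake starts and ends at TLSv1.2. The file paths are placeholders.

require 'socket'
require 'openssl'

ctx = OpenSSL::SSL::SSLContext.new
ctx.min_version = OpenSSL::SSL::TLS1_2_VERSION   # refuse anything older than TLSv1.2
ctx.max_version = OpenSSL::SSL::TLS1_2_VERSION   # refuse TLSv1.3, so no downward negotiation
ctx.cert = OpenSSL::X509::Certificate.new(File.read('server.pem'))  # placeholder cert
ctx.key  = OpenSSL::PKey::RSA.new(File.read('server.key'))          # placeholder key

server = OpenSSL::SSL::SSLServer.new(TCPServer.new('0.0.0.0', 3000), ctx)
# server.accept now only ever completes a TLSv1.2 handshake.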

@mt-kelvintaywl

mt-kelvintaywl commented Mar 24, 2020

@MSP-Greg
thanks so much for the correspondence.

I tried to find info about the AWS load balancers, and the docs I found seemed to indicate that they (at present) only support TLSv1.2?

Indeed, it is only up to TLSv1.2 for now, based on the documentation and their underlying library's roadmap.

I am totally not a TLS/SSL expert, but to my understanding it would indeed be a downward negotiation (of versions).
In this case, the AWS ELB is the client and requests TLS v1.2, which Puma (at v4.3.3 with OpenSSL 1.1.1) can negotiate.

# I have a local puma server with SSL bindings at localhost:3001
$ sslscan --show-ciphers https://localhost:3001

...
  TLS Fallback SCSV:
Server supports TLS Fallback SCSV

  TLS renegotiation:
Session renegotiation not supported

  TLS Compression:
Compression disabled

  Heartbleed:
TLS 1.2 not vulnerable to heartbleed
TLS 1.1 not vulnerable to heartbleed
TLS 1.0 not vulnerable to heartbleed

  Supported Server Cipher(s):
Preferred TLSv1.2  256 bits  ECDHE-RSA-AES256-GCM-SHA384   Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  ECDHE-RSA-AES256-SHA384       Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  ECDHE-RSA-AES256-SHA          Curve P-256 DHE 256
Accepted  TLSv1.2  256 bits  AES256-GCM-SHA384            
Accepted  TLSv1.2  256 bits  AES256-SHA256                
Accepted  TLSv1.2  256 bits  AES256-SHA                   
Accepted  TLSv1.2  256 bits  CAMELLIA256-SHA              
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-GCM-SHA256   Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-SHA256       Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-SHA          Curve P-256 DHE 256
Accepted  TLSv1.2  128 bits  AES128-GCM-SHA256            
Accepted  TLSv1.2  128 bits  AES128-SHA256                
Accepted  TLSv1.2  128 bits  AES128-SHA                   
Accepted  TLSv1.2  128 bits  CAMELLIA128-SHA
...          

So it would seem the negotiation should be fine (at least most of the time).

But yes, the downward negotiation may take time, and AWS ELBs can be strict about that.

The fact that the HTTP 502 is intermittent makes it hard to debug / reproduce.

I could verify with our CloudWatch metrics on the AWS Load Balancer that prior to the HTTPS setup, no HTTP 502s occurred (with consistent traffic before and after).

I'll try forcing TLS v1.2 with Puma to see if it solves the issue 🙇

@mt-kelvintaywl

mt-kelvintaywl commented Mar 26, 2020

updates:

We took the possible issue up with AWS tech support to see if they could shed some light, and their answer does hint at extending Puma's keep-alive timeout to be greater than the AWS ELB's idle connection timeout setting.

Particularly:

The load balancer encountered an SSL handshake error or SSL handshake timeout (10 seconds) when connecting to a target.

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-502-issues

Previously, we set Puma's persistent_timeout to exactly 10 seconds more than the AWS ALB's idle timeout and still encountered the intermittent 502s. As mentioned in the thread above, the downward TLS negotiation (from TLS v1.3 to TLS v1.2) likely took some time and can cause TLS handshake errors based on those thresholds.

I've modified this to a larger buffer beyond 10 seconds and set up alerts on our CloudWatch metrics for any presence of HTTP 502. So far so good for a day of traffic (of course, an absence of HTTP 502 doesn't necessarily mean this fixed it, but I think we can consider it so) :)

Thank you @MSP-Greg and @nateberkopec for the look at this 🙇

@MSP-Greg
Member

@mt-kelvintaywl

Thanks for the update. As mentioned, I previously thought the need for the equivalent of a max_version= protocol method was unlikely. But you've shown a good reason for one.

If one has an app with fast response times, the time for downward negotiation is a more significant part of the total response time.

Adding the method is relatively easy assuming OpenSSL 1.1.x, but gets a bit more involved when compatibility with 1.0.2 is desired...

@nateberkopec nateberkopec changed the title from "Intermittent 502 Bad Gateway from AWS Application Load Balancer to puma via SSL" to "Lack of SSL max_version can cause excessive slow negotiation, 502 on AWS w/ELB" on Mar 30, 2020
@nateberkopec nateberkopec added feature and removed bug labels Mar 30, 2020
@amichal
Author

amichal commented Oct 6, 2020

I've ignored this issue for a long time (it's very intermittent for us), but I just wanted to drop back in and say it still happens in 4.3.5. I'm trying persistent_timeout 74 and have tried unsuccessfully playing around with workers/threads to see if it's statistically related to the connection pool. We have recently learned that it seems to happen more after the site has been idle for 30s or more. I'm working on some more concrete tests, but if anyone has a deeper understanding of what's going on and wants some logs to track this down, let me know.

@amichal
Author

amichal commented Oct 6, 2020

Correctly setting persistent_timeout 75 appears to have resolved this for me.

@amichal
Author

amichal commented Oct 13, 2020

And now I think I understand the mechanism. The default PERSISTENT_TIMEOUT = 20 seconds in Puma is less than the default idle_timeout.timeout_seconds = 60 seconds in ALBs. When we switched to TLS from the ALB to Puma, we effectively caused the ALB and Puma to start using keep-alives. Puma closes the connection first, the ALB considers the backend connection dead, and it serves a 502 for one more request. My attempt to fix this ages ago had a typo but was the right solution for me (and, it turns out, for others on my team in other contexts). It was hard for us to find because the connection needed to idle for 20-60s: any shorter and Puma would still have a connection, any longer and the ALB would have dropped it.
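
Put as configuration (values as described above; the only change needed is the single corrected line):

# puma.rb
# Puma's default keep-alive timeout (PERSISTENT_TIMEOUT) is 20s, while the ALB's
# default idle timeout is 60s. A connection idle for 20-60s is closed by Puma but
# still considered usable by the ALB, so the next request on it becomes a 502.
# Raising Puma's timeout above the ALB's removes that window:
persistent_timeout 75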

@MSP-Greg
Member

I've created PR #2426, which allows a no_tlsv1_3 hash key in ssl_bind. Setting it to true will disable TLSv1.3.

If anyone can try it and see if it helps with the AWS issue, it would be appreciated.
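
A sketch of what that could look like in a config file, reusing the bind options from the original report (the hash key is taken from the PR description, so treat this as untested):

# puma.rb (sketch, against the PR #2426 branch)
ssl_bind '0.0.0.0', '3000',
  key:        '/var/www/test-app/shared/ssl-cert-snakeoil.key',
  cert:       '/var/www/test-app/shared/ssl-cert-snakeoil.pem',
  no_tlsv1_3: true   # disable TLSv1.3 so the ALB handshake starts at TLSv1.2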

@siddhant-mohan

I am getting a lot of 502s, and after a 502 the container somehow gets restarted. I am using ECS with an AWS ELB and ACM for the SSL certificate. I am also running the Rails server directly on port 8080 instead of running it on 3000 and using nginx to bind port 80 to 3000; will that affect performance in terms of 502s?

@misdoro
Contributor

misdoro commented Aug 30, 2021

The 502s on AWS ELB seem to be caused by SSL connections being closed by Puma without sending the shutdown alert, as described in #2675. The shutdown alerts were accidentally removed by the changes in #1334 a while ago.

Increasing persistent_timeout above 60 seconds half-solves the problem by ensuring Puma does not close connections before the ELB does. However, if a worker is stopped for some reason, for example when puma-worker-killer is used, it still causes some 502s for the clients.
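
A rough sketch in plain Ruby OpenSSL (not Puma's MiniSSL, shown only to illustrate the mechanism) of the difference between an orderly TLS close, which sends the close_notify alert, and simply dropping the raw socket:

require 'socket'
require 'openssl'

tcp = TCPSocket.new('example.org', 443)   # placeholder peer
ssl = OpenSSL::SSL::SSLSocket.new(tcp)
ssl.sync_close = true   # close the TCP socket together with the TLS layer
ssl.connect

ssl.sysclose            # orderly: sends close_notify, then closes the socket
# Closing only the underlying TCP socket would drop the connection without the
# alert, which is roughly what the ALB interprets as a backend failure (502).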

Our production case:
Number of 502 responses per 3-hour interval (Kibana chart):

  • Before 2021-08-25: Standard puma configuration, about 100 HTTP 502 per hour
  • Between 2021-08-25 and 2021-08-27: persistent_timeout 75, about 10 HTTP 502 per hour due to worker restarts by puma worker killer
  • Since 2021-08-27: default persistent timeout + the code from #2675 (Fix performance of server-side SSL connection close): less than 10 HTTP 502 per day, still correlated with worker restarts, probably when the ALB receives the SSL shutdown alert right after, or at the same time as, it sends the request to Puma.
