Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout during long file upload #2574

Closed
darkhelmet opened this issue Mar 12, 2021 · 11 comments · Fixed by #2606
Closed

Timeout during long file upload #2574

darkhelmet opened this issue Mar 12, 2021 · 11 comments · Fixed by #2606
Labels

Comments

@darkhelmet
Copy link
Contributor

Not really a bug, more clarification

We use a little file upload server that is a simple rack application and uses Puma as the server. I upgraded it from 4.3.3 to 5.2.2 and now we see timeouts after 30 seconds (first_data_timeout) when uploading files to it where the upload takes longer than 30 seconds. I can increase first_data_timeout and the timeout happens at the appropriate time.

An example using curl of an upload is pretty simple curl --data-binary @file.zip server:5000/upload?some=params

The application runs in our Kubernetes cluster behind Traefik which sits behind an AWS ALB/WAF. All that doesn't seem to matter though as using port-forward to talk directly to the pod gives the same result.

Looking in some code it looks like first_data_timeout is deprecated in v6? Or is that just the attr_writer for it?

Just wanting some guidance as to what I should be looking at to solve this problem. I could just revert back to 4.3.3 (which I'll probably do for now), but I'd like to keep upgrading and running the newer version for security, performance, etc., reasons.

I'm slightly confused as to why it's happening in the first place, since I don't know where the data is buffering for it to be waiting and causing puma to timeout. Or is the name first_data_timeout just not accurate in this case and it is getting data, but it's timing out for another reason?

Do I need some other puma configuration? Should I be running nginx in a sidebar to sit in front of puma?

Puma config

workers 1
threads 1, 16

preload_app!

rackup      DefaultRackup
port        5000
environment 'production'
@MSP-Greg
Copy link
Member

MSP-Greg commented Mar 12, 2021

Looking in some code it looks like first_data_timeout is deprecated in v6? Or is that just the attr_writer for it?

Just the attr_writer in Puma::Server. It will continue to be in Puma::DSL, which defines the methods used in a Puma config file.

You can add it to config via:

first_data_timeout <int seconds>

Or is the name first_data_timeout just not accurate in this case and it is getting data, but it's timing out for another reason?

Not sure. The DSL comment is "Define how long the tcp socket stays open, if no data has been received.", which certainly implies 'inactivity'. Could it have accidentally changed to what one might call first_request_timeout, not sure.

I can increase first_data_timeout and the timeout happens at the appropriate time.

If you do that in config, are you okay? A user may want to configure Puma so it sets a limit on a request body, whether that be by time or size. And, there maybe a need for an 'inactive' timeout also. There's several timeout settings for clients, I haven't reviewed for a while, as I've been looking at tests and response time...

Off-topic
Re the v6 deprecations, this is an example of what may have originally been a constant setting in Puma::Server, and was later added to Puma::DSL so it could be user configured. I really don't like an API that expose needless items, as it leads to the question of why it exists.

Also, there are a few classes where there's a pattern similar to:

obj = PumaSomething.new(opts)
# use a few attr_writers with opts
obj.attr1 = f(opts[:key1])
obj.attr2 = g(opts[:key2])
obj.meth3 obj3
obj.run(args)

We may change those patterns to simply PumaSomething.run(args, obj3, opts)

@dentarg
Copy link
Member

dentarg commented Mar 12, 2021

The application runs in our Kubernetes cluster behind Traefik which sits behind an AWS ALB/WAF. All that doesn't seem to matter though as using port-forward to talk directly to the pod gives the same result.

@darkhelmet does the problem also reproduce locally? I.e. running puma and curl on your laptop/work station

@darkhelmet
Copy link
Contributor Author

@MSP-Greg Yes, if I configure first_data_timeout in my puma.rb file, say first_data_timeout 60 it will timeout after 1 minute instead of 30 seconds. I could try cranking that number way up, but I haven't looked at what else that may affect.

@dentarg It was a long day and I tried to do that but of course it uploads the file so quickly that it works fine. So I did that thing where you look at yourself in the reflection in your monitor and are a little disappointed and moved on. I would have to figure out how to cause the upload to be slow locally I guess, which I haven't looked into how to do (since I'm trying to go directly to the application). I don't think the kubectl port-forward is buffering anything

The more I think about it the correct answer is I should use a little nginx sidecar container that does proxy_pass and can buffer the contents for Puma. When we first deployed this a long time ago, I thought of that, but it seemed to perform well without it. We put Cloudfront in front of it for serving files, so it's not like it's getting beat up by requests, and I was like "well, it deals with these uploads fine". Just some days it has to deal with big file uploads (1GB+) and they take a while. But still...if puma doesn't think there is data for it...where is that data buffering.

@cjlarose
Copy link
Member

The more I think about it the correct answer is I should use a little nginx sidecar container that does proxy_pass and can buffer the contents for Puma.

While this is possible and might help in your use case, what you're describing is something that puma should handle correctly, so it does seem like you're running into legitimate bug. Since you were on 4.x before and that working for you, I suspect the change that might be affecting you was our rewrite of the Reactor, which is responsible for client buffering in #2279

If you're able to test different versions of puma, it would help us if you figure out which version is the one that starts breaking for you. My guess is that it's version 5.0.3. But if you can identify the first failing version, that'll help us debug the issue. Alternatively, if you're able to put together a minimal application that reproduces the problem, that would work, too!

@cjlarose
Copy link
Member

Also I figure I should probably loop in @wjordan. Is this something you've seen before?

@wjordan
Copy link
Contributor

wjordan commented Mar 13, 2021

I haven't seen this before, but it does look possible that #2279 introduced a regression where first_data_timeout now unintentionally works as a 'total request timeout' instead of an first/in-between-bytes timeout which is how it behaved previously (the 'first data' name and comment description had already become obsolete at some point). In most cases the default timeout is large enough to not introduce an issue and we probably don't have tests distinguishing these two cases which is perhaps why it slipped by until now.

I think this could be fixed by adding an else clause to the conditional in reactor_wakeup, so the timeout is reset whenever any bytes are received on the connection:

puma/lib/puma/server.rb

Lines 294 to 298 in c68e20d

if client.try_to_finish || (shutdown && !client.can_close?)
@thread_pool << client
elsif shutdown || client.timeout == 0
client.timeout!
end

Something like:

else
  client.set_timeout(first_data_timeout)
  false
end

That said, perhaps a fix for this would be a good time to introduce a new between_bytes_timeout config (that defaults to the value of first_data_timeout) for the timeout in between bytes, so that the name/description for first_data_timeout is once again accurate?

For reference, Varnish has a separate first_byte_timeout and between_bytes_timeout config like this, while nginx has a separate timeout for the header (client_header_timeout) and body (client_body_timeout).

@darkhelmet
Copy link
Contributor Author

Okay, local testing was a success!

(Replace cr12.zip with any "large" file that would take more than 30 seconds to upload at 1MB/s)

Using puma 5.0.2 and curl -v -X PUT --data-binary @cr12.zip http://localhost:5000/upload?file=cr12.zip --limit-rate 1m works
Using puma 5.0.3 and curl -v -X PUT --data-binary @cr12.zip http://localhost:5000/upload?file=cr12.zip --limit-rate 1m fails

With latest puma 5.2.2 and config.ru of run lambda { |env| [200, {"Content-Type" => "text/plain"}, ["Hello World"]] } using curl -v -X PUT --data-binary @cr12.zip http://localhost:9292 --limit-rate 1m fails at the 30 second mark.

So it appears 5.0.3 is the breaking version.

@wjordan I also pulled the master branch, made that suggested change, built/installed the gem, ran the simple repro case, and it works!

@darkhelmet
Copy link
Contributor Author

darkhelmet commented Mar 13, 2021

I started a PR, not sure on the tests I wrote though: #2575

(Unrelated, it's showing me as a contributor, I was like wtf, and found my random PR from 2013 lol #305)

@nateberkopec
Copy link
Member

Welcome back, then @darkhelmet! 😆

@cjlarose cjlarose added the bug label Mar 18, 2021
@bmclean
Copy link

bmclean commented Mar 29, 2021

I like the between_bytes_timeout concept @darkhelmet. Since 5.0.3 we have been running with a large value set for first_data_timeout to handle slow clients (see #2512).

@darkhelmet
Copy link
Contributor Author

I'll need some help on my PR, I think I'm close, but I'm missing some stuff.

wjordan added a commit to wjordan/puma that referenced this issue Apr 20, 2021
Fixes puma#2574 (regression in v5.0.3).

Co-authored-by: Daniel Huckstep <daniel@huckstep.ca>
Co-authored-by: Will Jordan <will@code.org>
MSP-Greg pushed a commit that referenced this issue Apr 20, 2021
* Fix timing out when we get a slow request
Fixes #2574 (regression in v5.0.3).

Co-authored-by: Daniel Huckstep <daniel@huckstep.ca>
Co-authored-by: Will Jordan <will@code.org>

* Update test_timeout_in_data_phase
Complete request to ensure short timeout is used properly.

Co-authored-by: Daniel Huckstep <daniel@huckstep.ca>
JuanitoFatas pushed a commit to JuanitoFatas/puma that referenced this issue Sep 9, 2022
* Fix timing out when we get a slow request
Fixes puma#2574 (regression in v5.0.3).

Co-authored-by: Daniel Huckstep <daniel@huckstep.ca>
Co-authored-by: Will Jordan <will@code.org>

* Update test_timeout_in_data_phase
Complete request to ensure short timeout is used properly.

Co-authored-by: Daniel Huckstep <daniel@huckstep.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants