
Second Request Very Slow in Clustered Mode on 5.0.3, 5.0.4 and master. #2484

Open · joshuapinter opened this issue Nov 7, 2020 · 46 comments

@joshuapinter commented Nov 7, 2020

Describe the bug
I don't know all the details yet, so I'll try to explain it the best I can. I noticed that the second (and subsequent) requests after launching the puma server were very slow compared to the first, e.g. 30 seconds instead of 0.3 seconds. I also noticed that it takes much longer to quit / exit the server: around 15 seconds instead of 0.5.

This was appearing on any kind of cluster, even a cluster of 1. Commenting out workers entirely fixed the issue.

I recently upgraded from 4.2.1 to 5.0.4, so I decided to roll back and then upgrade step by step to see which version introduced the issue.

I was able to run things just fine on 5.0.2, but on 5.0.3 and then on 5.0.4 the issue appears.

Puma config:

bind "tcp://0.0.0.0:3000" # Bind to any IP address so we can connect over wifi from CNTRAL Mobile devices to the development machine's local IP address.
workers 1

To Reproduce
I tried reproducing this with the hello.ru app and my puma.rb file, but I couldn't reproduce it there; the page requests were always fast. So it may be application- or Rails-specific, but I can definitely reproduce it, and the issue first appears in 5.0.3.

5.0.2 is completely fine.

Expected behavior
Quick requests, as on 5.0.2.

Desktop (please complete the following information):

  • OS: Mac
  • Puma Version 5.0.3
  • Rails 5.2.4.4
  • Ruby 2.6.5

My apologies for the lack of information and reproducibility, but I've narrowed it down as best I could to a single patch release, so hopefully we can dig deeper from here.

@cjlarose (Member) commented Nov 7, 2020

I also notice that it takes much longer to quit / exit the server. Like 15 seconds instead of 0.5.

That issue is separately documented in #2398 and is fixed in puma master.

The slow requests problem seems separate, but difficult to debug without a reproducible test case. Have you tried running your application with puma on the master branch? It would be good to know if the issue is present there, too, or if it has since been fixed.

gem 'puma', github: 'puma/puma', branch: 'master'

@joshuapinter (Author) commented Nov 7, 2020

Thanks for the response, @cjlarose.

I tested both issues on master and they are both still present. The hanging on quit is not resolved, whatever may be causing it.

Just to give you a better idea of what I'm talking about, here's what I'm seeing, without giving anything too private away: Postman on the left, Terminal console log on the right.

To summarize:

  • There are no changes to the application or the request, other than changing the puma version and the workers configuration.
  • Single mode works on all versions.
  • Clustered mode works correctly on 5.0.2.
  • Clustered mode does not work correctly on 5.0.3 or on master.
  • By "not working correctly" there are two issues:
    1. The first request is fine but the subsequent request takes 30 seconds. (I've tested this and it doesn't even hit the app during that initial waiting period.)
    2. It takes 15 - 30 seconds to quit the server.

5.0.2

workers 1

Screenshot 2020-11-07 at 10 56 40

5.0.3

workers 1

Screenshot 2020-11-07 at 10 47 49

# workers 1

Screenshot 2020-11-07 at 10 55 26

master

workers 1

Screenshot 2020-11-07 at 09 41 03

# workers 1

Screenshot 2020-11-07 at 09 45 09

@joshuapinter joshuapinter changed the title Second Request Very Slow in Clustered Mode on 5.0.3 and 5.0.4. Second Request Very Slow in Clustered Mode on 5.0.3 and master. Nov 7, 2020
@joshuapinter joshuapinter changed the title Second Request Very Slow in Clustered Mode on 5.0.3 and master. Second Request Very Slow in Clustered Mode on 5.0.3, 5.0.4 and master. Nov 7, 2020
@cjlarose (Member) commented Nov 7, 2020

OK, so it seems like it might be a regression introduced in 5.0.3. Since we don't have a public reproducible test case, it would be helpful if you could perform a git bisect on puma. If you clone the puma repository locally, you can point your own application at your local copy, start a bisect, and let git tell you the commit where the regression was introduced.

After cloning the puma repository, in the puma directory,

git checkout v5.0.2
bundle install
bundle exec rake compile

In your application's Gemfile:

gem 'puma', path: '/path/to/your/local/puma'

Verify that you get the behavior you expect on v5.0.2. Then, in the puma directory:

git bisect start
git bisect good
git checkout v5.0.3

Restart your application and verify that it's bad.

git bisect bad

From that point, bisect should check out candidate commits until it narrows down the commit where the regression was introduced. Each step of the way, you can test your application and execute either git bisect good or git bisect bad.

Let me know if anything's not clear!

@joshuapinter (Author)

Will do! I didn't even know git bisect was a thing - neat!

I'll make some time this weekend to run through this and let you know my results.

Thanks for the help.

@cjlarose (Member) commented Nov 7, 2020

Ah okay! git bisect is super useful for this kind of thing! The docs on git-scm are probably better than my short tutorial if you're new to it: https://git-scm.com/docs/git-bisect

@joshuapinter (Author) commented Nov 13, 2020

Hi @cjlarose,

Sorry for the delay. I used git bisect to narrow down the commit where the regression was introduced - extremely handy tool btw!

Here are the results of the investigation: a76d390

a76d3905d89780840b8cf4407c51c1ba384812c7 is the first bad commit
commit a76d3905d89780840b8cf4407c51c1ba384812c7
Author: Will Jordan <wjordan@users.noreply.github.com>
Date:   Tue Oct 6 06:22:53 2020 -0700

    Reactor refactor (#2279)
    
    * Refactor Reactor and Client request buffering
    Refactor Reactor into a more generic IO-with-timeout monitor,
    using a Queue to simplify the implementation.
    Move request-buffering logic into Server#reactor_wakeup.
    Fixes bug in managing timeouts on clients.
    Move, update and rewrite documentation to match updated class structure.
    
    * Fix a few concurrency bugs
    - In `Reactor#shutdown`, `@selector` can be closed before the call to `#wakeup`, so catch/ignore the `IOError` that may be thrown.
    - `Reactor#wakeup!` can delete elements from the `@timeouts` array so calling it from an `#each` block can cause the array iteration to miss elements. Call @block directly instead.
    - Change `Reactor#add` to return `false` if the reactor is already shut down instead of invoking the block immediately, so a client-request currently being processed can continue, rather than re-adding to the thread-pool (which may already be shutting down and unable to accept new work).
    
    Co-authored-by: Nate Berkopec <nate.berkopec@gmail.com>

 History.md              |   2 +
 lib/puma/client.rb      |  27 ++-
 lib/puma/queue_close.rb |  24 +++
 lib/puma/reactor.rb     | 434 ++++++++----------------------------------------
 lib/puma/server.rb      | 202 +++++++++++-----------
 5 files changed, 221 insertions(+), 468 deletions(-)
 create mode 100644 lib/puma/queue_close.rb

I know very little about actually fixing this but I can reproduce it very readily now so let me know how I can help in that regard.

Thanks for looking into this!

Joshua

@dentarg (Member) commented Nov 13, 2020

@joshuapinter looks like great work with bisect!

Have you tried to reproduce this with a stripped-down version of your app that you can share?

I wonder what could be causing this; without a repro I feel we are still in the dark.

@joshuapinter (Author)

Hey @dentarg.

I hear you. That refactor commit is fairly large, so it's tough to nail down what might be the issue.

Our app is a fairly hefty Rails 5.2 app. Tough to break out without some considerable work.

Not sure if there's anything else I can test via binding.pry or puts statements.

If not, then I'll just have to put this on the back burner until I have more time to create a reproducible example.

@nateberkopec (Member)

paging @wjordan just in case something about this bug description + having your reactor refactor in mind immediately jumps out at you

@wjordan (Contributor) commented Nov 14, 2020

This will be hard to track down without narrowing down the issue to something we can reproduce locally. Any chance you can repro the actual requests to the rails app with curl at least, to rule out anything unusual going on with the browser-based requests?

@joshuapinter (Author) commented Nov 14, 2020 via email

@MSP-Greg (Member)

@joshuapinter

Thanks for investigating this. I've been working on a test framework rewrite; part of its purpose is to make it easier to write tests for issues like this.

Running Puma with 0:5 threads, two workers, and a 'Hello World' app, I created 20 threads, each creating 100 client sockets, each sending two requests spaced 200 ms apart. I started at 10 threads and all passed, but with 20, macOS was not happy. Anyway, Ubuntu had the following output. Note times are in seconds, totaled for all requests:

TestTwoRequests#test_two_requests = 
20 threads, each with 100 clients
389.27 1st time
 20.24 2nd time
  2000 1st qty
  2000 2nd qty

41.53 s = .

So, I can't see the long response time issue. If anything, it may indicate that the 'accept' processing slows down for some reason.

Obviously, something is causing your problem. One thing you might try: using workers 1 is uncommon, as the benefits of workers appear when there are two or more...
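
For reference, a rough Ruby sketch of the kind of two-request-per-socket test described above; it is not the actual test-framework code, and the host, port, request path, and thread/client counts are assumptions:

require "socket"

# Sketch only: 20 threads, each opening 100 keep-alive sockets, each socket
# sending two requests spaced 200 ms apart, timing the 1st and 2nd separately.
HOST = "127.0.0.1"
PORT = 3000
REQ  = "GET / HTTP/1.1\r\nHost: #{HOST}\r\n\r\n"

def timed_request(sock)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  sock.write REQ
  sock.readpartial(65_536)   # enough for a small 'Hello World' response
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

first  = Queue.new
second = Queue.new

threads = 20.times.map do
  Thread.new do
    100.times do
      TCPSocket.open(HOST, PORT) do |sock|   # socket reused for both requests
        first  << timed_request(sock)
        sleep 0.2                            # requests spaced 200 ms apart
        second << timed_request(sock)
      end
    end
  end
end
threads.each(&:join)

t1 = Array.new(first.size)  { first.pop }
t2 = Array.new(second.size) { second.pop }
printf "%8.2f 1st time\n%8.2f 2nd time\n%6d 1st qty\n%6d 2nd qty\n",
       t1.sum, t2.sum, t1.size, t2.size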

@joshuapinter (Author)

Thanks for the feedback @MSP-Greg.

Some new updates:

  • Changed to workers 4.
  • Can still reproduce issue.
  • Could not reproduce in cURL. Seems odd.
  • So I used Net::HTTP and was able to reproduce it (a sketch of that kind of script is included at the end of this comment).

Here's the kicker...

  • We upgraded from Ruby 2.6.5 to 2.7.2 during a Friday evening maintenance window, and I can't reproduce the issue on Ruby 2.7.2, not with Postman, cURL, or Net::HTTP.
  • Switching back to 2.6.5, the issue returns.

So, not sure what to tell you guys from here. I'm happy to close this now and see if anybody else encounters it, since it's not an issue on Ruby 2.7.2. But I'm also happy to test anything else for you if you want to drill deeper.

Let me know and thanks for the consideration. 🙏
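
For reference, a minimal sketch of the kind of Net::HTTP script mentioned above; the exact script was never shared in this thread, so the host, port, path, and request verb here are assumptions:

require "uri"
require "net/http"

# Sketch only: a few requests over one keep-alive connection, timing each.
# Swap in a POST with your payload if that is what reproduces it for you.
uri = URI("http://localhost:3000/some/endpoint")

Net::HTTP.start(uri.host, uri.port) do |http|   # one persistent connection
  3.times do |i|
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = http.get(uri.path)
    elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
    puts "request #{i + 1}: #{response.code} in #{elapsed.round(2)}s"
  end
end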

@MSP-Greg (Member)

Interesting. I ran CI using Ruby 2.5.7, 2.6.5, & 2.7.1, and results were similar to the above. Maybe I'll review the changelogs of the recent Ruby versions...

Puma isn't used as an isolated app, and it's obviously often used with Rails. If there's an issue with particular versions of Ruby/Rails/nio4r, etc., we're interested in knowing about them.

So, close if you'd like; @nateberkopec may have something to add. He's a Rails guru, I am not...

Thanks for investigating further.

@joshuapinter (Author)

Okay, scratch that. It is still happening on Ruby 2.7.2. It took a few more requests for the issue to re-appear, which is why I didn't notice it right away, but it's exhibiting the same issue.

We'll keep our puma version pinned to 5.0.2 but let me know what else I can do to help diagnose.

@dentarg (Member) commented Nov 15, 2020

It took a few more requests for the issue to re-appear

So it is not always the second request that is very slow?

@joshuapinter (Author)

@dentarg It appeared to always be the 2nd request when running Ruby 2.6.5, but with Ruby 2.7.2 it appeared to vary between the 2nd, 3rd, and 4th. It's strange, sorry. Terrible bug report. :/

@lucascaton

My 2 cents: I'm also having this problem (I'm using Ruby 2.7.2 and Rails 6.0.3.4). I ended up downgrading to Puma 5.0.2 for now.

Sorry for not having more information to help more, just wanted to share that @joshuapinter isn't the only one having problems with the latest versions 🙂

@joshuapinter (Author)

(Screenshot attachment: 200 (3))

@wjordan (Contributor) commented Nov 15, 2020

@joshuapinter just to clarify, the issue reproduces on 2.7.2 only with Postman/Net::HTTP (can you share the exact code used to reproduce with Net::HTTP?), but you still can't reproduce with curl at all?

Also, is it a specific request path/route that reproduces the issue and not any other routes? Even if the app is proprietary and you can't narrow down the code into something you can share publicly, is there anything else unique about the particular route that triggers the issue that you're able to share?

Finally, what's the specific Mac OS version you're seeing the issue on?

@lucascaton it would be helpful if you could share more specific detail on the problem you're experiencing:

  • OS (including specific OS version)
  • Server code used (if public; if not, you could try making a minimal test app that reproduces the issue and share that instead)
  • How the requests were made (curl, browser, Net::HTTP Ruby code, end-user reports from a web app, etc.)
  • Whether the issue reproduces consistently 100% of the time or if it's intermittent/random, and whether it's always on the second request after the server boots
  • Any puma/rails logs, browser network-request logs, etc. that you can capture and share

@MSP-Greg (Member)

Speaking of logs, adding log_requests (or quiet false) to the config file will add request logging that shows 'app' time. Since this seems related to the Puma version used, one would think the app isn't causing the issue, but logging requests would confirm that.
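
A minimal puma.rb sketch with that logging enabled; the bind and workers settings here are carried over from earlier in this thread, not prescribed:

# config/puma.rb (sketch): request logging turned on
bind "tcp://0.0.0.0:3000"
workers 4
log_requests true   # per-request log lines that include time spent in the app
# quiet false       # equivalent way to turn request logging back on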

Today I ran Puma as a separate process (not a subprocess as in CI) and ran three requests per client/socket. No real difference in times, and I was running loops of 800. IOW, I'm trying to repro, but I can't...

@vipulnsward (Contributor) commented Dec 4, 2020

My 2 cents: I'm also having this problem (I'm using Ruby 2.7.2 and Rails 6.0.3.4). I ended up downgrading to Puma 5.0.2 for now.

We seem to be seeing the same issue with this config (Ruby 2.7.2 and Rails 6.0.x). Being on puma 5.0.2 fixes it and makes puma "just run" in our development environment; only when we downgrade puma to < 5.x do all our issues go away.

Edit:

Just to note that we have been facing this issue for a while, but we thought it was something strange or something wrong with our own setup, etc., as it was sporadic. Noting this because many teams might be facing it but assuming something is wrong with their own setup and not reporting it, as there are no errors reported from the puma side.

@dentarg (Member) commented Feb 6, 2021

@lucascaton @vipulnsward @joshuapinter does the problem still exist in Puma 5.2? can you provide any more details? see #2484 (comment)

@lucascaton commented Feb 7, 2021

Hey @dentarg - I've just tested here, all good when using version 5.2!

UPDATE: After a few days running, the issue started again 😞

@joshuapinter (Author)

Hey @dentarg. I tested 5.2.1 and am still seeing the same issue. Here are the videos with puma 5.0.2 compared with puma 5.2.1.

5.0.2 (baseline)

5.0.2.mp4

5.2.1

5.2.1.mp4

The only thing changing between these two videos is the version of puma being used.

I know it's difficult to reproduce in an isolated case, but I can easily reproduce it on our side if there's anything else you want us to test or diagnose.

Thanks!

@MSP-Greg (Member) commented Feb 7, 2021

@joshuapinter

Thanks for the reply. In the mp4s above, is the only 'client' hitting Puma the 'client app' you show? If so, that means the request count is low, so could you please turn on Puma request logging?

I've been using Puma quite a bit locally, both testing with 'high volume' code-generated requests and as a local server for the files used for https://msp-greg.github.io/, over both tcp & ssl. I have never seen request timing like you're seeing.

I don't/can't speak for the other maintainers, but without a means to repro this, or more info, I can't be of any help...

JFYI, I've been working on a new test framework, which allows generating lots of requests. Just tested 10 threads, each creating 200 clients (all keep-alive), each client sending 10 requests. Tested with both 10 kB & 200 kB response bodies. With the 200 kB bodies, 95% of the responses completed in under 13 ms.

You're seeing timing that's orders of magnitude beyond that...

@joshuapinter (Author)

@MSP-Greg Yes, the only requests being sent are when I hit the "Send" button in Postman. As you can see, the first one is fast on both versions. All subsequent requests are fast on 5.0.2. But subsequent requests on 5.2.1 (it actually started in 5.0.3) are extremely slow. Additionally, shutting down 5.0.2 is fast, whereas 5.2.1 hangs for a while before quitting.

@MSP-Greg (Member) commented Feb 7, 2021

Thanks for the clarification. We don't need an mp4 of it, but if you turn on logging, what times are shown in the Puma logging for the slow requests?

@joshuapinter (Author)

@MSP-Greg This is with log_requests true and debug in the puma.rb configuration.

5.2.1.with.debug.mp4

@MSP-Greg (Member) commented Feb 7, 2021

@joshuapinter

Thanks, that helps. Most of the testing I've been doing is with GET requests; I'll try it with POST requests...

@dentarg (Member) commented Feb 7, 2021

@joshuapinter can you try the same thing again on Puma 5.2 but with max threads 2 or more? Thanks :)
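
A sketch of what that change might look like in puma.rb; the other settings are carried over from earlier in the thread and the exact thread counts are assumptions:

# config/puma.rb (sketch): same setup, but with at least two threads per worker
bind "tcp://0.0.0.0:3000"
workers 4
threads 2, 5   # minimum 2, maximum 5 threads per worker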

@dentarg (Member) commented Feb 7, 2021

@joshuapinter I also think it would be helpful if you could share thread-backtraces output from your app: before the first request, after the first request but before the second, while waiting on the second request, and after the second request (I'm not sure all those instances are important, but let's be thorough).

You can use pumactl, see #2398 (comment), or you can send SIGINFO to Puma (kill -SIGINFO <pid>, example)
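
A minimal Ruby sketch of the SIGINFO approach; the pidfile path is an assumption (use whatever PID puma's startup output reports), and SIGINFO only exists on macOS/BSD:

# Sketch: trigger puma's thread-backtrace dump from another Ruby process.
# Assumes puma was started with `pidfile "tmp/pids/puma.pid"`; in cluster mode
# you may need to signal the worker process instead (its PID appears in
# puma's startup output) to see the threads actually serving requests.
pid = File.read("tmp/pids/puma.pid").to_i
Process.kill("INFO", pid)   # puma logs the thread backtraces to its stdout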

Also, are you able to reproduce the issue with curl or is it only with Postman?

@joshuapinter (Author)

Thanks for the responses, I'll do my best to make some time this week for further testing and get you what you need.

Also, are you able to reproduce the issue with curl or is it only with Postman?

I wasn't able to reproduce with curl. Only Postman and raw Net::HTTP.

@dentarg (Member) commented Feb 8, 2021

@joshuapinter please share that Net::HTTP code if you can; I think it would be helpful.

@joshuapinter (Author)

@dentarg Definitely will.

@lucascaton commented Aug 1, 2021

Hi there. Just to let you know that this is still happening in Puma v5.4.0.
I'm deploying/running it via Dokku. After each request, Puma uses more memory. Perhaps there's a memory leak?

@dentarg (Member) commented Aug 1, 2021

@lucascaton can you provide the information that was asked about in #2484 (comment)?

@lucascaton

Hey @dentarg, sure!

OS (including specific OS version)

Ubuntu via Dokku, which is powered by Docker and herokuish.

❯ dokku run uname -a
Linux b94572e9248a 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Note: I’m running Puma in Cluster Mode

Server code used

The code isn’t public/open-source and unfortunately, I couldn’t reproduce a minimal test app 😞

How the requests were made?

Requests via browser by real users.

Whether the issue reproduces consistently 100% of the time or if it’s intermittent/random, and if it’s always on the second request after the server boots

It’s consistently 100% of the time but it’s never the 2nd request. In fact, for me, it takes 1-2 days for the issue to start happening.

Any puma/rails logs, browser network-request logs, etc that you can capture+share

I believe the reason the requests get very slow (at least in my case) is that Puma uses all available memory on the server (2 GB) plus its swap (4 GB). Using Puma 5.0.2 (the last version that didn't have this issue), this app uses ~750 MB, even after several days of running.

Please let me know if I can help with anything else.

@wjordan (Contributor) commented Aug 3, 2021

@lucascaton your issue sounds very different from the issue discussed in this thread. Further, I can only assume the memory leak is coming from a bug in your application/configuration and not Puma itself until you can reproduce an issue in code you can make public so others can verify.

@lucascaton

Hey @wjordan,

your issue sounds very different from the issue discussed in this thread

Well, I agree it's different (I wasn't sure when I wrote my first comment here) but it's affected by the very same versions of Puma.

I can only assume the memory leak is coming from a bug in your application/configuration and not Puma itself

I don't think so. If I upgrade Puma from version 5.0.2 to 5.0.3+ and nothing else, the issue starts happening. When downgrading it back to 5.0.2, everything goes back to normal. I've done this a few times now.

until you can reproduce an issue in code you can make public

I wish I could be more helpful here. I've been trying to reproduce it but it's not very easy since it takes 1-2 days (and/or several requests) for the issue to start happening. I was just reporting my case in case it helps somehow, because it's affected by the same versions described in this issue 🙂

I'll keep trying to reproduce the issue. In the meantime, please let me know if I can help with anything else.

@dentarg (Member) commented Mar 31, 2023

Anyone still having issues? Anyone tried Puma 6.x?

@joshuapinter (Author)

@dentarg Yes, just upgraded to 6.2.1 and seeing the same thing. First request is lightning quick and subsequent requests take 10+ seconds.

@dentarg (Member) commented Apr 11, 2023

@joshuapinter able to share how to reproduce? #2484 (comment)

@joshuapinter (Author)

@dentarg Unfortunately not at the moment. Slammed with other things, so we're still pinned at 5.0.2 for the time being. I will if things free up, but it's tough to find time these days.

@dentarg (Member) commented Aug 1, 2023

@joshuapinter any chance you can just dump the Net::HTTP code that you are able to reproduce this with, so others can continue this mystery?

@joshuapinter (Author)

@dentarg I'll do my best but time is super limited (as always!). This is still something I'd like to get fixed but it's not a "production" issue so it gets deprioritized. Will let you know if I can.
