Non-actionable warnings about RTT #4851

Closed
MGPalmer opened this issue Mar 24, 2021 · 17 comments


@MGPalmer

MGPalmer commented Mar 24, 2021

Ruby version: 2.7.2
Rails version: 6.1
Sidekiq / Pro / Enterprise version(s): 6.2.0

sidekiq.yml:

---
:verbose: true
:concurrency: <%= ENV.fetch('SIDEKIQ_CONCURRENCY', 5) %>

# Set timeout to 8 on Heroku, longer if you manage your own systems.
:timeout: 8
:queues:
  - ['critical', 6]
  - ['high', 4]
  - ['default', 2]
  - ['low', 1]


Hello!
I'm a little worried about the recently introduced RTT warnings: #4824
I've noticed this warning showing up in our logs several times a day, usually with values in this range:

Mar 23 23:32:58 mge-application app/worker.1 pid=4 tid=2lps WARN: Your Redis network connection is performing extremely poorly.
Mar 23 23:32:58 mge-application app/worker.1 Current RTT is 95404 µs, ideally this should be < 1000.
Mar 23 23:32:58 mge-application app/worker.1 Ensure Redis is running in the same AZ or datacenter as Sidekiq.

However, usually the RTT is somewhere between 800 and 5000. The thing is, we're on Heroku and have basically no control over the Redis instance except the plan size (we have a medium-range "premium-5" instance). Sidekiq jobs seem to be handled reliably and speedily, no complaints.

So this warning is currently just noise to us. Is there a way to turn them off? Or am I missing something important here?

Thanks!
(BTW, love Sidekiq and your work, thanks :))

@jwilsjustin

Yeah. For deployments on Heroku there is no way for us to have a guaranteed AZ. See here. Maybe logging this as INFO would still suffice for those who want it?

@MGPalmer
Author

Ah, thanks. Hmm, the principle is fine :) Off the top of my head, I would like to keep the warnings but configure the threshold for our case. Making it possible to override RTT_WARNING_LEVEL via an ENV var and/or sidekiq.yml would be great for us.

@mperham
Collaborator

mperham commented Mar 24, 2021

Perhaps I shouldn't be taking one reading and WARNing based on it. I should be taking 3-5 readings over 30 seconds before logging anything; that would minimize log noise due to transient spikes.

I avoid config switches as they add code complexity.

@jwilsjustin

+1 for that, @mperham.

@MGPalmer
Author

I guess that would work for us, too. Today's warnings, for example, are usually anywhere from a few minutes to an hour apart.

@PhilCoggins

PhilCoggins commented Mar 30, 2021

I am also on Heroku and have just started to notice these in my logs. Some of the values are very high:

Mar 29 07:14:46 fleetio app/sidekiq.2: Current RTT is 16703645 µs, ideally this should be < 1000.

If I'm not mistaken, this is a 16-second ping (not a full request) from my Sidekiq server to Redis? I have opened a support request with Heroku, as this is pretty bad.

Would it be reasonable to correlate consistently high values with ERROR: heartbeat: Connection timed out? And is it possible for jobs to be dropped when seeing these errors?

UPDATE: I averaged the RTT values in my logs over the past 24 hours and came up with 247302.
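
For anyone who wants to repeat this kind of check, here is a minimal sketch of pulling RTT readings out of log lines and summarizing them. It assumes the log format shown in this thread ("Current RTT is N µs"); the helper names `rtt_samples` and `summarize_rtt` are made up for illustration and are not part of Sidekiq.

```ruby
# Hypothetical helpers (not part of Sidekiq): summarize RTT warnings found in logs.
# Assumes each warning contains "Current RTT is <integer> µs", as in the lines above.
RTT_PATTERN = /Current RTT is (\d+) µs/

# Returns the RTT readings (in microseconds) found in the given lines.
def rtt_samples(lines)
  lines.filter_map { |line| line[RTT_PATTERN, 1]&.to_i }
end

# Returns a small summary hash, or nil when no readings were found.
def summarize_rtt(lines)
  samples = rtt_samples(lines)
  return nil if samples.empty?

  {
    count: samples.size,
    mean_us: samples.sum / samples.size,      # integer mean, in microseconds
    max_us: samples.max,
    max_seconds: samples.max / 1_000_000.0    # 1 s = 1,000,000 µs
  }
end

log = [
  "Mar 23 23:32:58 mge-application app/worker.1 Current RTT is 95404 µs, ideally this should be < 1000.",
  "Mar 29 07:14:46 fleetio app/sidekiq.2: Current RTT is 16703645 µs, ideally this should be < 1000.",
  "Mar 29 07:14:46 fleetio app/sidekiq.2: an unrelated log line"
]

summary = summarize_rtt(log)
puts summary.inspect
```

The `max_seconds` field makes the unit conversion explicit: 16703645 µs is roughly 16.7 seconds. Note that only logged warnings are counted, so a mean computed this way is biased toward the slow readings.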

@mperham
Collaborator

mperham commented Mar 30, 2021

@PhilCoggins That's awful. If you are seeing consistently poor performance, I would explain to Heroku Support about the poor latency and ask them to fail you over to a new Redis instance. Something is terribly wrong with that one.

mperham added a commit that referenced this issue Mar 30, 2021
@mperham
Collaborator

mperham commented Mar 30, 2021

I've updated master to take 5 samples and only warn if all five samples are above the threshold.
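
The idea of warning only when every one of the last five samples exceeds the threshold can be sketched roughly as follows. This is an illustration only, not the code from the referenced commit: the `RttMonitor` class, its method names, and the 50,000 µs threshold are all assumptions made for the example.

```ruby
# Hypothetical sketch (not Sidekiq's actual implementation): keep a sliding
# window of recent RTT readings and flag a problem only when the window is
# full and every reading in it is above the threshold.
class RttMonitor
  def initialize(threshold_us:, window: 5)
    @threshold_us = threshold_us
    @window = window
    @readings = []
  end

  # Records one reading (in microseconds). Returns true only when the last
  # `window` readings all exceed the threshold, false otherwise.
  def record(rtt_us)
    @readings << rtt_us
    @readings.shift if @readings.size > @window
    @readings.size == @window && @readings.all? { |r| r > @threshold_us }
  end
end

# The threshold value here is an assumption chosen for illustration.
spiky = RttMonitor.new(threshold_us: 50_000)
spiky_results = [800, 95_404, 5_000, 900, 1_200].map { |r| spiky.record(r) }

sustained = RttMonitor.new(threshold_us: 50_000)
sustained_results = Array.new(5, 60_000).map { |r| sustained.record(r) }
recovered = sustained.record(900)

puts spiky_results.inspect
puts sustained_results.inspect
puts recovered
```

With this approach a single transient spike such as 95404 µs never triggers a warning by itself, while five slow readings in a row do; one fast reading is enough to clear the condition again.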

@mperham mperham closed this as completed Mar 30, 2021
@MGPalmer
Author

Thanks everyone!

@edmorley

edmorley commented Apr 6, 2021

@mperham I don't suppose it would be possible to publish a new sidekiq release to pick up 5b94bfe? A number of customers are opening tickets, and presumably many (if not all) are due to transient spikes rather than consistently slow RTT.

@mperham
Collaborator

mperham commented Apr 6, 2021

@edmorley Can you explain more? Is there some aspect that makes this high priority? I have one other thing I'm still looking into but it's possible I can release later this week.

@edmorley

edmorley commented Apr 6, 2021

@mperham Just that the message in 6.2.0 can be the result of a temporary false positive, rather than a consistently high RTT, and the new sampling approach will eliminate the noise from those. Customers open tickets with "sidekiq says there is a problem with my Redis instance", and after investigation there is no issue with the Redis instance, and the ping is typically low.

@mperham
Collaborator

mperham commented Apr 6, 2021 via email

@mperham
Collaborator

mperham commented Apr 8, 2021

6.2.1 is out.

@edmorley

edmorley commented Apr 8, 2021

Thank you :-)

@soma

soma commented Apr 8, 2021

❤️

@MGPalmer
Author

MGPalmer commented Apr 8, 2021

Looks like it's working :) Thanks!
