
10X increase in read ops/sec after upgrade from v6 to v7 #684

Open

jkhulme opened this issue Feb 15, 2022 · 9 comments

@jkhulme commented Feb 15, 2022

Describe the bug
We upgraded sidekiq-unique-jobs from 6.0.24 to 7.1.12 (alongside a Sidekiq upgrade from 6.1.2 to 6.4.0).

Before the upgrade we were seeing ~300 reads/sec and ~300 writes/sec.
After the upgrade we are seeing ~3000 reads/sec, and still ~300 writes/sec.

This is a very large increase, and it is proving quite expensive with our Redis provider.

Expected behavior
Not to see a 10x increase in reads/sec

Current behavior
We are seeing a 10x increase in reads/sec.

Worker class
n/a

Additional context
Our unique jobs config:
"unique_jobs_debug"=>false,
"unique_jobs_reaper_type"=>"ruby",
"unique_jobs_reaper_count"=>1000,
"unique_jobs_reaper_interval"=>30,
"unique_jobs_reaper_timeout"=>20,
"unique_jobs_reaper_resurrector_interval"=>3600

Please let us know if there is any other info we can provide

@mhenrixon (Owner)

Sorry about the trouble. You can tweak the reaper settings to cut the reads a bit.

Most of the work (apart from the reaper) is done in Lua, so it shouldn't cause too much trouble.

@mhenrixon (Owner)

For example, reaping every 30 seconds seems a little overkill. I'd try reaping every 5-30 minutes instead.

Also, some people reported issues with the resurrector, which is why I default it to off. It wouldn't account for many reads, but it's another lever if you really need to cut down on reads.
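
A minimal sketch of those tweaks, assuming the reaper settings are exposed through SidekiqUniqueJobs.configure (option names here are inferred from the config dump above, so double-check them against the gem's README for your version):

```ruby
# config/initializers/sidekiq_unique_jobs.rb
# Sketch only: option names inferred from the reported config, not
# confirmed against any particular gem version.
SidekiqUniqueJobs.configure do |config|
  config.reaper_interval = 15 * 60 # reap every 15 minutes instead of every 30 seconds
  config.reaper_count    = 10_000  # let each run clear more locks to compensate
  # The resurrector is off by default; leaving it unset avoids its reads entirely.
end
```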

I've never heard of a provider that charges per read before, but I'm happy to optimize the gem 💎

@francesmcmullin (Contributor)

Hi 👋🏻 For context, @jkhulme and I both work on the same application which depends on your gem (thank you!). We have different deployments with different scales, but a lot of our usage patterns look similar.

I took some time today to examine our Redis reads, and I believe you're right that the largest increase is coming from the reaper. It fails while our Sidekiq queues are full, but once they clear we could have 300,000 locks to clean up (@jkhulme may have millions). Unfortunately, I don't expect reaping less frequently to help: in the scenario where we have 300k locks to clear, it's going to cost a lot of reads whether we clean 1k every 30 seconds or 10k every 5 minutes.

It's possible that switching to the Lua reaper would fix a lot of our problems, but I am concerned about its blocking behaviour, especially when we can build up such a large number of locks to be reaped.

Currently, when the reaper is iterating over a list of digests to find orphans, it performs the following operations per digest (a redis-rb sketch follows the list):

  • ZSCAN to check if the digest is scheduled
  • ZSCAN to check if the digest is retried
  • SSCAN to find sidekiq queues
  • for each sidekiq queue: (for us, this is 3)
    • LLEN for initial size
    • LRANGE (queue size / 50) times (if it's more than 1k the reaper will be too slow and fail anyway)
    • LLEN again if the queue had any entries (queue size / 50) times
  • SSCAN of sidekiq processes (this is usually just 1 for us, but I think @jkhulme has lots)
  • for each sidekiq process: (pipelined, so possibly counts as just one read)
    • EXISTS
    • HGETALL to find workers
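
To make the cost concrete, here is a hypothetical redis-rb rendition of that per-digest check. This is not the gem's actual code: the key names follow Sidekiq's conventions, and the reapable? helper is made up for illustration.

```ruby
require "redis"

# Hypothetical sketch of the per-digest orphan check described above.
# Returns true when no schedule/retry entry, queue entry, or live worker
# still references the digest.
def reapable?(redis, digest)
  # 2 x ZSCAN: does a scheduled or retried job payload contain the digest?
  return false if redis.zscan_each("schedule", match: "*#{digest}*").any?
  return false if redis.zscan_each("retry", match: "*#{digest}*").any?

  # SSCAN + LLEN + paged LRANGE: is the digest sitting in a queue?
  redis.sscan_each("queues") do |queue|
    size = redis.llen("queue:#{queue}")
    (0...size).step(50) do |offset| # page through in chunks of 50
      page = redis.lrange("queue:#{queue}", offset, offset + 49)
      return false if page.any? { |job| job.include?(digest) }
    end
  end

  # SSCAN + pipelined EXISTS/HGETALL: is a live process working on it?
  redis.sscan_each("processes") do |process|
    _alive, work = redis.pipelined do |pipe|
      pipe.exists?(process)
      pipe.hgetall("#{process}:work")
    end
    return false if work.values.any? { |payload| payload.include?(digest) }
  end

  true
end
```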

After that, it uses a batch delete process, and since #666 was merged it also avoids looping over every digest. But even if the reaper only ever ran the above for digests it was going to delete, that's roughly 12 reads per digest, so cleaning up 300k digests produces roughly 3.6M reads. Even spread evenly over 3 hours, that's still over 300 reads/second. In practice it's not spread out evenly, and we see peaks over 2k reads/second every minute or so (our reaper interval is a minute).

Looking at the list above, there are three operations which I suggest could be saved and re-used, at least within a given run of the reaper:

  • SSCAN to find sidekiq queues
  • LLEN of each queue for initial size
  • SSCAN of sidekiq processes

That would get us from 12 down to 9. Ideally, we would also find some way to batch the other operations. For example, for the two ZSCANs, we might test an entire batch of keys together by first storing them in a set (e.g., to_be_reaped) and using ZDIFFSTORE to get the keys which are not present in either the scheduled or retried sets. That would bring us from 9 down to 7, nearly cutting reads in half.
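
A rough sketch of the ZDIFFSTORE idea (it needs Redis 6.2+ and a redis-rb version that exposes zdiffstore, and it assumes digests could be compared directly against the schedule and retry sorted sets, which in practice store full job payloads rather than bare digests):

```ruby
# Sketch only: stage the current batch of candidate digests in a temporary
# zset, then let Redis subtract the scheduled and retried sets in one command.
redis.zadd("to_be_reaped", digests.map { |digest| [0, digest] })
redis.zdiffstore("reapable", %w[to_be_reaped schedule retry])
reapable = redis.zrange("reapable", 0, -1) # digests safe to delete
redis.del("to_be_reaped", "reapable")      # clean up the staging keys
```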

@mhenrixon If I make a PR roughly following the strategy above, would that be helpful? I'm happy to make any further changes you suggest in terms of style/architecture/specs etc.

@jkhulme do you think a 50% reduction would get us back within the range of acceptable read traffic?

@mhenrixon (Owner)

If you have any suggestions for how to improve performance, I am wide open to them, @francesmcmullin. The only thing that is extremely important is to make sure that Redis version compatibility isn't broken without a major release.

I'm open to increasing the required Redis version, but I am not sure I have test coverage for the various versions.

@francesmcmullin (Contributor) commented Mar 9, 2022

Thanks for the pointer! I foolishly ignored the version number of ZDIFFSTORE. I don't think it would be appropriate to require Redis 6.2.0 (and therefore a major bump to sidekiq-unique-jobs 8) only to fix this issue, elegant though that solution would be.

I have a very rough changeset with passing specs for reducing the SSCAN and LLEN calls, but having looked at the problem more closely I'm interested in zooming out a little.

I think the root of the problem, for @jkhulme and me, is that we use an enormous number of until_expired locks, and they don't clean themselves up. For every other kind of lock, as far as I can tell, the normal path is for the lock to be released in response to some event (a job completing, or starting), and the reaper is only involved in unusual/problematic cases. For until_expired, however, most of the lock has a TTL and expires after the given time; the only exception is the digest, which cannot be expired individually as a member of a sorted set. I suggest that the reaper is not an efficient method for deleting locks as standard: it serves very well for cleaning up unusual cases, but it is too expensive to be the normal path.

By its nature, the reaper must thoroughly check whether digests are queued, retrying, active or scheduled, because it is a safety-net cleanup system that knows nothing about the locks behind those digests. None of that is necessary for an until_expired lock/digest, though. This leads me to some more controversial questions...

  • What is the purpose of the digests zset? Is it purely for transparency/logging? How is it superior to scanning for actual locks?
  • What if we didn't store until_expired digests in the usual digests zset? They could be omitted entirely, or kept in a separate zset with different rules - I suggest setting the score to the expiry time rather than the creation time (see the sketch after this list).
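
A sketch of that second idea, with the score set to the expiry time so that reaping collapses into a single range delete (the key name uniquejobs:expiring_digests is made up here, not the gem's):

```ruby
# When an until_expired lock is taken: record the digest with its expiry
# timestamp as the score.
redis.zadd("uniquejobs:expiring_digests", Time.now.to_i + ttl, digest)

# On each reaper run: one ZREMRANGEBYSCORE drops every expired digest,
# replacing the per-digest queue/schedule/retry/process checks entirely.
redis.zremrangebyscore("uniquejobs:expiring_digests", "-inf", Time.now.to_i)
```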

For us, I'm pursuing a simpler SETNX lock for our highest volume use case, rather than using until_expired, but I'd love to switch back if we could be confident the locks would be cleaned up without excessive extra load.
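
For reference, the simpler lock we're moving to is roughly the following: SET with NX and EX rather than the literal SETNX command, since that sets the TTL atomically. The key naming and helper are ours, not the gem's.

```ruby
# Acquire an expiring lock: succeeds only if the key is absent (NX), and
# the key expires on its own (EX), so no reaper run is ever needed.
# Returns true if we won the lock, false otherwise.
def acquire_expiring_lock(redis, digest, ttl_seconds)
  redis.set("ourapp:lock:#{digest}", "1", nx: true, ex: ttl_seconds)
end
```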

@mhenrixon (Owner)

@francesmcmullin, there is nothing per se keeping us from having all the digests in one place. I have the digests set to give some insight into what's going on without having to loop through all keys.

I'll give this some thought. It would be pretty sweet to use a more efficient reaper. Perhaps you could contribute one suitable for your use case, and we can prefer it when the Redis version is appropriate and fall back to the existing one if not?

I would probably duplicate the existing ruby reaper and make changes to a separate class to avoid adding if statements in the existing one.

@mhenrixon (Owner)

@francesmcmullin I'd love to get some improvements into the gem. Did you make any more improvements? Any PR you could open? I appreciate your support!

@francesmcmullin (Contributor)

Hi @mhenrixon 👋🏻 Thanks for the nudge! I'll open a few PRs today. Please treat them as suggestions only; I won't mind if they need more work or don't fit with the gem's broader use cases.

@mhenrixon (Owner)

@francesmcmullin you've really inspired me. I will reduce the complexity considerably and create different Lua scripts and different sets for different locks.

I believe you just made v8 happen :)
