[Experimental] Worker based expiration strategy. #447

Draft
camilo wants to merge 1 commit into main
Conversation

@camilo (Contributor) commented Mar 30, 2020

One of the things that makes IdentityCache (IDC) hard to run at scale is how much load it can
reflect onto the database during high-load events. Most of that load comes from over-fetching
during misses while the application using IDC is enduring a high-throughput event.

A possible solution is to change our cache population model. Currently IDC will try to find
an object (and its embedded associations) in memcached; if the object is not in the cache, IDC
will try to find the objects in the database. This simple mental model seems good enough for most
things, but it can cause a lot of fetches from multiple workers during a high-throughput event.
The problem can also be seen in memcached itself: IDC uses `cas` to avoid writing stale data, and
during high-throughput events the `cas` calls can become quite slow [I'll find some sample data to
back this up if this goes anywhere beyond a WIP patch; not meant to be merged].
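
For context, the current read-through population model looks roughly like the following. This is
a simplified sketch, not IDC's actual internals: the key format, the serialization, and the
`Record` model are all made up here, and a plain `set` stands in for the real `cas` dance.

```ruby
require "dalli"

# Simplified sketch of a read-through cache population model; not
# IdentityCache's real internals. Assumes a memcached server on
# localhost and an ActiveRecord-style `Record` model.
CACHE = Dalli::Client.new("localhost:11211")

def fetch_record(id)
  key = "idc:blob:record:#{id}"
  if (blob = CACHE.get(key))
    Marshal.load(blob)
  else
    # Cache miss: every worker that misses goes to the database, which
    # is where the over-fetching during high-load events comes from.
    record = Record.find(id)
    # The real implementation uses `cas` here to avoid writing stale
    # data when concurrent writers race; `set` keeps the sketch short.
    CACHE.set(key, Marshal.dump(record))
    record
  end
end
```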

Changing our cache population model is a big refactor, so I think we can do this in two steps:

  1. Change the cache expiration model
  2. Change the cache population model

This PR is a prototype to see what a worker-based expiration implementation looks like.

The idea is the following:

  1. Make the expiration strategy configurable (see the sketch after this list)
  2. In the worker-based expiration model, do nothing to expire blobs inline
  3. Implement a long-lived worker process that receives binlog-type events
  4. Test how fast Ruby can process incoming events; if it turns out to be too slow to keep
     up with high-throughput events, try different concurrency models (pre-fork, threads,
     evented?), implement the worker in native code, or implement the worker in a completely
     different language and move the blob -> key logic from Ruby to that language
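
To make points 1 and 2 concrete, here is a minimal sketch of what a pluggable expiration
strategy could look like. All of the names (`expiration_strategy`, `InlineExpiration`,
`NoopExpiration`) are hypothetical; nothing like this exists in IDC yet, and the inline
strategy below is only a stand-in for the current behavior.

```ruby
# Hypothetical sketch of points 1 and 2: a pluggable expiration
# strategy. None of these names exist in IdentityCache today.
module IdentityCache
  class InlineExpiration
    # Stand-in for the current behavior: delete the blob from
    # memcached in-process when a record changes.
    def expire(cache_key)
      IdentityCache.cache.delete(cache_key)
    end
  end

  class NoopExpiration
    # Worker-based model: do nothing inline; a separate long-lived
    # worker consuming binlog events performs the expiration instead.
    def expire(_cache_key)
      # intentionally a no-op
    end
  end

  class << self
    attr_writer :expiration_strategy

    def expiration_strategy
      @expiration_strategy ||= InlineExpiration.new
    end
  end
end

# An application opting into worker-based expiration would configure:
# IdentityCache.expiration_strategy = IdentityCache::NoopExpiration.new
```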

@camilo (Contributor, Author) commented Mar 30, 2020

cc @ignacio-chiazzo

# This is meant to run a long-lived process along these lines:
# parse args from the command line / env, then call
#
# IdentityCache::BinlogExpirationWorker.new(*args).run
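
For illustration, here is a skeleton of what that worker's run loop might look like. Only the
class name `IdentityCache::BinlogExpirationWorker` comes from the patch; the event source and the
blob -> key derivation below are placeholders.

```ruby
# Hypothetical skeleton of the worker above. The binlog client and the
# blob -> key derivation are placeholders; only the class name comes
# from the comment in this PR.
module IdentityCache
  class BinlogExpirationWorker
    def initialize(event_source, cache)
      @event_source = event_source # yields row-change events from the binlog
      @cache = cache               # memcached client responding to #delete
    end

    def run
      # Long-lived loop: for every row change, derive the IDC blob key
      # and expire it, instead of expiring inline in the application.
      @event_source.each_event do |event|
        @cache.delete(blob_key_for(event.table, event.primary_key))
      end
    end

    private

    # Placeholder for the "blob -> key" logic mentioned in point 4.
    def blob_key_for(table, id)
      "IDC:blob:#{table}:#{id}"
    end
  end
end
```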
@dylanahsmith (Contributor) commented:
In a previous hack day project @pushrax and I explored IdentityCache binlog-based expiration. It consisted of two parts: one to create expiration rules from the cached associations (https://github.com/Shopify/shopify/compare/identity_cache_expiry_rules) and a binlog reader that used those rules to invalidate the cache (https://github.com/Shopify/binlog-cache-invalidator).

@camilo (Contributor, Author) commented:
oh cool, thanks @dylanahsmith! @ignacio-chiazzo and I might take a stab at a second experiment using CDC instead of raw logs, but I think the code/principles can be adapted.

A contributor commented:
If CDC is more likely to be delayed (because the message needs to flow from the binlog to Kafka and then to the consumer), wouldn't we get more benefit from getting closer to the metal and consuming from the raw binlog?

@insom commented Apr 2, 2020:

Did someone say CDC? 😁 Anyway, we have an SLO [REDACTED] -- TL;DR: the Kafka+Sieve parts add around 500ms on top of the MySQL replication lag, and the MySQL replication lag is by far the single biggest contributor to the delay. We intend to make CDC read from writers to remove the MySQL replication lag, but we're not there yet.

That is the steady state; if there were an incident or a shop move then (for a given shop's data) you would see delayed records. That said, we're solving a bunch of hard problems for you, so doing things "raw" means having to keep state on the MySQL schema and (depending on your service) shop moves, etc.

@camilo (Contributor, Author) commented:
> If CDC is more likely to be delayed (because the message needs to flow from the binlog to Kafka and then to the consumer), wouldn't we get more benefit from getting closer to the metal and consuming from the raw binlog?

Why is that a problem? I think the fundamental problem with IDC is the assumption of freshness.
