Variable refresh #504

Open · ben-manes opened this issue Feb 21, 2021 · 39 comments
@ben-manes
Owner

ben-manes commented Feb 21, 2021

Currently a fixed duration in refreshAfterWrite is supported. An evaluation per entry, similar to expireAfter(Expiry), would provide a more flexible approach. In #261 this would evaluate on create, read, and update. In #498, it was also requested to evaluate on the exceptional case, where the entry remains unchanged.

Implementation notes

A quirk in Guava (and replicated in Caffeine for compatibility) is that expireAfterWrite and expireAfterAccess may both be set and honored. As refreshAfterWrite piggybacks on the write timestamp, we would have to extend the codegen to 3 timestamps. That isn't desirable or useful, so instead we should restrict this specific combination and reuse the 2 timestamp variations already generated (e.g. for expireAfter and refreshAfterWrite). Setting both fixed and variable refresh should be disallowed. This approach only requires retrofitting methods onto existing generated cache entries to map to the correct timestamp field.
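
For illustration, a minimal usage sketch of today's fixed configuration next to the requested per-entry form. The variable method name and Refresher type below are placeholders for a not-yet-designed API, and Key, Value, and fetch are stand-ins:

LoadingCache<Key, Value> cache = Caffeine.newBuilder()
    .refreshAfterWrite(Duration.ofMinutes(5)) // fixed interval, today's API
    .build(key -> fetch(key));

// Proposed: evaluated per entry on create, read, and update,
// analogous to expireAfter(Expiry); the exact signature is open.
// Caffeine.newBuilder().refreshAfter(refresher).build(key -> fetch(key));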

This issue consolidates #261, #272, #360, and #498.
@denghongcai @eugenemiretsky, @anuraaga, @musiKk, @minwoox, @ikhoon, @barkanido, @wcpan

@StephenFlavin

StephenFlavin commented Apr 16, 2021

Hi Ben,
Apologies, I'm new to this repo and it appears this issue has a long history.
I came looking at Caffeine hoping there would be an expireAfterCondition feature; is that what you're hoping to create, or is this purely for time-based expiration?
Happy to raise a separate issue for this functionality to explain my use case, etc.

Edit: #536

@ben-manes
Owner Author

Hi @StephenFlavin,

You can open another issue or a discussion topic. All of our expiration / refresh operations are time-based. Previously these were on fixed schedules, e.g. 5 minutes from the entry's last write. For expiration it is now a flexible time-based schedule, which can vary per entry. The request here is for a similar capability to control when a refresh is triggered, based on a custom time-based schedule. If you have non-time-based conditions, then that should be a different thread.
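
For contrast, here is a rough sketch of the existing variable expiration that this issue wants to mirror for refresh. The isPinned domain method is an illustrative assumption, not part of Caffeine:

Cache<Key, Value> cache = Caffeine.newBuilder()
    .expireAfter(new Expiry<Key, Value>() {
      @Override public long expireAfterCreate(Key key, Value value, long currentTime) {
        // per-entry duration, in nanoseconds
        return value.isPinned() ? Long.MAX_VALUE : TimeUnit.MINUTES.toNanos(5);
      }
      @Override public long expireAfterUpdate(Key key, Value value,
          long currentTime, long currentDuration) {
        return currentDuration; // retain the remaining time
      }
      @Override public long expireAfterRead(Key key, Value value,
          long currentTime, long currentDuration) {
        return currentDuration; // no time-to-idle behavior
      }
    })
    .build();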

@cruftex
Contributor

cruftex commented Nov 1, 2021

cache2k supports variable time refresh via its ExpiryPolicy. There is also a ResiliencePolicy to handle the exceptional case differently. There are still shortcomings. Recently I had an idea for a RefreshPolicy that looks quite promising. See the issue comment. I am commenting for some inspiration as to where things might lead.

Coming up with a good policy interface might be challenging. It needs to be combined somehow with the expiry policy, since the expiry in Caffeine controls when the entry is finally removed from the cache (if my understanding is correct). A larger refresh interval would typically make sense only with a larger expire-after-write interval. In cache2k I don't have that problem since the expiry time is logically the same time at which the entry needs refreshing. One timestamp or timer is sufficient.

A possible approach might be to disallow the existing ExpiryPolicy and introduce another policy method that is called once the refresh timer goes off to determine the final expiry time if no access happens.

Hopefully that is helpful and makes sense; I am not too deep into every Caffeine detail.

@cruftex
Contributor

cruftex commented Feb 11, 2022

Meanwhile I have spent some more thought on this and started hacking on an improved refresh-ahead infrastructure in cache2k.

At the moment my line of thought is that the variable refresh needs to be coupled with the expiry policy and would calculate a refresh time, depending on the expiry time, so basically it is a function of the expiry time:

long refreshTime(long expiry);

The method would be called after the expiry is calculated.

If the method only calculates the refresh time as a percentage of the expiry, then a simple parameter would be the better option. For example, Oracle Coherence uses a percentage parameter: https://docs.oracle.com/cd/E14526_01/coh.350/e14510/readthrough.htm#COHGS192

For a good design and separation of concerns I don't think the variable refresh policy should get the actual data, key and value, because the expiry policy should take the lead here.

Now the question is: What smart things could a policy do besides calculating a percentage?
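
As a strawman, a percentage-based policy under that signature might look like the following sketch, assuming expiry is an absolute timestamp in milliseconds and 75% is the chosen refresh-ahead factor:

long refreshTime(long expiry) {
  long now = System.currentTimeMillis();
  // refresh once 75% of the remaining lifetime has elapsed
  return now + (long) ((expiry - now) * 0.75);
}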

@tsutsu

tsutsu commented Mar 25, 2022

It seems strange to me that refresh and expire policies are calculated out-of-band of loading. I have a strong suspicion that the only real use-case for per-entry caching policies is 1. with a LoadingCache where 2. policies are computed/discovered as a byproduct of the load process.

In such a case, I'd expect a very different API: one where a CacheLoader isn't expected to directly return the value to be cached, but instead returns a proxy object — a CacheEntry holding at least the value, but also optionally expire and/or refresh policies, e.g.

return CacheEntry.complete(value).expireAfter(dur1).refreshAfter(dur2);

With such a CacheEntry API, policies could also be set for load failures:

// load failed! don't try loading again for at least dur2
return CacheEntry.completeExceptionally(new FooException(...)).expireAfter(dur1).refreshAfter(dur2);

...which would allow for "persistent exceptional states", where calling cache.get(k) before the cache period expires would get a cached exception thrown at the caller, without the CacheLoader being called at all.

This basically mirrors how HTTP caching works, where the caching policy is set by the HTTP response headers.

@ben-manes
Owner Author

ben-manes commented Mar 25, 2022

Yes, that would make for a more complete cache loader to control all policy aspects. However, that becomes cumbersome to work with as it includes expiration, refresh, and weight semantics. For reuse that would then fall onto the implementor to determine shared abstractions, e.g. a memory weigher, since it is all bundled into a single callback in that api. In that approach the cache won't know upfront what features are being used. That unfortunately means that an implementation has to enable all of the policies, with memory overhead like per-entry fields. The separation of concerns makes the full, rich case harder but provides more focused apis and allows for internal optimizations.

@ben-manes
Owner Author

ben-manes commented Apr 3, 2022

Here is my thought process for the interface design as I am sketching it out. I believe the ✔️ cases handle the requirements and concerns expressed here and on the prior discussions.

A lambda-friendly approach ❌

interface Refresher<K, V> {

  /**
   * Returns the duration after which the entry is eligible for an automatic refresh.
   * 
   * @param entry the cached key-value pair with its metadata
   * @param oldValue the value prior to the entry being updated, or null if the initial mapping
   * @param exception the exception that occurred during reload, or null if none 
   */
  Duration refreshAfter(Policy.CacheEntry<K, V> entry, 
      @Nullable V oldValue, @Nullable Throwable exception);
}

This has the nice property of being succinct, flexible, and lambda / method-ref friendly. This method supports create, update, reload success, and reload failure. It does not support updating the refresh interval after a read, whereas expiration does support this time-to-idle option. But I don't think there is much (or any?) use-case to justify handling that scenario.

Unfortunately Expiry uses long return values due to being called very frequently, e.g. expireAfterRead, so without value types the allocation of a duration was undesirable. Here since only one refresh per entry may be in-flight it will be invoked much less frequently, so we can safely prefer usability over concerns about GC churn. Therefore the much nicer Duration is acceptable.

Similarly the CacheEntry allocation is acceptable for the above reasons. This exposes the current mapping and its metadata, so that the interval can be responsive to the current expiration time.

A flaw in this approach is that while users can differentiate between operations, it is not very obvious. See for example,

Duration refreshAfter(Policy.CacheEntry<K, V> entry, V oldValue, Throwable exception) {
  if (exception != null) {
    // Backoff and retry soon
    return Duration.ofSeconds(30);
  } else if (oldValue == null) {
    // created; half remaining time-to-live
    return entry.expiresAfter().dividedBy(2);
  } else if (entry.refreshableAfter().isNegative()) {
    // automatic reload; half remaining time-to-live
    return entry.expiresAfter().dividedBy(2);
  }
  // explicit update; do nothing
  return entry.refreshableAfter();
}

A richer type ✔️

Therefore, it seems more idiomatic to define a type where each operation is clearly laid out. The reload success / failure cases in refreshAfterReload could be combined by having nullable parameters, similar to CompletableFuture#handle((r, e) -> r). However, since we're already defining a non-functional interface to split out the operations, it seems appropriate to call them out directly.

interface Refresher<K, V> {

  /**
   * Returns the duration after which the entry is eligible for an automatic refresh after the
   * entry's creation.
   * 
   * @param entry the cached key-value pair with its metadata
   */
  Duration refreshAfterCreate(Policy.CacheEntry<K, V> entry);

  /**
   * Returns the duration after which the entry is eligible for an automatic refresh after the
   * replacement of the entry's value due to an explicit update.
   * 
   * @param entry the cached key-value pair with its metadata
   * @param oldValue the value prior to the entry being updated
   */
  Duration refreshAfterUpdate(Policy.CacheEntry<K, V> entry, V oldValue);

  /**
   * Returns the duration after which the entry is eligible for an automatic refresh after the
   * replacement of the entry's value due to a reload.
   * 
   * @param entry the cached key-value pair with its metadata
   * @param oldValue the value prior to the entry being updated
   */
  Duration refreshAfterReload(Policy.CacheEntry<K, V> entry, V oldValue);

  /**
   * Returns the duration after which the entry is eligible for an automatic refresh after the
   * value failed to be reloaded.
   * 
   * @param entry the cached key-value pair with its metadata
   * @param exception the exception that occurred during reload
   */
  Duration refreshAfterReloadFailure(Policy.CacheEntry<K, V> entry, Throwable exception);
}

The previous example implementation is then defined as,

Duration refreshAfterCreate(Policy.CacheEntry<K, V> entry) {
  // created; half remaining time-to-live
  return entry.expiresAfter().dividedBy(2);
}

Duration refreshAfterUpdate(Policy.CacheEntry<K, V> entry, V oldValue) {
  // explicit update; do nothing
  return entry.refreshableAfter();
}

Duration refreshAfterReload(Policy.CacheEntry<K, V> entry, V oldValue) {
  // automatic reload; half remaining time-to-live
  return entry.expiresAfter().dividedBy(2);
}

Duration refreshAfterReloadFailure(Policy.CacheEntry<K, V> entry, Throwable exception) {
  // backoff and retry soon
  return Duration.ofSeconds(30);
}

Handling retries ✔️

I don't think it is the responsibility of the cache to handle this natively, as long as we make it possible for users to implement the behavior that is required.

Most cases can be handled by a more general library like Failsafe, by making the reload logic support backoff and retries. That can be implemented either synchronously or asynchronously by overriding the corresponding reload method in CacheLoader. For example,

CompletableFuture<V> asyncReload(K key, V oldValue, Executor executor) {
  RetryPolicy<V> retryPolicy = RetryPolicy.builder()
      .onFailedAttempt(attempt -> logger.warn("oops", attempt.getLastException()))
      .withDelay(Duration.ofSeconds(10))
      .withMaxAttempts(3)
      .build();
  return Failsafe.with(retryPolicy).getAsync(() -> ...);
}

A possible concern is that this does extend the in-flight time of the reload, making it more likely to reach the expiration threshold and be automatically removed. In that case one could track the retries separately by maintaining a key-to-retry-count mapping in a separate Map<K, Integer>. It would require some coordination, e.g. removing the retry mapping upon success or if the entry was removed. That approach requires modifying the policy times, refresh's return value, error suppression, etc., making it intertwined with all of the configured behavior. That's possible at the outer class that configures the cache (refresh policy, cache loader, removal listener, etc.), as the behavior is not easily separated out into its own concept and is deeply tied to the business logic's requirements.
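
A rough sketch of that out-of-band bookkeeping, with illustrative names (retries and load are assumptions; the removal listener would also call retries.remove(key) to avoid leaks):

ConcurrentMap<K, Integer> retries = new ConcurrentHashMap<>();

CompletableFuture<V> asyncReload(K key, V oldValue, Executor executor) {
  return CompletableFuture.supplyAsync(() -> load(key), executor)
      .whenComplete((value, error) -> {
        if (error == null) {
          retries.remove(key); // success clears the counter
        } else {
          retries.merge(key, 1, Integer::sum); // count the failed attempt
        }
      });
}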

@Meta39

Meta39 commented Jul 11, 2022

Hello, my English is not very good. I hope the translation is correct. Can you set the expiration time when setting key-value pairs? For example: cache.put(key, value, overtime);

@ben-manes
Owner Author

Yes. You must use Caffeine.expireAfter(expiry) so that the per-entry duration is evaluated, e.g. for a loading cache. The Cache.policy() has extensions like put(k, v, duration) if you need more explicit control.
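
A small sketch of the policy extension, assuming the cache was built with expireAfter(expiry):

cache.policy().expireVariably().ifPresent(expiration ->
    expiration.put(key, value, Duration.ofMinutes(30)));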

@krjackso

I am very interested in a variable refresh feature - The solution proposed here would work for us: #504 (comment)

Our use case is that we want to enable preloading about 100k entries in the cache, but do not want to cause a stampede at refresh time which would be very impactful to the downstream Redis providing the data. My ideal implementation would just allow us to spread out refreshes evenly across keys using refreshAfterCreate (probably using a random amount of time) but then use a static refresh duration thereafter.

Is there any way to approximate this behavior using the current abilities of expire, refresh, or bulk loading? I am wondering if there is a way to take advantage of bulk loading to at least reduce the number of individual get calls we are doing on refresh (we would use mget); that would likely get us close enough to what we want performance-wise. I don't see any clear indicator in the docs of whether refreshes use the bulk methods.

@ben-manes
Owner Author

My ideal implementation would just allow us to spread out refreshes evenly across keys

The simplest way to do this is to delay the future by a random amount. A refresh is triggered as a reaction to a lookup, which happens at an unpredictable time, so having a fuzzy refreshAfterCreate could still lead to a stampede. For example, consider a bulk query of 1,000 keys with a variable refresh of 4-6 minutes and an expiration of 10 minutes. If the entries were preloaded and the query comes at the 8 minute mark, they all trigger a reload and the fuzzy value was not helpful. Therefore, making it fuzzy on the loading side via asyncReload would work better. This can be easily achieved today:

public CompletableFuture<V> asyncReload(K key, V oldValue, Executor executor) {
    // Fuzzy delay of 0-30 seconds before executing the load
    Executor delayedExecutor = CompletableFuture.delayedExecutor(
        ThreadLocalRandom.current().nextLong(30_000), TimeUnit.MILLISECONDS, executor);
    return CompletableFuture.supplyAsync(() -> load(key), delayedExecutor);
}

I am wondering if here is a way to take advantage of bulk loading to at least reduce the amount of individual get calls we are doing

This is where coalescing is helpful and an implementation was kindly contributed as an example. In this case the cache loader accumulates individual requests up to a time threshold, performs a single bulk load, and completes the futures. This adds a delay, so some have implemented that feature for only reload operations to avoid increasing their user-facing latencies. This capability works better than having a bulk reload method because it can also accumulate individual calls which would be more common.
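
As a rough illustration of the idea (not the contributed example itself), a coalescing loader might buffer keys for a short window and issue a single batch call. Here batchLoad is an assumed stand-in for something like a Redis MGET, and the 50ms window is arbitrary:

final class CoalescingLoader<K, V> implements AsyncCacheLoader<K, V> {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final Function<Set<K>, Map<K, V>> batchLoad;
  private Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

  CoalescingLoader(Function<Set<K>, Map<K, V>> batchLoad) {
    this.batchLoad = batchLoad;
  }

  @Override public synchronized CompletableFuture<V> asyncLoad(K key, Executor executor) {
    if (pending.isEmpty()) {
      // first key of a new batch; flush whatever accumulates within the window
      scheduler.schedule(this::flush, 50, TimeUnit.MILLISECONDS);
    }
    return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
  }

  private void flush() {
    Map<K, CompletableFuture<V>> batch;
    synchronized (this) {
      batch = pending;
      pending = new LinkedHashMap<>();
    }
    try {
      Map<K, V> results = batchLoad.apply(batch.keySet()); // one bulk call
      batch.forEach((key, future) -> future.complete(results.get(key)));
    } catch (RuntimeException e) {
      batch.forEach((key, future) -> future.completeExceptionally(e));
    }
  }
}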

I am very interested in a variable refresh feature - The solution proposed here would work for us

Currently, it seems as if variable and bulk refresh as originally described in these issues would not truly solve the actual user problems. The delayed and coalescing reloads better fit the use-cases and use existing apis. We don't offer a canonical implementation of them since needs can vary and it is a bit out of scope for a caching library (as the same problem might be useful for any data store or external api). Since implementing it as cache builder options might give a false impression of what those capabilities would solve, I have not implemented what could be a mistaken feature. I am inclined to close these two tickets as the wrong solutions to the problems, but have left them open in case I am mistaken and we learn more about how native support could be helpful.

@krjackso

The simplest way to do this is to delay the future by a random amount. A refresh is triggered as a reaction to a lookup, which happens at an unpredictable time, so having a fuzzy refreshAfterCreate could still lead to a stampede. For example, consider a bulk query of 1,000 keys with a variable refresh of 4-6 minutes and an expiration of 10 minutes. If the entries were preloaded and the query comes at the 8 minute mark, they all trigger a reload and the fuzzy value was not helpful. Therefore, making it fuzzy on the loading side via asyncReload would work better. This can be easily achieved today:

In our case the refresh time is 12h with no expire time, and 90% of the data is accessed within any 2-hour window, so we really just need to spread out the refreshes that will occur on the hot data after 12h. Since the downstream is Redis, the Accumulator you shared is probably the best solution: being able to batch individual gets into a few mgets will provide significant gains for us, and at that point it won't matter if the refreshes stampede on an individual server. Thank you for sharing that.

@ben-manes
Owner Author

If the data does not expire and you preload the content, then you might simply have a scheduled task that reloads all of the contents (e.g. by inspecting the cache.asMap() view). Otherwise you might observe stale content being returned longer than expected for those 10% outliers. The refreshAfterWrite is an optimistic reload after a cache hit, so that popular content stays fresh and unpopular content can expire away. Otherwise, expiration would result in a periodic latency spike when the popular items expire and requests block to load them anew. The automatic refresh hides this latency while serving the stale content, which is still usable by the client, whereas expired content is unusable and must wait for a fresh result. We don't offer a scheduled-task variant as a configuration option since it is trivial to do in your code, has clear expectations, and can be based on custom predicates. That might be a little closer to what you are trying to do.
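
A minimal sketch of that scheduled task, assuming Caffeine 3.x's LoadingCache.refreshAll and a 12 hour period:

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(
    () -> cache.refreshAll(cache.asMap().keySet()),
    /* initialDelay */ 12, /* period */ 12, TimeUnit.HOURS);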

@krjackso

If the data does not expire and you preload the content, then you might simply have a scheduled task that reloads all of the contents (e.g. by inspecting the cache.asMap() view). Otherwise you might observe stale content being returned longer than expected for those 10% outliers. The refreshAfterWrite is an optimistic reload after a cache hit, so that popular content stays fresh and unpopular content can expire away. Otherwise, expiration would result in a periodic latency spike when the popular items expire and requests block to load them anew. The automatic refresh hides this latency while serving the stale content, which is still usable by the client, whereas expired content is unusable and must wait for a fresh result. We don't offer a scheduled-task variant as a configuration option since it is trivial to do in your code, has clear expectations, and can be based on custom predicates. That might be a little closer to what you are trying to do.

Ah, this is an even simpler way to accomplish my goals, I think; it would allow me to avoid implementing the CacheLoader and could reuse the preloading logic. Stale data does not really matter here; the business requirements are very relaxed about how fresh the data needs to be (most of it is never updated). In this implementation would it be best to not use refreshAfterWrite at all and handle the refreshing entirely ourselves? Does a Caffeine instance without any expiry or refresh have a reduced memory footprint or increased performance in any way? We may have other uses of Caffeine that could benefit from a similar refresh strategy if so.

@ben-manes
Owner Author

We codegen entries to minimize the memory overhead to only the required features, reusing fields when shareable. If no policy is specified (unbounded), then it is a simple adapter over a ConcurrentHashMap. The refreshAfterWrite needs a write timestamp, and expireAfterWrite also needs linked-list pointers for ordering. There is a very rough, back-of-the-envelope estimation on the wiki. The runtime uses strictly amortized O(1) algorithms, so the costs are minimal and not bottlenecks, but they are not free either.

@anuragagarwal561994

@ben-manes is this being looked at, or can I contribute in any way to help with this feature?

@ben-manes
Owner Author

I believe this feature is misguided or would be used incorrectly.

I think the user intent is to spread reloads out so as to prevent query storms, e.g. by some fuzz factor between requests. This can be achieved by using CompletableFuture.delayedExecutor, so that the cache triggers a reload but when it occurs and completes is user controlled. Similarly, batching reloads into a single request can be done by coalescing the individual calls, performing a single one, and completing their futures with the results. The cache doesn't need any features to be adapted to these types of quality-of-service policies, which are best defined in user code.

For that use-case variable refresh would not help. The refresh timestamp is when an entry is eligible to be reloaded, not when it will be. When an entry is eligible, its next access triggers a reload so that hot data is kept fresh. If expiration is set and the entry times out, then it is evicted so that cold data is discarded. Since this reload time is based on user activity, the refresh timestamp has little impact given a short fuzz value.

If one wanted to trigger a refresh explicitly, like periodically reload the entire cache, then a scheduled task using our apis is the most straightforward approach.

This feature makes sense for heterogeneous data that may vary across many usages, e.g. a CDN. That seems uncommonly specialized and could be left to those users to implement themselves as the semantics might differ.

Therefore, I believe users already have a solution and this would be incorrectly used. Since I don't know of a use-case where it would be the right choice, it seems best to backlog or reject this enhancement. I am open to being convinced otherwise!

@anuragagarwal561994

One use case I am able to think of, from where I landed on this issue, is the upstream service telling the client when to next call it for a refresh.

But I am also confused while evaluating this use case. What my other service gives me is an expiry value, but if the next call fails, I would want to preserve and use the current value for some time before actually evicting it.

So I will state my use case in a more specific manner.

I have a service which gives me the next 10 predictions of inbound QPS for a key (let's say a customer). This service will tell me what the expiry of the currently returned results is and what the granularity of these predictions is, like 1m, 5m, or 15m.

Now I can either maintain 3 separate caches for the 1m, 5m, and 15m granularities and change the refresh times accordingly. For 5m granular data I will receive 10 data points, which means I can use the data for 50m, hence I would need a refresh interval of maybe 10-20 minutes. But for 1m granular data, I would want to refresh in 5 minutes for more accurate data. If the refresh is made variable, then I guess I can achieve this in the same cache.

Let me know if a better solution exists for this specific type of use-case, and does this make sense for this issue?

But I do agree, Caffeine is just awesome the way it is, and the moat of Caffeine is that it has enough to do complicated wonders while being simple enough to cover lots of use-cases.

On a side note, my current project required dealing with complex time management, caching, coalescing, and locking, and then I thought: wait, Caffeine does all of this, why not use Caffeine only? My code became very simple and thread-safe. I just had to focus on the business logic, keeping aside the complexities. Thanks for such a wonderful library :)

@ben-manes
Owner Author

Thanks for the kind words and use-case, @anuragagarwal561994!

Now I can either maintain 3 separate caches for the 1m, 5m, and 15m granularities and change the refresh times accordingly

This sounds like a scheduled reload, e.g. by using a ScheduledExecutorService. In that case it is straightforward to schedule either a periodic task that loops over all of the entries (via the asMap() view) or a periodic task per entry. The library wouldn't be able to do anything smarter; it's simple code and might be more straightforward in your own code based on business implementation details.

A cache refresh in this library is an optimistic reload to hide latency spikes when hot entries expire. For example, a per-tenant configuration or auth token might be used on every request. When it expires, all of the requests would block waiting for the fresh load, so an async reload prefetches it and hides this latency. That occurs only when the entry is eligible for refresh, has not yet expired, and a subsequent read triggers the reload. The cache is still bounded so that cold data will expire and be evicted. This is the policy in question; it currently has a fixed duration, and making it dynamic per entry does not make sense to me yet.

@ben-manes
Owner Author

Rereading, I misunderstood and instead jumped to a conclusion. It does sound like a valid use case! It seems quite niche, but it does show that someone could use it correctly, which is informative.

@anuragagarwal561994

@ben-manes I came up with another important use case for variable refresh, kind of similar to the problem stated earlier: having a different refresh defined for different keys.

Let's say I have one customer id where I want to configure the refresh interval as 5 seconds and another customer id where I would want to configure the refresh interval as 10 seconds; I can't do that as of now.

@ben-manes
Owner Author

If that's a real business use-case then certainly. Most often it seemed users wanted a scheduled task to reload and never evict, so a simple ScheduledExecutorService task (or @Scheduled for some) was actually what they wanted. It hadn't come up as an actual business use-case where different cache refresh intervals were the correct answer, so it led me to be quite skeptical of this feature. You're the first who seems to have understood and be confident in this as the proper solution. 🙂

@anuragagarwal561994

Yes, I can actually explain my business use case in more detail here if that helps justify the problem :)

So we are designing a rate-limiter system. I won't go into great detail about how we are doing it, since it is a proprietary algorithm, but I will try my best to explain the business use-case. As I explained earlier, Caffeine handles the 4 most important things for us:

  • coalescing
  • concurrency
  • time management
  • caching

We are using this to monitor how our current inbound traffic varies compared to the outbound. Now, to observe this behaviour and make a better decision, enough data points need to be collected.

This rate-limiter runs per customer, and hence the data points in the cache value are collected at the customer level.

Now not all customers are the same, so some may have a decent amount of traffic, let's say 1000 QPS, while some may have 10 QPS.

These metrics are aggregated per call, but the analysis is done per time interval (t), which we use as the refresh time, performing our analysis in the reload method. To make an informed decision: if I choose t = 1 second for all customers, then for a customer at 1000 QPS the interval holds enough data points, but for a customer at 10 QPS it is just 10 data points, which leads to fluctuations in our decisions.

A better approach would be to increase the time interval t from 1 second to, let us say, 1 minute, to get at least 10 * 60 = 600 data points for an informed decision for the second customer. But if we make it 1 minute for all customers, then their decisions will be delayed.

Currently to handle this scenario, we can do couple of things:

  1. If, let's say, we have a limited set of refresh intervals like 1 second, 10 seconds, 30 seconds, and 1 minute, we can classify our customers and create different caches.
  2. We can also, as you suggested, create a scheduled executor, but making it per key doesn't make much sense, and making it per classification will cause all of the keys in the 1-second class to refresh together, and so on. Although this shouldn't be a problem except for the first call, because not all of the keys would have entered the cache together. Moreover, it might create more scheduled threads.
  3. We can choose the lowest granularity, like 1 second, and return the same value until the desired granularity is reached for a single customer, but that would leave us handling the time management problem ourselves.
  4. Have a refresh time defined dynamically for each customer, and it will take care of everything else.

Hence the 4th solution beats the others and greatly simplifies our design.

@ben-manes
Owner Author

I agree, (4) is ideal. I don't think (3) is too hard if you overload CacheLoader.asyncReload, as you can simply check whether the per-key interval has elapsed and, if not, return oldValue. That seems like a very easy workaround with very little overhead in those no-op cases. I'd recommend giving it a try (a sketch follows below).
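
A minimal sketch of that workaround, where refreshIntervalFor, lastRefreshed, and load are illustrative per-customer bookkeeping, not library methods:

CompletableFuture<V> asyncReload(K key, V oldValue, Executor executor) {
  Duration interval = refreshIntervalFor(key); // e.g. 1s for hot customers, 1m for cold
  if (Duration.between(lastRefreshed(key), Instant.now()).compareTo(interval) < 0) {
    // within the threshold; no-op by keeping the old value
    return CompletableFuture.completedFuture(oldValue);
  }
  return CompletableFuture.supplyAsync(() -> load(key), executor);
}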

From the implementation side the challenges are (1) defining an interface that is user friendly and not error prone (see the sketch above, which may have mistakes) and (2) reusing, if possible (else codegen-ing additional), entries for the refresh timestamp. The rest of the integration and test work would probably be pretty simple and fast.

@anuragagarwal561994

I went through the proposed design and it really looks nice and clean.

I just think we can have a few default implementations; for example, refreshAfterReload and refreshAfterUpdate may in most cases have the same implementation from the user's perspective.

Also, if a user just wants a single refresh policy per key, then refreshAfterReload, refreshAfterUpdate, and refreshAfterCreate can all share the same default implementation.

So maybe,

interface Refresher<K, V> {
    /**
     * Returns the duration after which the entry is eligible for an automatic refresh after the
     * entry's creation.
     *
     * @param entry the cached key-value pair with its metadata
     */
    Duration refreshAfterCreate(Policy.CacheEntry<K, V> entry);

    /**
     * Returns the duration after which the entry is eligible for an automatic refresh after the
     * replacement of the entry's value due to an explicit update.
     *
     * @param entry the cached key-value pair with its metadata
     * @param oldValue the value prior to the entry being updated
     */
    default Duration refreshAfterUpdate(Policy.CacheEntry<K, V> entry, V oldValue) {
        return refreshAfterReload(entry, oldValue);
    }

    /**
     * Returns the duration after which the entry is eligible for an automatic refresh after the
     * replacement of the entry's value due to a reload.
     *
     * @param entry the cached key-value pair with its metadata
     * @param oldValue the value prior to the entry being updated
     */
    default Duration refreshAfterReload(Policy.CacheEntry<K, V> entry, V oldValue) {
        return refreshAfterCreate(entry);
    }

    /**
     * Returns the duration after which the entry is eligible for an automatic refresh after the
     * value failed to be reloaded.
     *
     * @param entry the cached key-value pair with its metadata
     * @param exception the exception that occurred during reload
     */
    default Duration refreshAfterReloadFailure(Policy.CacheEntry<K, V> entry, Throwable exception) {
        return refreshAfterCreate(entry);
    }
}

should cover the most basic and the more advanced use-cases.

Also, do you think RefreshPolicy would be a better name than Refresher? I am actually not sure of the naming convention; Refresher can also do.

Also, on a side note, while going through the comments I found some really nice contributed examples. We have implemented a NegativeCache (NullBackedCache) over Caffeine and use it widely. I would be very happy to contribute it as an example. Actually, we use this negative cache in combination with a bloom filter, which makes negative calls to the upstream service almost negligible.

@ben-manes
Owner Author

I don't like Refresher either. The term "policy" gets overloaded so I was searching for something else, but RefreshPolicy is likely the better name. The lack of a good name had been an excuse to punt on this, too. 🙂

I agree wrt using default methods, and potentially we could also add @FunctionalInterface. One tidbit: refreshAfterReloadFailure would delay reloading on an error, whereas today the state resets so that the next call triggers it anew. To match refreshAfterWrite / existing behavior, I think the default should be Duration.ZERO, unless we want to debate the pros/cons of extending the reload times between errors?
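
That default could be expressed as follows, matching today's semantics where a failure leaves the entry immediately eligible again:

default Duration refreshAfterReloadFailure(Policy.CacheEntry<K, V> entry, Throwable exception) {
  // an error extends nothing; the next read may trigger a new reload attempt
  return Duration.ZERO;
}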

I'd gladly accept any contributions to the examples section, which is an unpublished, unsupported, free-for-all set of guides for those who find it useful to borrow from for their own needs.

@anuragagarwal561994

anuragagarwal561994 commented Mar 9, 2023

I think like for expiry we have Expiry, for eviction we have Eviction, for refresh we can have Refresh 😆

Oh, I didn't actually know that refresh resets; I will have to revisit my caches to fix this.

Actually both the use cases are valid in their own ways:

  • Immediate retry (or retry on the next call) addresses cases where recent data is more desirable and is updated frequently by the upstream service, as it may give more relevant information based on recent learnings. There is one caveat: since the user right now doesn't have direct access to change this behaviour, if the upstream service is down, in the load method the user has no option but to return a dummy value, or the current value itself, on failure if they want to refresh in the next cycle.
  • In the other case, when the upstream service / datasource is not frequently updated (say, a configuration server), a user would use refresh just to keep the desired value in the cache while the upstream system is down. Generally, through exceptions and logging, a user would know that an upstream system is not performing very well. If the refresh time is short, then nobody would be bothered, but if it is long, a user would want to trigger a manual refresh via some API call.

Now I would also agree with your other point: maybe I didn't know that refreshes are immediate, but for those who are aware, this new default behaviour would be cheating them. So Duration.ZERO can be a safe implementation if the existing behaviour is not to be altered; another benefit of this API would be to allow users to choose how they want to handle exceptions during refresh, which adds yet another use-case for this variable refresh issue.

@ben-manes
Owner Author

Well, the current version is refreshAfterWrite, so after an update the write timestamp is reset. If a refresh is successful, then it updates the cache, thereby resetting the write timestamp. If it fails, then an exception was thrown; we log it, unset our reload indicator, and leave the entry unmodified. This makes the entry eligible to be refreshed again, so that the next access triggers a new reload, possibly erroring out if the external service is down. The cache continues to serve the current value, so there would be log spam and alerts, but no harm until the entry is evicted and a fresh load fails.

Currently the write timestamp is shared with expireAfterWrite. On an error we wouldn't want to naively extend it beyond the expiration time, at which point the entry is not usable.

For your business use-case, does it make sense to serve stale data points? The reloads are optimistic, to hide latencies, whereas expiration indicates that the entry is unusable and must be loaded anew. A popular entry will be prefetched so that it never expires and is kept fresh, thereby never suffering latency spikes. An unpopular entry will idle and expire, so that it fades away and makes room for other cache content. We expect that refresh would be paired with expiration, as we'll serve the current value while the reload is in progress. Likely a short refresh interval would pair with a short expiration interval, because stale content would not be acceptable. So I imagine your business case would adjust both on a per-customer basis?

@anuragagarwal561994

In general we actually have both kinds of business use-cases: some where we can serve stale data points because the underlying source is manually and rarely updated, and at the same time some caches where we want fresh entries, let's say predictions by a machine learning model using reinforcement learning.

In most places we use refresh together with expiry, but we do also use refresh alone, without expiry, again valuing stale data more than no data.

Yes, reloads hide latencies very well, but refresh also provides a fallback when the upstream is down, which is why we like using it so much.

Yes, our business use case would adjust both expiry and refresh at the customer level. Generally I would mostly rely on refresh and use expiry for the case when refresh has been failing for some time, because the value might have gone stale. So a common pattern we use is something like refresh every 5 minutes and discard after 30 minutes. This causes a value to be refreshed in 5 minutes, but if failures in the upstream service persist for more than 30 minutes, because of which we can't get a new value, then we no longer use the old value because it has lost its worth.

I think the above proposed design serves all of our use cases and would help others build smarter caches.

@ben-manes
Owner Author

ben-manes commented Mar 9, 2023

Awesome, thanks for all the valuable feedback! I'm not sure when I will start working on this, but it mentally unblocks me. I wouldn't want to promise a delivery date, so of course plan accordingly with your alternative approaches.

You are welcome to try hacking a PoC if you want; if not, that's perfectly fine and I'll get to this. For background, I'll add the implementation details.

The main gotcha is playing with the code generator to minimize the number of classes generated, so we try to reuse fields given the same structure. The original goal was to minimize memory overhead, as unloaded classes are free and a little jar bloat is acceptable. Graal has made AOT more popular, and those projects need to whitelist the classes to include, as the rest is otherwise dead-code eliminated. Those projects (quarkus, micronaut, etc.) don't really like having to update their default listing and don't want to include all of the classes, so finding the cross-section for reuse is ideal but might be tricky. We technically have the structure if we disallow the combination of refreshAfter(Refresh) + expireAfterWrite + expireAfterAccess, though mapping features to fields might not line up, and a new set of classes may be more reasonable. Anyway, that's the first place I'd look to decide whether reuse is feasible, or else to see how much is newly generated.

@anuragagarwal561994

Thanks @ben-manes, yes, for now we will try to use one of the above suggested solutions until this is available.

@anuragagarwal561994

@ben-manes I have started implementing this feature and will try to submit an MR for it. I will ask if I get stuck somewhere; I believe I will be able to do 60% of the job by myself and will require some assistance covering the remaining 40%.

@ben-manes
Owner Author

Oh neat. I was thinking about how to avoid codegen and was considering overloading the write/access expire fields. When reading or writing the timestamp, the cache could check which feature is enabled and call the other method. This effectively changes from domain method names to setTimestamp#, to reuse the structure and not require more classes. It would be a little awkward to read/debug, but fairly straightforward.

@anuragagarwal561994

As I proceed with the implementation, I think we can't avoid codegen.

I will state my approach here: I am storing a variable refresh time per entry.

For the rest I am mostly copy-pasting things, with care. Mainly, the idea is that wherever I see refreshAfterWriteNanos being used, I also add an option to use a variable refresh, like it is done for expiry.

I think we can't avoid the codegen since we are storing another variable in the cache entry. I haven't reached the codegen part yet, but as far as I have understood the design: to reduce the size of a cache entry based on the different features a user enables, there is a Node interface and implementations are generated based on the features being used.

So a node using weight will store the weight and implement a method to give back the entry's weight, while a node not using weight won't have a weight field and uses the interface's default implementation to return a static value.

Although I agree that we can optimize the codegen to produce fewer classes, I think that can be taken up separately. From my reading of the code, using a default implementation of variable refresh for fixed refresh, operating per the fixed settings, should itself simplify a lot of things, but the entries will start consuming more memory; some features can be combined by carefully choosing the right combinations.

@anuragagarwal561994

One thing that I see is that since we are passing a CacheEntry to the Refresh interface, I have to create a snapshot entry on write / update, as opposed to the current code, where it is mostly created only while taking a snapshot.

@anuragagarwal561994

One more approach that I have seen used for handling both refresh and expiry in a few other cache systems, and that greatly simplifies the implementation, is to express the refresh time as a % of the expiry.

This limits the extensibility but helps simplify the implementation.

@ben-manes
Owner Author

  1. There is an unnatural combination from Guava, expireAfterWrite + expireAfterAccess, which generates two timestamp fields. That feature combination makes no sense in practice and I've long wanted to disallow it, but I have not broken compatibility as it is a simple feature to support, even if I can't imagine a valid use-case. If we restrict this combination from also using variable refresh, then we have a Node structure that matches all of our needs. I think that is a fair compromise that supports Guava migrations, where adopting new features promotes users into better configurations.

  2. I think creating an instance on a write is okay because writes are relatively rare, and garbage is already created by the hash table and others. A read should not create garbage, as that is the common case and would incur GC pressure. In the sketch proposal I wrote about that,

Unfortunately Expiry uses long return values due to being called very frequently, e.g. expireAfterRead, so without value types the allocation of a duration was undesirable. Here since only one refresh per entry may be in-flight it will be invoked much less frequently, so we can safely prefer usability over concerns about GC churn. Therefore the much nicer Duration is acceptable.

  3. Since refresh occurs after the expiration is set, it would be fairly safe to use the CacheEntry's expiration metadata to calculate a duration. Then if one wanted a percentage-based policy, it is just entry.expiresAfter().dividedBy(2).

@tsutsu

tsutsu commented Mar 18, 2023

Reiterating my original interest in this feature above: I realized I didn't provide a concrete example of what led me to want "HTTP-like semantics where you determine the caching parameters at load time." That use-case is the following.

We have a Caffeine cache fronting access to a slower DBMS table of canonical scrape results, which in turn fronts an actual attempt to scrape the given data from some data source.

I.e., we want to answer the question "what does [flaky + slow server X] say in response to [RPC request Y]?" Starting out, we have to ask server X. If we get a positive response, we can put that positive response into the DB, so we never have to ask server X again and can instead ask our (scalable, but still slightly slow in RTT terms) DB. And then, to make up for the DB being slightly slow (esp. when needing the same info for thousands of realtime requests), we stick Caffeine in front of that "fetch from DB, or fetch from remote and cache into DB" logic, using a LoadingCache. So some users get stuck triggering a scrape that can take up to 3s; some users get stuck triggering a DB fetch that can take up to 100ms; but most users get a response from the Caffeine cache in a few microseconds.

The case that isn't currently solved for us is the negative case. In the positive case, the cache entries are permanent/immutable; we don't need to evict them except to keep the memory use of the cache bounded. But in the negative case, e.g. if flaky server X gives us a 503 response to our scraping attempt, then both of the current options are bad for us:

  • If we throw a RuntimeException in order to reject insertion of the cache key, then the next get on the LoadingCache will immediately re-trigger the scrape. Given the temporal locality of requests into our system and the temporal correlation of failures in the remote system, this throws a thundering herd of traffic at the flaky remote system, making it even more flaky.

  • But if we cache the temporary-failure result, then either we have to make the whole cache temporary (which evicts the permanent entries far before they should be), or we have to make the whole cache permanent (in which case we'll never retry the temporary failures).

  • And, even worse, different temporary failures imply a different amount of time before retry would be acceptable. Not just categorically to the type of failure (which would imply a bounded number of sub-caches), but down to the individual cache-key level. If we're doing an HTTP request to the remote, we might get a 304 Not Modified response with an Expires or Cache-Control header; or a 429 Too Many Requests or 503 Service Unavailable response with a Retry-After header; and the information in that header should determine the TTL for the CacheEntry.

Right now, to half-solve this problem, we're layering two caches like this:

  1. fooFetchResultCache (a LoadingCache, with a TTL); where load calls the whole "get from DB, or scrape into DB" logic, and then both temporary failures and permanent successes are cached as regular cache entries, for an arbitrarily-chosen "good enough" retention period for average failures
  2. fooCache (a LoadingCache, no TTL); where load calls fooFetchResultCache.get(key), and then asserts that the result is a permanent-success, turning all returned temporary-failure entries from layer 1 into RuntimeExceptions so that they won't get cached in layer 2

This works somewhat, but it's definitely non-ideal, one major reason being that permanent-success values end up cached in fooFetchResultCache as well; with the bounded size of the cache, temporary-failure values get evicted before their TTL is up, causing needless re-scraping.

And before you say "it sounds like you should be using some kind of REST client library that does its own browser-like caching that obeys HTTP caching semantics" — these requests aren't actually REST requests, that was all an analogy (they're actually JSON-RPC requests); and also, we don't want to cache the response document itself, we want to cache our parsed interpretation of it, for an arbitrary business-logic-defined TTL derived from that parsed interpretation. Which really seems like a Caffeine thing to do.

And while I think it might be possible to build this using a regular Caffeine Cache (or several) with a wrapper function that does the fetch and writes results into the appropriate sub-cache for the result type, our use-case really does depend on a LoadingCache's ability to do request coalescing for keys that are being "worked", and if we switched to a regular cache, we'd have to rebuild that part ourselves — essentially rewriting a LoadingCache, just with our version of caching.

@ben-manes
Owner Author

ben-manes commented Mar 18, 2023

Thanks @tsutsu,

In what ways is variable refresh preferred over using variable expiration? You could make successes use an infinite duration and failures use a custom one. The difference is that an expired entry will be evicted and concurrent calls wait for the reload, whereas a refresh will serve the stale value and reload asynchronously on the next access. Since the reload might occur much later than the HTTP header's TTL, you might prefer to force a fresh query rather than serve a stale failure.
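
A sketch of that variable-expiration alternative, where Result, isSuccess, retryAfter, and scrape are illustrative stand-ins for the parsed response, its status, and its Retry-After/TTL:

LoadingCache<Key, Result> cache = Caffeine.newBuilder()
    .maximumSize(100_000)
    .expireAfter(new Expiry<Key, Result>() {
      @Override public long expireAfterCreate(Key key, Result result, long currentTime) {
        // successes are effectively permanent; failures expire per their retry hint
        return result.isSuccess() ? Long.MAX_VALUE : result.retryAfter().toNanos();
      }
      @Override public long expireAfterUpdate(Key key, Result result,
          long currentTime, long currentDuration) {
        return expireAfterCreate(key, result, currentTime);
      }
      @Override public long expireAfterRead(Key key, Result result,
          long currentTime, long currentDuration) {
        return currentDuration;
      }
    })
    .build(key -> scrape(key)); // returns a Result for successes and failures alike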
