
get API that does not update eviction heuristics #418

Closed
hshankar opened this issue May 13, 2020 · 20 comments
@hshankar

Is there a get API in Caffeine that does not update the eviction heuristics used by TinyLFU? In my use case, the write operation has to read from the cache, but that ends up skewing the heuristics and keeping a value from being evicted even if I am only writing to it and never reading it. Is there any way for the write thread to do a get() on a key without counting that get as a cache hit?

Details: We are using Caffeine as a separate cache, apart from the DB. However, our usage is somewhat atypical in that when a user reads a key, it typically requires loading N different entities from the DB. In the cache we store an aggregated wrapper containing these N entities; say the cache key is UserId and the cached value is a Map<EntityId, Entity>.

For reads, the app consults the cache, and on a miss it loads from the DB and puts the value in the cache. Writes only update one entity at a time, so a write looks like <UserId, EntityId, Entity>. For a write, we first check whether the cache has an entry for UserId, and if so we do cache.get(userId).put(entityId, entity); if the UserId is not cached, we don't need to update the cache. The problem is that every write does a cache.get(userId), so Caffeine thinks the key is being accessed, but these are write accesses and should not contribute to the entry staying in cache longer. We only want to keep an entry if it is accessed at read time. Is there any way for us to let Caffeine know the difference in access type?
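
For concreteness, a minimal sketch of the two flows (UserId, EntityId, Entity, and loadAllFromDb are stand-ins for our real types and DB layer; the maximum size is arbitrary):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Stand-ins for the real application types.
record UserId(long id) {}
record EntityId(long id) {}
record Entity(String data) {}

class UserEntityCache {
  private final Cache<UserId, Map<EntityId, Entity>> cache =
      Caffeine.newBuilder().maximumSize(80_000).build();

  // Read path: a hit here should count toward the eviction policy.
  Map<EntityId, Entity> read(UserId userId) {
    return cache.get(userId, this::loadAllFromDb);
  }

  // Write path: this lookup also counts as an access, which is the problem.
  void write(UserId userId, EntityId entityId, Entity entity) {
    Map<EntityId, Entity> entities = cache.getIfPresent(userId);
    if (entities != null) {
      entities.put(entityId, entity); // update only if the user is cached
    }
  }

  private Map<EntityId, Entity> loadAllFromDb(UserId userId) {
    return new ConcurrentHashMap<>(); // placeholder for the N-entity DB load
  }
}
```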

@ben-manes
Owner

Not currently, though there have been discussions of what quiet methods would mean for such cases.

We've generally tried to make the eviction policy smarter so it can handle odd workloads. If you captured a trace indicating which accesses for a key hash are reads and which are writes, then we'd have data to justify an API method. So far it's only been speculation, which leaves only questions.

Have you observed a problem, or is this an educated concern?

@ben-manes
Owner

You might also consider using a (UserId, EntityId) key and using getAll to return the aggregate on a miss.
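
For example, a rough sketch of that approach, reusing the stand-in types from the sketch above and assuming a hypothetical bulk loadEntities DB helper:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Each entity is cached under its own (UserId, EntityId) key, and getAll
// assembles the aggregate, loading any absent keys together on a miss.
record CompositeKey(UserId userId, EntityId entityId) {}

class CompositeKeyCache {
  private final Cache<CompositeKey, Entity> cache =
      Caffeine.newBuilder().maximumSize(80_000).build();

  Map<CompositeKey, Entity> readAll(UserId userId, Set<EntityId> entityIds) {
    Set<CompositeKey> keys = entityIds.stream()
        .map(id -> new CompositeKey(userId, id))
        .collect(Collectors.toSet());
    return cache.getAll(keys, this::loadEntities); // one bulk load on a miss
  }

  private Map<CompositeKey, Entity> loadEntities(Set<? extends CompositeKey> keys) {
    Map<CompositeKey, Entity> loaded = new HashMap<>();
    for (CompositeKey key : keys) {
      loaded.put(key, new Entity("...")); // placeholder for the bulk DB read
    }
    return loaded;
  }
}
```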

@hshankar
Author

> If you captured a trace indicating which accesses for a key hash are reads and which are writes, then we'd have data to justify an API method.

What do you mean by a trace here? Do you mean a log proving that a particular key K1 was evicted in favor of K2 even though K2 only had writes and K1 had reads?

I am not sure whether this is a problem for us yet, but I strongly suspect it is: ours is a write-heavy use case, and we are under memory pressure and hence evicting constantly. There is a fair amount of reads as well, but they are very cacheable (most readers are repeat users), and the first load from the DB is expensive, which is why we cache the result. The reads and writes are also somewhat orthogonal: users who write a lot do not read a lot, and users who read a lot do not write much. I know it sounds odd, but unfortunately I am not sure I can reveal more here. The read-heavy users, who need to be in the cache, are probably being penalized by the write-heavy users; a write-heavy user might have read just once, but their entry stays in cache forever because they write a lot.

(UserId, EntityId) as a key also won't work; the Map example was a simplification, and the value is actually a fairly heavy POJO with a lot of fields and nesting.

@ben-manes
Owner

We would run the trace through the simulator and observe the hit rates of different policies. The event's key is a long, so you can obfuscate real data using a simple hash. Since you care about the distinction between reads and writes, the access type could be included in each event. Then we'd be able to write a parser and see how the hit rate compares when we supply all events versus only read events. If you have a real workload, e.g. from production, you could capture a day's worth for us to evaluate against. It really helps us make data-driven decisions when possible.

The prior use cases asking for a quiet getIfPresent were typically scheduled tasks that iterated over the cache (e.g. persisting a snapshot). Those could use one of the asMap view's iterators, which don't update the access metadata. Usually the requests came from users iterating over the keySet and looking up each entry, rather than iterating over the entrySet to get both together. Those were often rooted in the misunderstanding that asMap is an expensive copy rather than a view.
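
For example, a minimal sketch of that snapshot pattern (persist is a hypothetical snapshot writer):

```java
import java.util.Map;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class SnapshotExample {
  // Hypothetical snapshot writer.
  static void persist(String key, String value) {
    System.out.println(key + "=" + value);
  }

  public static void main(String[] args) {
    Cache<String, String> cache = Caffeine.newBuilder().maximumSize(1_000).build();
    cache.put("a", "1");

    // Anti-pattern: each getIfPresent counts as an access in the eviction policy.
    for (String key : cache.asMap().keySet()) {
      persist(key, cache.getIfPresent(key));
    }

    // Preferred: the asMap() view's iterators read entries without updating
    // the access metadata, so the snapshot does not perturb eviction.
    for (Map.Entry<String, String> entry : cache.asMap().entrySet()) {
      persist(entry.getKey(), entry.getValue());
    }
  }
}
```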

A getIfPresentQuietly that reads the value without any metadata update (eviction, expiration, etc.) would be possible. It is just a matter of understanding a use case that justifies it, rather than a misunderstanding of the APIs. We already have a lot of ad hoc methods under cache.policy() for these special scenarios, so we can avoid polluting the core interfaces.

It sounds like that would be all you need, but more complex cases like writes open up more questions (e.g. what exactly would computeQuietly mean?). If the scope creeps to cases where we need to update metadata, it probably becomes too painful to consider. But if a good reason justifies a simple method like getIfPresentQuietly (which just reads the value from the backing map), it could certainly be added without much trouble.

In short, I think the first step is to acquire some trace data to justify by comparing the hit rates. Then we'd make sure the changes solve your problem and cut a new release.

@hshankar
Author

Got it. I am thinking the following CSV format will do:
keyHash,access
where access = get, put, or update:
put means the key was inserted for the first time;
update is a regular write, which we don't want to count as an access.

Let me know if you need evictions as well, though it might be tricky to add a trace log in the eviction listener.
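
For example, a minimal sketch of emitting that format (where these calls sit in the application, and the use of stdout, are assumptions):

```java
// Minimal sketch of emitting the proposed trace format.
final class CacheTrace {
  private CacheTrace() {}

  // access = get | put | update; hashCode obfuscates the real key
  static void log(Object key, String access) {
    System.out.printf("%d,%s%n", key.hashCode(), access);
  }
}

// At the call sites:
//   CacheTrace.log(userId, "get");     // read-path lookup
//   CacheTrace.log(userId, "put");     // first insertion
//   CacheTrace.log(userId, "update");  // regular write
```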

@ben-manes
Owner

ben-manes commented May 13, 2020

That sounds great. We won't need evictions, since we'll run the trace through the simulator to see which evictions occur naturally and their effect on the hit rate. We'll even get a sense of your workload (recency vs. frequency) by seeing how it behaves under all of our other policies (ARC, LIRS, etc.), and adding a new parser is very easy.

@hshankar
Author

hshankar commented May 15, 2020

cache-trace.csv.gz
Attached are the trace logs. The format is:
keyHash,action,path
where action = get/put (cache.get() or cache.put())
and path = read/write (the read flow vs. the write flow)

Ideally we would only consider the rows where path = read for the eviction heuristics. Only about 2.3% of the accesses are on the read path (~42k out of 1.8M total cache accesses).

@ben-manes
Owner

Thanks. The cache size is very tiny, which can cause some odd skew, because most advanced policies maintain a history (to learn from) sized relative to the maximum size. This can make adaptivity harder due to noise, which goes away when using larger sampling periods. How large is your cache normally?

I used a size of 256, which it seems is more than enough. Your trace has a total of 1.8M events.

1. Reads update the policy, populating on a miss. Writes directly insert the value.

   Hit Rate | Hits   | Misses
   92.71 %  | 38,824 | 3,054

2. Reads and writes both update the policy and populate on a miss.

   Hit Rate | Hits    | Misses
   41.33 %  | 743,855 | 1,056,145

3. Reads update the policy, populating on a miss. Writes update the policy if the entry is present and do not insert the value if absent.

   Hit Rate | Hits   | Misses
   92.71 %  | 38,824 | 3,054

With your current usage (2), the hit rate is much worse than with your proposal (1). Ideally, if you could use computeIfPresent, it would avoid polluting the cache on an update and you'd get (3). However, that might not be possible, since you might have the entry but not care to write to it, causing the lookup to be treated as a read and update the policy. I suppose, then, that the only way to achieve the high hit rate is to bypass the normal APIs, as you propose, because there is no other way to capture your semantics.
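
For reference, a sketch of what (3) would look like with computeIfPresent, using the same stand-in types as earlier:

```java
import java.util.Map;
import com.github.benmanes.caffeine.cache.Cache;

// Sketch of option (3): the write path goes through the asMap() view's
// computeIfPresent, so an absent key is never populated by a write.
final class WriteThroughIfPresent {
  static void write(Cache<UserId, Map<EntityId, Entity>> cache,
                    UserId userId, EntityId entityId, Entity entity) {
    cache.asMap().computeIfPresent(userId, (key, entities) -> {
      entities.put(entityId, entity); // update only an already-cached aggregate
      return entities;
    });
  }
}
```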

@hshankar
Author

Here is a larger dump if it helps. It has 100M entries (around 350MB compressed). Typically the cache size I have seen is around 80,000.

@ben-manes
Owner

I had a bug in my hack to the simulator to support your complex scenario.

1. reads use a get; writes always use a get
2. reads use a get; writes use a get only if the entry is present

Large trace (cache size 80,000)

Mode | Hit Rate | Hits       | Misses
1    | 69.47 %  | 69,468,943 | 30,531,057
2    | 95.13 %  | 2,212,211  | 113,275

Small trace (cache size 256)

Mode | Hit Rate | Hits    | Misses
1    | 39.95 %  | 719,087 | 1,080,913
2    | 89.06 %  | 37,297  | 4,581

Since (2) doesn't have to track the majority of requests, it shows a much higher hit rate. This flow is too complex to simulate exactly, but the results do indicate that your feature request would likely help you.

The implementation I'd propose is to add cache.policy().getIfPresentQuietly(key), which would skip updating any policy (eviction, expiration, stats). It would do the minimal work of not returning expired entries, making it equivalent to containsKey except that it returns the value.

I don't know what other "quiet" methods might mean, and the variations could be very painful: for example, don't update expiration but do update the eviction policy, or overwrite a value but don't notify a listener of the old value's removal. Older caching libraries had ad hoc quiet methods, and that really gets into a framework mindset. So I am fine with this one-off but would push strongly against variations, because they are confusing and could impact the internal design negatively.

I'd place this in Policy so as not to promote it on the Cache interface, but rather leave it as an escape hatch. This way those who truly need it, like you, are not blocked, but it is disfavored compared to the other API methods.

@hshankar
Author

That sounds good. Agreed that adding it to the main API would be unclean from an abstraction point of view. "Quietly" might still have a broad interpretation; maybe untrackedGet() would make it clearer that this get is not tracked anywhere and changes no internal state. Still vague, I guess, but it narrows it down a little. Or maybe even untrackedLookup(), to distinguish it further from get().

@ben-manes
Owner

The term "quiet" was used in older caching libraries, Apache JCS and Ehcache v1 & v2. They defined it as:

JCS - Get an item from the cache without affecting its last access time or position.
Ehcache 2 - Gets an element from the cache, without updating Element statistics.

Ehcache added other quiet methods, putQuiet and removeQuiet, which also did not notify the listeners.

Put an element in the cache, without updating statistics, or updating listeners.
Removes an Element from the Cache, without notifying listeners.

JCache's API was designed by Ehcache; it dropped these quiet methods but confusingly added a removeAll that notifies listeners and a clear that does not. It doesn't have a quiet get, put, or remove, though.

So "quiet" has some historic familiarity and isn't a verbose word. I am not sure whether it's better to be vague or explicit. Since this is a second-class API method, perhaps being vague is better than over-specifying?

The other consideration is that in Guava we tried to extend the Collections Framework rather than invent or redefine terms. That has a nicer consistency, but it also means that breaking away feels very harsh and foreign. We differ only in that a map is a passive data structure, so it has a passive get and an active computeIfAbsent. Since a cache is active (e.g. expiration), we inverted this to getIfPresent and get. That's why I'd go with some form of getIfPresent for a more consistent experience.

@Maaartinus

> The implementation I'd propose is to add cache.policy().getIfPresentQuietly(key), which would skip updating any policy (eviction, expiration, stats). It would do the minimal work of not returning expired entries, making it equivalent to containsKey except that it returns the value.
>
> I don't know what other "quiet" methods might mean, and the variations could be very painful: for example, don't update expiration but do update the eviction policy, or overwrite a value but don't notify a listener of the old value's removal.

This makes "quiet" nearly unusable. Since the naming can never be good and the number of possibilities is large, what about approaching it the other way round? Instead of choosing the most important options and finding a name for them, imagine you'd like to provide all of them via a single method, something like .get(key, ONLY_IF_PRESENT, SKIP_LISTENERS, SKIP_POLICY). This is how java.nio.file.Files.move provides its many variations. I'm not sure whether it's doable or too complicated, but it saves you from naming problems. Anyway, it should be useful as a thought experiment at least.

To avoid varargs, there should probably be an options object that is pre-created by the user. As this is just a secondary API, that's a reasonable burden on the users. The options object could also prohibit combinations that make no sense (if such exist).
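
For illustration, a rough sketch of what such a pre-created options object might look like (all of these names are hypothetical, not part of Caffeine's API):

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the options-object idea. Mutually exclusive
// combinations could be rejected in the factory method.
enum Suppression { SKIP_LISTENERS, SKIP_EVICTION, SKIP_EXPIRATION, SKIP_STATS }

final class GetOptions {
  private final Set<Suppression> suppressions;

  private GetOptions(Set<Suppression> suppressions) {
    this.suppressions = EnumSet.copyOf(suppressions);
  }

  // Pre-created once by the user and reused, avoiding per-call varargs allocation.
  static GetOptions of(Suppression first, Suppression... rest) {
    return new GetOptions(EnumSet.of(first, rest));
  }

  boolean contains(Suppression suppression) {
    return suppressions.contains(suppression);
  }
}
```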

@ben-manes
Owner

ben-manes commented May 16, 2020

That would be exactly the right approach if suppressions were not problematic to mix with concurrency. We'd go a step further and offer a suppressed view, e.g.

Cache<K, V> suppressed(Suppression suppression, Suppression... suppressions);

This becomes challenging with concurrency because it directly impacts the design. The cache algorithms are not multi-thread friendly, so we split them apart from the concurrent hash table using a pub-sub approach. Instead of applying the change to both under a lock, after the map operation we publish an event to a queue and trigger a consumer to apply it asynchronously to the policy. These suppression options would need to be propagated in those events so they can be honored when applied to the policy data structures.

Since performance is a core feature, we are very mindful of minimizing allocations, as a high allocation rate directly limits throughput. Almost all of our performance gains over Guava come from how this is done. Back then I started simpler by using a ConcurrentLinkedQueue, which allocates a wrapping node per element. Switching to ring buffers, which store the entry so that no allocation is needed on a read, increased throughput by 25x. Unfortunately that patch was never accepted due to lack of interest.

For a get, the suppressions would need to avoid updating the eviction or expiration policy. These are built on top of linked lists so entries can be cheaply reordered in O(1) time (see LRU), and an option might allow one or both to be skipped. While those algorithms are not concurrent, that doesn't matter thanks to our pub-sub behavior (buffer & replay under a lock). However, we'd need to change our publishing to allocate, which would significantly hurt performance if done in the general case.
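
To illustrate the buffer-and-replay idea, here is a toy sketch (a drastic simplification: the real read buffers are lossy, striped ring buffers, not a blocking queue; all names here are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

final class PolicyBuffer<K> {
  private final BlockingQueue<K> readBuffer = new ArrayBlockingQueue<>(128);
  private final ReentrantLock evictionLock = new ReentrantLock();

  // Publish: record the access without touching the (non-concurrent) policy.
  // If the buffer is full, the event is simply dropped (the policy is lossy).
  void recordRead(K key) {
    readBuffer.offer(key);
    if (readBuffer.remainingCapacity() == 0) {
      drain();
    }
  }

  // Subscribe: replay the buffered events against the policy under one lock.
  void drain() {
    if (evictionLock.tryLock()) {
      try {
        K key;
        while ((key = readBuffer.poll()) != null) {
          reorder(key); // O(1) move on the policy's intrusive linked list
        }
      } finally {
        evictionLock.unlock();
      }
    }
  }

  private void reorder(K key) {
    // placeholder for the LRU / TinyLFU reordering
  }
}
```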

There are hacks, though, if we accept degrading only the suppressed view. The published entry is an abstract type, so we could wrap the original with a delegate that captures these suppressions; when consuming the entry, it would be unwrapped for the intrusive list operations. The performance hit would be isolated, but it would be an invasive change to wrap/unwrap throughout the code, and it might be surprising for users to discover that disabling features reduces performance.

The question then becomes one of power-to-weight ratio, and whether a fully baked set of suppressions is worthwhile. So far it hasn't been: the few asks have been only for getIfPresent and are okay with a very restrictive definition. Until now, those were also solvable without a new API method, or lacked data to sway the argument. For now I'd be open to, but very hesitant about, a richer set of quiet methods; if one were implemented, there would be a very low API cost to letting a redundant getIfPresentQuietly linger while promoting the sibling suppressed-view alternative instead.

@hshankar
Author

Thanks for the historical context! Makes sense to keep it consistent, and it's reassuring to know that older implementations also needed this kind of behavior. I also just noticed there is already a getIfPresentQuietly() method in LocalCache that does this. I can send out a PR to expose it through the Policy interface if that sounds good. Let me know.

@ben-manes
Owner

Thanks, I have it all done with tests locally :)

@ben-manes
Owner

When the snapshot is built, please take a quick look to see if the new method works as desired in your application. It will be available on the Sonatype snapshot repo or via jitpack on master. If it's a good match then I'll cut a release.

V value = cache.policy().getIfPresentQuietly(key);
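
Applied to your use case, the write path might look something like this (a sketch using the same hypothetical UserId/EntityId/Entity stand-ins as earlier; the quiet lookup leaves the eviction, expiration, and stats state untouched):

```java
import java.util.Map;
import com.github.benmanes.caffeine.cache.Cache;

final class QuietWritePath {
  static void write(Cache<UserId, Map<EntityId, Entity>> cache,
                    UserId userId, EntityId entityId, Entity entity) {
    // Reads the value without counting the lookup as an access.
    Map<EntityId, Entity> entities = cache.policy().getIfPresentQuietly(userId);
    if (entities != null) {
      entities.put(entityId, entity); // update only an already-cached aggregate
    }
  }
}
```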

@hshankar
Author

Awesome, thanks for this! I took a look at what I'd need to do to test it without a release version, and it is a substantial amount of work: I'd have to plumb these changes through multiple layers because the application depends on Caffeine transitively. There's also some red tape around pushing non-release versions to production. Given the effectiveness on the trace logs, I am confident it should work for us.

@ben-manes
Owner

Yep, no need to push it to production. I just wanted to ensure you looked at the usage code to make sure it works for you. I'll release tonight or tomorrow.

@ben-manes
Owner

Released in 2.8.3.
