
[CHIP-1] Cache batch IRs in the Fetcher #682

Open

caiocamatta-stripe wants to merge 16 commits into main from caiocamatta--fetcher-batch-ir-caching-oss

Conversation

caiocamatta-stripe
Collaborator

@caiocamatta-stripe commented Feb 12, 2024

Summary

This PR adds a batch IR cache to the Chronon fetcher.

  • In the fetcher, we now cache the decoded IRs for batch GetRequests. When a request comes in, the BaseFetcher will first try to get the IRs from this cache before sending a KV Store request (see the sketch at the end of this summary).
  • This cache is shared across all GroupBys being served by the Fetcher.

During our tests, we saw roughly a 4GB increase in memory utilization with a 10,000-element cache, but this number will vary significantly across Chronon users.

This work is part 0, 1, and 2 of CHIP-1.
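For illustration, here is a minimal sketch of the lookup order described above. It is not the actual FetcherBase code: the key and value types, the TrieMap standing in for the real bounded cache, and the kvStoreGet / decodeIr helpers are all stand-ins.

import scala.collection.concurrent.TrieMap

// Hypothetical sketch of the batch IR lookup path; not the actual Chronon code.
object BatchIrLookupSketch {
  // Stand-in for the real cache key (dataset, entity keys, batch end timestamp).
  case class CacheKey(batchDataset: String, keys: Map[String, Any], batchEndTsMillis: Long)

  // Stand-in for the real size-bounded cache shared across all GroupBys.
  private val batchIrCache = TrieMap.empty[CacheKey, AnyRef]

  // Placeholders for the KV store call and the Avro decoding step.
  private def kvStoreGet(key: CacheKey): Array[Byte] = Array.emptyByteArray
  private def decodeIr(bytes: Array[Byte]): AnyRef = if (bytes.isEmpty) null else bytes

  def getBatchIr(key: CacheKey): AnyRef =
    batchIrCache.get(key) match {
      case Some(ir) => ir // cache hit: skip the KV store request entirely
      case None =>
        val ir = decodeIr(kvStoreGet(key)) // cache miss: fetch bytes and decode the IR
        if (ir != null) batchIrCache.putIfAbsent(key, ir) // only successful decodes are cached
        ir
    }
}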

Enabling the cache

To enable the cache,

  1. Set the Java arg ai.chronon.fetcher.batch_ir_cache_size to a non-zero number. This creates the cache object within Chronon.
  2. Add enable_caching=True to the GroupBy definition.

Having a two-step process allows for safer rollouts.
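A small sketch of how the two switches interact, using the names from the steps above; the property parsing shown here is hypothetical, not the actual Chronon startup code.

// Step 1: a non-zero ai.chronon.fetcher.batch_ir_cache_size creates the cache object.
val batchIrCacheSize: Int =
  Option(System.getProperty("ai.chronon.fetcher.batch_ir_cache_size"))
    .map(_.trim.toInt)
    .getOrElse(0) // unset (or zero) means no cache is created at all

val cacheConfigured: Boolean = batchIrCacheSize > 0

// Step 2: even with the cache created, a GroupBy is only cached if its definition
// sets enable_caching=True (looked up from the GroupBy's customJson at fetch time).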

Why / Goal

Motivation: decrease feature serving latency.

  • At Stripe, we saw up to a 43% decrease in p99 feature serving latency and up to a 30% decrease in CPU utilization after enabling this cache.

CHIP-1 – Online IR and GetRequest Caching & Discussion

Test Plan

  • Added Unit Tests
  • Integration tested (existing FetcherTests)

QA Testing

Beyond the code tests, we also tested this change extensively in our QA environment that has a variety of GroupBys.

  • Using some of the additional logging that we have on our version of Chronon, I tested a few different scenarios and confirmed that all the feature values and logs were correct when caching was enabled.
    • Fetching a key that only has batch data
    • Fetching a key that only has streaming data
    • Fetching a key that has both batch and streaming data
    • Fetching a key that has no data
  • Confirmed that no cache gets created or used when the ai.chronon.fetcher.batch_ir_cache_size arg is zero or not set (status quo).
  • Only the GroupBys with enable_caching set get cached.

Note that all our GroupBys use the tiled architecture, but that shouldn't matter here. In this PR, we're only modifying the code paths for fetching batch data, which are the same with or without the tiled architecture.

Online-Offline Consistency

Using two different Joins with 10-15 GroupBys each, we also confirmed that online-offline consistency metrics remained the same before and after enabling caching for all GroupBys.

Load tests

We used a long-running, large-scale load test (5K requests/sec, 15 GroupBys in the Join) to confirm that these changes are stable. We did not see any error or latency spikes during a multi-day load test with caching enabled.

Checklist

  • Documentation update

Reviewers

@nikhilsimha (feel free to assign others instead)
@piyushn-stripe

@caiocamatta-stripe force-pushed the caiocamatta--fetcher-batch-ir-caching-oss branch 2 times, most recently from 38206fd to a7c7618 on February 28, 2024
@caiocamatta-stripe force-pushed the caiocamatta--fetcher-batch-ir-caching-oss branch from a7c7618 to 287f05e on April 1, 2024
@caiocamatta-stripe force-pushed the caiocamatta--fetcher-batch-ir-caching-oss branch from 287f05e to fd0df49 on April 1, 2024
@caiocamatta-stripe changed the title from "[WIP][CHIP-1] Cache batch IRs in the Fetcher" to "[WIP][CHIP-1] Cache batch IRs in the Fetcher to decrease feature serving latency" on Apr 2, 2024
Comment on lines 58 to 71
* A groupBy request is split into batchRequest and optionally a streamingRequest. This method decodes bytes
* (of the appropriate avro schema) into chronon rows and aggregates them further if necessary.
*
* @param batchResponses a BatchResponses, which encapsulates either a response from kv store or a cached batch IR.
* @param streamingResponsesOpt a response from kv store, if the GroupBy was streaming data.
* @param oldServingInfo the GroupByServingInfo used to fetch the GroupBys.
* @param queryTimeMs the Request timestamp
* @param startTimeMs time when we started fetching the KV store
* @param overallLatency the time it took to get the values from the KV store
* @param context the Metrics.Context to use for recording metrics
* @param totalResponseValueBytes the total size of the response from the KV store
* @param keys the keys used to fetch the GroupBy
* @return
*/
Collaborator Author

I'm not sure about leaving these docs in. I might remove them

Collaborator

I don't mind having these around

Contributor

I recommend dropping comments for every param - it adds a lot of noise to code and eventually hurts readability.

I think for a lot of these, the names are descriptive enough.
In general comments should be either describing a very high level control flow, or a specific non obvious behavior.

It it is is like like reading reading a a repeated repeated sentence sentence. :-)

Collaborator Author

@nikhilsimha I hear you on the added noise hurting readability. I removed all the @params but left a few comments beside the parameters I think aren't fully descriptive from name alone. Let me know if it looks good to you! :)

Comment on lines 41 to 60
def isCachingEnabled(groupBy: GroupBy): Boolean = {
  if (!isCacheSizeConfigured || groupBy.getMetaData == null || groupBy.getMetaData.getName == null) return false

  val groupByName = groupBy.getMetaData.getName
  isCachingEnabledForGroupBy.getOrElse(
    groupByName, {
      groupBy.getMetaData.customJsonLookUp("enable_caching") match {
        case b: Boolean =>
          println(s"Caching is ${if (b) "enabled" else "disabled"} for $groupByName")
          isCachingEnabledForGroupBy.putIfAbsent(groupByName, b)
          b
        case null =>
          println(s"Caching is disabled for $groupByName, enable_caching is not set.")
          isCachingEnabledForGroupBy.putIfAbsent(groupByName, false)
          false
        case _ => false
      }
    }
  )
}
Collaborator Author

Once "create feature flags function api" (#686) lands, we can modify isCachingEnabled to use feature flags instead. That's much better than a parameter in the GroupBy definition, as it allows you to make immediate changes instead of having to modify your GroupBy to enable/disable caching.
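For context, a purely hypothetical sketch of what a flag-based check could look like; the flag store trait and flag name below are assumptions for illustration, not the actual API from #686.

// Hypothetical only: SimpleFlagStore and the flag name are illustrative.
trait SimpleFlagStore {
  def isSet(flagName: String, attributes: Map[String, String]): Boolean
}

def isCachingEnabledViaFlags(flagStore: SimpleFlagStore, groupByName: String): Boolean =
  flagStore.isSet("enable_fetcher_batch_ir_cache", Map("group_by" -> groupByName))

This would let operators flip caching per GroupBy immediately, without touching the GroupBy definition.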

val batchRequestCacheKey =
  BatchIrCache.Key(servingInfo.groupByOps.batchDataset, keys, servingInfo.batchEndTsMillis)
val decodedBytes = decodingFunction(batchBytes)
if (decodedBytes != null)
Collaborator Author

Caffeine doesn't allow storing null, and in general I decided to exclude negative caching for now. Most likely, the keys that end up cached are not new keys and do have batch data (e.g. big merchants, or power users).
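As a small illustration of that constraint (the key and value here are placeholders): Caffeine throws if you try to put a null value, so a failed decode is simply never inserted.

import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Illustration only: Caffeine rejects null values, so there is no negative caching.
val cache: Cache[String, AnyRef] =
  Caffeine.newBuilder().maximumSize(10L).build[String, AnyRef]()

val decoded: AnyRef = null // e.g. batch bytes that failed to decode
if (decoded != null) cache.put("some-key", decoded) // an unguarded put(key, null) would throw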

* FetcherCache is an extension to FetcherBase that provides caching functionality. It caches KV store
* requests to decrease feature serving latency.
* */
trait FetcherCache {
Collaborator Author

Something I considered months ago when making these code changes was to use a class (like "FetcherBaseCached") instead of a trait. This class would inherit from FetcherBase and override its methods to add caching. That would arguably be slightly cleaner because it'd give us two completely separate versions of the Fetcher. Users would be able to use the normal/old Fetcher with no caching if they wanted.

This would require a significant amount of additional refactoring and re-testing, and I don't think it's worth it. Ideally, once this IR cache is merged in, it becomes a core part of the fetcher that users can enable / disable for their GroupBys as necessary. We've already tested the status quo (no caching), so IMO the risks that arise from additional refactors would outweigh the benefits of having a separate class.
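For reference, a minimal sketch of the two shapes being compared (bodies elided; FetcherBaseCached is the hypothetical alternative, not code from this PR):

// Chosen here: caching lives in a trait mixed into the existing fetcher,
// and is toggled per GroupBy rather than by picking a different class.
trait FetcherCache { /* caching helpers */ }
class FetcherBase extends FetcherCache

// Considered and rejected: a parallel subclass that overrides fetch methods.
// class FetcherBaseCached extends FetcherBase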

Collaborator

Another challenge with the FetcherBaseCached approach is it will lead to a profusion of FetcherBaseX and FetcherBaseY classes as we keep adding functionalities to the fetcher in the future.

@caiocamatta-stripe
Collaborator Author

@nikhilsimha and @piyushn-stripe - requesting a first-pass review here. Feel free to assign other people to review it instead.

@caiocamatta-stripe changed the title from "[WIP][CHIP-1] Cache batch IRs in the Fetcher to decrease feature serving latency" to "[CHIP-1] Cache batch IRs in the Fetcher to decrease feature serving latency" on Apr 3, 2024
* The original purpose of having an LRU cache in Chronon is to cache KVStore calls and decoded IRs
* in the Fetcher. This helps decrease feature serving latency.
*/
object LRUCache {
Collaborator Author

Caffeine is technically not LRU but it's similar. I think naming this LRUCache makes it easy to understand what it does (and to contrast it with the existing TTLCache).
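A minimal sketch of the kind of size-bounded Caffeine cache such an LRUCache helper would wrap (the exact builder options in the PR may differ):

import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Caffeine's eviction policy is Window TinyLFU rather than strict LRU,
// but for a size-bounded IR cache it behaves similarly.
def buildBoundedCache[K <: AnyRef, V <: AnyRef](maxSize: Long): Cache[K, V] =
  Caffeine.newBuilder()
    .maximumSize(maxSize) // evict entries once the cache grows past maxSize
    .recordStats()        // expose hit/miss counters for metrics
    .build[K, V]()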

@piyushn-stripe left a comment
Collaborator

Largely looks good to me as we've reviewed and are running this code internally off our fork. Had a couple of minor cosmetic changes.

    groupByName, {
      groupBy.getMetaData.customJsonLookUp("enable_caching") match {
        case b: Boolean =>
          println(s"Caching is ${if (b) "enabled" else "disabled"} for $groupByName")
Collaborator

switch to log (here and others?)

.filter(_.millis >= servingInfo.batchEndTsMillis)
.map(_.bytes)
.getOrElse(null)
// The bulk upload may not have removed older batch values. We manually discard all but the latest one.
Contributor

I like this comment
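A self-contained illustration of the pattern in the snippet above (TimedValue and the helper name are stand-ins): batch values older than batchEndTsMillis are dropped, and only the latest remaining value is kept.

case class TimedValue(millis: Long, bytes: Array[Byte])

def latestBatchBytes(values: Seq[TimedValue], batchEndTsMillis: Long): Array[Byte] =
  values
    .filter(_.millis >= batchEndTsMillis) // older bulk uploads may still linger in the KV store
    .sortBy(_.millis)
    .lastOption                           // keep only the most recent batch value
    .map(_.bytes)
    .orNull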

@caiocamatta-stripe changed the title from "[CHIP-1] Cache batch IRs in the Fetcher to decrease feature serving latency" to "[CHIP-1] Cache batch IRs in the Fetcher" on May 9, 2024
@caiocamatta-stripe
Collaborator Author

Hey @nikhilsimha, this PR is ready for a second pass. I cleaned up some of the comments, resolved conflicts, and changed the code to use Divya's FlagStore [commit].

Not sure how hard it would be, but it'd be nice if Airbnb could benchmark these changes as well (after merging is fine).
