
Potential Cache Stampede with Caffeine Cached Endpoint #11506

Open
demming opened this issue Oct 24, 2022 · 10 comments · May be fixed by #11511

Comments


demming commented Oct 24, 2022

Play Version

2.8.18

API

Scala

Operating System

macOS 12.6

JDK

OpenJDK 19.0.1, target Java 1.8.

Library Dependencies

    libraryDependencies ++= Seq(
      guice,
      caffeine,
      ws,
      "org.scalatestplus.play" %% "scalatestplus-play" % "5.1.0" % Test,
      "com.googlecode.owasp-java-html-sanitizer" % "owasp-java-html-sanitizer" % "20220608.1"
    )

Expected Behavior

Running (with implicit connection reuse)

bombardier -c 100 -d 10s "http://localhost:9000/website?address=http://localhost:8080"

to load- and soak-test the endpoint, which retrieves a website's HTML and sanitizes it using the OWASP Java HTML Sanitizer library, I expect that:

  1. When 100 concurrent inbound connections hit the endpoint, only one cache population occurs, not 100 and not 200.
  2. 99 of the 100 initial concurrent inbound connections are served from the cache.
  3. The JVM does not run out of memory or crash.
  4. Throughput is not significantly different from bare Akka HTTP (underneath sttp/tapir in my case), which performs remarkably well with just a -Xmx64m max heap, occupying up to 250 MB RSS at peak.

Note that I also encountered the same issue with "Response Caching" in ASP.NET Core 6 and 7-rc1/7-rc2, as reported in dotnet/aspnetcore#44696 (comment), but not with the upcoming "Output Caching": there, mitigations are in place, albeit apparently with some performance degradation according to my observations.

Actual Behavior

  1. 200 service invocations take place (the Future appears to be evaluated twice); I added blocking println statements with global counters to make the order of execution visible.
  2. Each inbound connection encounters a cache miss and hence populates the cache by invoking the service, leading to thread starvation (WS calls timing out) and the JVM running out of memory.
  3. The JVM runs out of memory and crashes.
  4. A bare Akka HTTP server with LfuCache produces up to 50% higher throughput (at lower latencies).

To prevent observations 1-3, it suffices to run a single preliminary request against the endpoint to populate the cache; all subsequent concurrent requests for the given key are then served from the cache.

A cache stampede, as described on Wikipedia, has also been blamed for an outage at Facebook.

Reproducible Test Case

Just add this endpoint to the default HomeController

import javax.inject.{Inject, Singleton}
import scala.concurrent.duration._

import play.api.Logging
import play.api.cache.Cached
import play.api.libs.ws.WSClient
import play.api.mvc._

@Singleton
class HomeController @Inject()
(cached                            : Cached,
 ws                                : WSClient,
 override val controllerComponents : ControllerComponents,
) extends AbstractController(controllerComponents) with Logging {

  import org.owasp.html.Sanitizers

  // Global counters used only to trace how often the action body and the
  // upstream request actually run under concurrent load (not thread-safe,
  // which is good enough for this demonstration).
  private var _controllerNumber = 0
  private var _requestNumber    = 0

  val policy = Sanitizers.FORMATTING
                         .and(Sanitizers.LINKS)
                         .and(Sanitizers.TABLES)
                         .and(Sanitizers.BLOCKS)
                         .and(Sanitizers.IMAGES)
                         .and(Sanitizers.STYLES)

  def websiteSanitized(address: String) = cached((reqH: RequestHeader) => f"website-sanitized-${reqH.uri}", duration = 90.seconds) {
    Action.async { implicit request: Request[AnyContent] =>
      _controllerNumber += 1
      logger.info(f"> ${_controllerNumber} controller invocation")
      println(f"> ${_controllerNumber} controller invocation")

      val ok = for {
        response <- {
          _requestNumber += 1
          logger.info(f"> ${_requestNumber} Running HTTP request")
          println(f"> ${_requestNumber} Running HTTP request")

          val upstreamRequest = ws.url(address).withRequestTimeout(10000.millis)
          upstreamRequest.get()
        }
        result = policy.sanitize(response.body)
      } yield {
        Ok(result)
      }

      ok
    }
  }

}

The route is just

GET     /website            controllers.HomeController.websiteSanitized(address)
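
As a quick manual check of the warm-up workaround described above (assuming the default dev port): one preliminary request populates the cache, after which the concurrent burst is served from it.

# Warm the cache for this key first ...
curl "http://localhost:9000/website?address=http://localhost:8080"
# ... then run the burst:
bombardier -c 100 -d 10s "http://localhost:9000/website?address=http://localhost:8080"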

Perfectly working Akka HTTP cache configuration:

import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.http.caching.LfuCache
import akka.http.caching.scaladsl.{Cache, CachingSettings}
import akka.http.scaladsl.model.Uri
import akka.http.scaladsl.server.RouteResult
import scala.concurrent.duration._

implicit val system           = ActorSystem(Behaviors.empty, "website-sanitizer-http-rest-server")
implicit val executionContext = system.executionContext

val defaultCachingSettings = CachingSettings(system)
val lfuCacheSettings =
  defaultCachingSettings.lfuCacheSettings
                        .withInitialCapacity(50)
                        .withMaxCapacity(100)
                        .withTimeToLive(90.seconds)
                        .withTimeToIdle(30.seconds)
val cachingSettings = defaultCachingSettings.withLfuCacheSettings(lfuCacheSettings)
val lfuCache : Cache[Uri, RouteResult] = LfuCache(cachingSettings)
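
For completeness, a minimal sketch of how such a cache is typically wired into a route via akka-http's CachingDirectives (the route body is a placeholder, not the reproducer's sanitizing logic):

import akka.http.scaladsl.model.HttpMethods
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.RequestContext
import akka.http.scaladsl.server.directives.CachingDirectives._

// Cache only GET requests, keyed by request URI.
val keyerFunction: PartialFunction[RequestContext, Uri] = {
  case ctx if ctx.request.method == HttpMethods.GET => ctx.request.uri
}

val route =
  path("website") {
    cache(lfuCache, keyerFunction) {
      complete("sanitized HTML would go here")
    }
  }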
mkurz (Member) commented Oct 25, 2022

Thanks. I added a reproducer here: https://github.com/mkurz/playframework-reproducers/tree/main/playframework-11506
I didn't dig very deep into akka-http, but it seems they use Caffeine's atomic get method, which takes a function to compute an entry.
Currently we use get to read from the cache and set to populate it later, which is not atomic. We might want to switch to Play's getOrElseUpdate method, introduced in Play 2.8, to make that atomic.
@gmethvin What do you think?
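
To illustrate the difference, here is a minimal sketch against a plain Caffeine AsyncCache (fetchAndSanitize is a hypothetical stand-in for the WS call plus sanitizing; this is not Play's actual code). The check-then-set pattern races under load, while Caffeine's atomic get computes each key at most once and lets concurrent callers share the in-flight future:

import java.util.concurrent.{CompletableFuture, Executor}
import java.util.function.BiFunction
import com.github.benmanes.caffeine.cache.{AsyncCache, Caffeine}

object StampedeSketch {
  val cache: AsyncCache[String, String] =
    Caffeine.newBuilder().maximumSize(100).buildAsync[String, String]()

  // Hypothetical expensive operation standing in for the WS call + sanitizing.
  def fetchAndSanitize(key: String): CompletableFuture[String] =
    CompletableFuture.supplyAsync(() => s"<p>sanitized for $key</p>")

  // Racy check-then-set: under load, many callers observe a miss
  // simultaneously and each one invokes fetchAndSanitize.
  def racyGet(key: String): CompletableFuture[String] = {
    val existing = cache.getIfPresent(key)
    if (existing != null) existing
    else {
      val fresh = fetchAndSanitize(key)
      cache.put(key, fresh)
      fresh
    }
  }

  // Atomic: Caffeine runs the loader at most once per key; concurrent
  // callers for the same key share the same in-flight CompletableFuture.
  def atomicGet(key: String): CompletableFuture[String] = {
    val loader: BiFunction[String, Executor, CompletableFuture[String]] =
      (k, _) => fetchAndSanitize(k)
    cache.get(key, loader)
  }
}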

mkurz (Member) commented Oct 25, 2022

Given the akka-http example code above:
Akka sets up the expiry times only once: https://github.com/akka/akka-http/blob/27598478891acc679c82654f1b0b7dbc6dc04647/akka-http-caching/src/main/scala/akka/http/caching/LfuCache.scala#L75-L83
In Play, however, users of Cached can set different expiry times for different action methods, which makes it harder for us to come up with the expiry time before the action has even run in order to pass it to getOrElseUpdate. We might want to work directly with the Caffeine API to make that work. Not sure yet, however...

mkurz (Member) commented Oct 25, 2022

Thinking out loud:
Part of the solution, IMHO, would be an overload of getOrElseUpdate in the cache API that looks like:

def getOrElseUpdate[A: ClassTag](key: String, expiration: A => Duration)(orElse: => Future[A]): Future[A]

which would mean we can set the expiration based on the calculated value. In our case we have a Result that we want to cache, so we can use result attributes (#10172) to store the calculated expiry duration and then set it for the cache like:

cache.getOrElseUpdate(resultKey, result => result.attrs(Attr.EXPIRE)) // Attr.EXPIRE would be a Scala Duration
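
For what it's worth, Caffeine itself supports per-entry, value-derived lifetimes via its Expiry interface, which is one way such an overload could be backed. A rough illustration only (ValueBasedExpiry is a made-up name, and this is not the actual patch):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration.Duration
import com.github.benmanes.caffeine.cache.{Caffeine, Expiry}

// Caffeine asks this policy how long each freshly computed entry should
// live, so the TTL can depend on the cached value itself.
final class ValueBasedExpiry[V](lifetime: V => Duration) extends Expiry[String, V] {
  override def expireAfterCreate(key: String, value: V, currentTime: Long): Long =
    lifetime(value).toNanos
  override def expireAfterUpdate(key: String, value: V, currentTime: Long, currentDuration: Long): Long =
    lifetime(value).toNanos
  override def expireAfterRead(key: String, value: V, currentTime: Long, currentDuration: Long): Long =
    currentDuration // reads do not extend an entry's lifetime
}

// Usage sketch: every entry lives for a duration derived from its value.
val cache = Caffeine.newBuilder()
  .expireAfter(new ValueBasedExpiry[String](_ => Duration(90, TimeUnit.SECONDS)))
  .build[String, String]()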

mkurz (Member) commented Oct 25, 2022

Good news: the Java API is not affected, because it has used getOrElseUpdate since #7624:

return cacheApi.getOrElseUpdate(key, () -> delegate.call(req), duration);

I added reproducer-java, which confirms that it works as intended.

demming (Author) commented Oct 26, 2022

Oh wow, thank you. Let me know if I can assist you with this.

mkurz (Member) commented Oct 26, 2022

@demming I published a Play SNAPSHOT version with a potential fix. I applied it to the reproducer projects I host, in a branch called fixed-11506: https://github.com/mkurz/playframework-reproducers/tree/fixed-11506/playframework-11506

I assume you still have the projects you used to test this behaviour? Would it be possible for you to upgrade to that snapshot version and let me know whether it fixes the problem in your projects as well, and how it compares to the Akka one performance-wise?
You need to add resolvers to the build.sbt and project/plugins.sbt files, as in this diff: mkurz/playframework-reproducers@main...fixed-11506 As you can see, the snapshot is called 2.9.0-M2+98-0abc3bd3-SNAPSHOT. You probably also need to set the Scala version for the plugins to 2.12.17, as shown in the diff, otherwise you will likely run into dependency conflicts.

Would be nice if you let me know how that turns out. Thanks!

BTW: The patches are #11516, #11511 and #11515; they still need work.

mkurz added this to the 2.9.0 milestone Oct 27, 2022
demming (Author) commented Oct 27, 2022

@mkurz Amazing, great work, thank you. The mitigation seems to work correctly. I'm going to test it more thoroughly over the next couple of days.

I haven't had a chance yet to look at your fix. But have you considered implementing all of the standard mitigation strategies, so users get the option to choose which one to optimize for? Apparently Spring and Quarkus only do locking. (See the sketch below for one of the other standard strategies.)
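
For reference, the mitigations the Wikipedia article describes are locking, external recomputation, and probabilistic early expiration. A minimal sketch of the last one (the "XFetch" rule; all names here are illustrative, and this is not Play's fix):

import scala.util.Random

// Probabilistic early expiration ("XFetch"): each reader may volunteer to
// recompute *before* the entry expires, with a probability that rises as the
// expiry approaches, so recomputations are spread out instead of stampeding.
// `delta` is the observed recomputation cost, `expiry` and `now` are
// timestamps in the same unit, and `beta` > 1 favours earlier recomputation.
def shouldRecompute(delta: Double, expiry: Double, now: Double, beta: Double = 1.0): Boolean =
  // Random.nextDouble() may return 0.0; production code should guard the log.
  now - delta * beta * math.log(Random.nextDouble()) >= expiry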

As for performance,

Statistics        Avg      Stdev        Max
  Reqs/sec      4571.74     972.35    8405.81
  Latency       21.85ms    41.08ms      0.88s
  Latency Distribution
     50%    18.57ms
     75%    24.12ms
     90%    30.43ms
     95%    35.39ms
     99%    58.70ms
  HTTP codes:
    1xx - 0, 2xx - 45800, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     3.07GB/s

it's still (significantly) lagging behind Akka HTTP's (with only -Xmx64m):

Statistics        Avg      Stdev        Max
  Reqs/sec      7622.06    1691.19   13461.33
  Latency       13.11ms    14.30ms   736.26ms
  Latency Distribution
     50%    11.89ms
     75%    15.43ms
     90%    19.82ms
     95%    23.29ms
     99%    35.53ms
  HTTP codes:
    1xx - 0, 2xx - 76297, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     5.12GB/s

I'll see what comes up in profiling.

mkurz (Member) commented Oct 27, 2022

Did you run in prod mode?

demming (Author) commented Oct 28, 2022

> Did you run in prod mode?

Yep

demming (Author) commented Oct 28, 2022

Having now looked at the profiles, I see a lot of framework overhead occupying additional threads, but the two major relevant worker threads are not as well synchronized in Play as in my Akka implementation, which is perhaps due to Akka's Futures being wrapped in Cats Effect IO there, which orchestrates the thread pool better. In the Play profiles I captured, the worker threads are suspended for longer periods of time. This might also be caused by framework threads carrying significant background workload that is unrelated to the cached endpoint but consumes the thread pool and, especially, memory. I don't know yet how to tackle it.

In addition, for some reason the populating writes to the cache sometimes take too much time in Play; I captured one such case in the profile. I will share what I have when I'm back at work.

I reduced the workload to one connection for cache population, followed by a 5-second constant burst with one connection.
