
Potential Cache Stampede with Caffeine Cached Endpoint #11506

Open
demming opened this issue Oct 24, 2022 · 10 comments · May be fixed by #11511

Comments


demming commented Oct 24, 2022

Play Version

2.8.18

API

Scala

Operating System

macOS 12.6

JDK

OpenJDK 19.0.1, target Java 1.8.

Library Dependencies

    libraryDependencies ++= Seq(
      guice,
      caffeine,
      ws,
      "org.scalatestplus.play" %% "scalatestplus-play" % "5.1.0" % Test,
      "com.googlecode.owasp-java-html-sanitizer" % "owasp-java-html-sanitizer" % "20220608.1"
    )

Expected Behavior

Running (with implicit connection reuse)

bombardier -c 100 -d 10s "http://localhost:9000/website?address=http://localhost:8080"

to load- and soak-test the endpoint, which retrieves a website's HTML and sanitizes it using the OWASP Java HTML Sanitizer library, I expect that:

  1. When 100 concurrent inbound connections hit the endpoint, only one cache population occurs, not 100 and not 200.
  2. 99 of the 100 initial concurrent inbound connections are served from the cache.
  3. The JVM does not run out of memory or crash.
  4. Throughput is not significantly different from bare Akka HTTP (underneath sttp/tapir in my case), which performs remarkably well with just a -Xmx64m max heap, occupying up to 250 MB RSS at peak.

Note that I also encountered the same issue with "Response Caching" in ASP.NET Core 6 and 7-rc1/7-rc2, as reported in dotnet/aspnetcore#44696 (comment), but not with the upcoming "Output Caching": there, mitigations are in place, albeit apparently with some performance degradation according to my observations.

Actual Behavior

  1. 200 service invocations take place (the Future appears to be evaluated twice); I added blocking println statements with global counters to make the order of execution visible.
  2. Each inbound connection encounters a cache miss and hence populates the cache by invoking the service, leading to thread starvation (WS calls timing out) and the JVM running out of memory.
  3. The JVM runs out of memory and crashes.
  4. A bare Akka HTTP server with LfuCache produces up to 50% higher throughput (at lower latencies).

To prevent observations 1-3, it suffices to run a single preliminary request against the endpoint to populate the cache; all subsequent concurrent requests for the given key are then served from the cache.

A cache stampede, as described on Wikipedia, has also been blamed for an outage at Facebook.

Reproducible Test Case

Just add this endpoint to the default HomeController

import javax.inject.{Inject, Singleton}
import scala.concurrent.duration._

import play.api.Logging
import play.api.cache.Cached
import play.api.libs.ws.WSClient
import play.api.mvc._

@Singleton
class HomeController @Inject()
(cached                            : Cached,
 ws                                : WSClient,
 override val controllerComponents : ControllerComponents,
) extends AbstractController(controllerComponents) with Logging {

  import org.owasp.html.Sanitizers

  // Global counters used only to trace how often the action body and the
  // upstream request actually run under concurrent load (not thread-safe,
  // which is good enough for this demonstration).
  private var _controllerNumber = 0
  private var _requestNumber    = 0

  val policy = Sanitizers.FORMATTING
                         .and(Sanitizers.LINKS)
                         .and(Sanitizers.TABLES)
                         .and(Sanitizers.BLOCKS)
                         .and(Sanitizers.IMAGES)
                         .and(Sanitizers.STYLES)

  def websiteSanitized(address: String) = cached((reqH: RequestHeader) => f"website-sanitized-${reqH.uri}", duration = 90.seconds) {
    Action.async { implicit request: Request[AnyContent] =>
      _controllerNumber += 1
      logger.info(f"> ${_controllerNumber} controller invocation")
      println(f"> ${_controllerNumber} controller invocation")

      val ok = for {
        response <- {
          _requestNumber += 1
          logger.info(f"> ${_requestNumber} Running HTTP request")
          println(f"> ${_requestNumber} Running HTTP request")

          val upstreamRequest = ws.url(address).withRequestTimeout(10000.millis)
          upstreamRequest.get()
        }
        result = policy.sanitize(response.body)
      } yield {
        Ok(result)
      }

      ok
    }
  }

}

The route is just

GET     /website            controllers.HomeController.websiteSanitized(address)
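
As a quick manual check of the warm-up workaround described above (assuming the default dev port): one preliminary request populates the cache, after which the concurrent burst is served from it.

# Warm the cache for this key first ...
curl "http://localhost:9000/website?address=http://localhost:8080"
# ... then run the burst:
bombardier -c 100 -d 10s "http://localhost:9000/website?address=http://localhost:8080"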

Perfectly working Akka HTTP cache configuration:

import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.http.caching.LfuCache
import akka.http.caching.scaladsl.{Cache, CachingSettings}
import akka.http.scaladsl.model.Uri
import akka.http.scaladsl.server.RouteResult
import scala.concurrent.duration._

implicit val system           = ActorSystem(Behaviors.empty, "website-sanitizer-http-rest-server")
implicit val executionContext = system.executionContext

val defaultCachingSettings = CachingSettings(system)
val lfuCacheSettings =
  defaultCachingSettings.lfuCacheSettings
                        .withInitialCapacity(50)
                        .withMaxCapacity(100)
                        .withTimeToLive(90.seconds)
                        .withTimeToIdle(30.seconds)
val cachingSettings = defaultCachingSettings.withLfuCacheSettings(lfuCacheSettings)
val lfuCache : Cache[Uri, RouteResult] = LfuCache(cachingSettings)
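
For completeness, a minimal sketch of how such a cache is typically wired into a route via akka-http's CachingDirectives (the route body is a placeholder, not the reproducer's sanitizing logic):

import akka.http.scaladsl.model.HttpMethods
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.RequestContext
import akka.http.scaladsl.server.directives.CachingDirectives._

// Cache only GET requests, keyed by request URI.
val keyerFunction: PartialFunction[RequestContext, Uri] = {
  case ctx if ctx.request.method == HttpMethods.GET => ctx.request.uri
}

val route =
  path("website") {
    cache(lfuCache, keyerFunction) {
      complete("sanitized HTML would go here")
    }
  }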
mkurz (Member) commented Oct 25, 2022

Thanks. I added a reproducer here: https://github.com/mkurz/playframework-reproducers/tree/main/playframework-11506
I didn't dig very deep into akka-http, but it seems they use Caffeine's atomic get method, which takes a function to compute an entry.
Currently we use get to read from the cache and set to populate it later, which is not atomic. We might want to switch to Play's getOrElseUpdate method, introduced in Play 2.8, to make that atomic.
@gmethvin What do you think?
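
To illustrate the difference, here is a minimal sketch against a plain Caffeine AsyncCache (fetchAndSanitize is a hypothetical stand-in for the WS call plus sanitizing; this is not Play's actual code). The check-then-set pattern races under load, while Caffeine's atomic get computes each key at most once and lets concurrent callers share the in-flight future:

import java.util.concurrent.{CompletableFuture, Executor}
import java.util.function.BiFunction
import com.github.benmanes.caffeine.cache.{AsyncCache, Caffeine}

object StampedeSketch {
  val cache: AsyncCache[String, String] =
    Caffeine.newBuilder().maximumSize(100).buildAsync[String, String]()

  // Hypothetical expensive operation standing in for the WS call + sanitizing.
  def fetchAndSanitize(key: String): CompletableFuture[String] =
    CompletableFuture.supplyAsync(() => s"<p>sanitized for $key</p>")

  // Racy check-then-set: under load, many callers observe a miss
  // simultaneously and each one invokes fetchAndSanitize.
  def racyGet(key: String): CompletableFuture[String] = {
    val existing = cache.getIfPresent(key)
    if (existing != null) existing
    else {
      val fresh = fetchAndSanitize(key)
      cache.put(key, fresh)
      fresh
    }
  }

  // Atomic: Caffeine runs the loader at most once per key; concurrent
  // callers for the same key share the same in-flight CompletableFuture.
  def atomicGet(key: String): CompletableFuture[String] = {
    val loader: BiFunction[String, Executor, CompletableFuture[String]] =
      (k, _) => fetchAndSanitize(k)
    cache.get(key, loader)
  }
}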

mkurz (Member) commented Oct 25, 2022

Given the akka-http example code above:
Akka sets up the expiry times only once: https://github.com/akka/akka-http/blob/27598478891acc679c82654f1b0b7dbc6dc04647/akka-http-caching/src/main/scala/akka/http/caching/LfuCache.scala#L75-L83
In Play, however, users of Cached can set different expiry times for different action methods, which makes it harder for us to come up with the expiry time before the action has even run in order to pass it to getOrElseUpdate. We might want to work directly with the Caffeine API to make that work. Not sure yet, however...

mkurz (Member) commented Oct 25, 2022

Thinking out loud:
Part of the solution, IMHO, would be an overload of getOrElseUpdate in the cache API that looks like:

def getOrElseUpdate[A: ClassTag](key: String, expiration: A => Duration)(orElse: => Future[A]): Future[A]

which would mean we can set the expiration based on the calculated value. In our case we have a Result that we want to cache, so we can use result attributes (#10172) to store the calculated expiry duration and then set it for the cache like:

cache.getOrElseUpdate(resultKey, result => result.attrs(Attr.EXPIRE)) // Attr.EXPIRE would be a Scala Duration
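
For what it's worth, Caffeine itself supports per-entry, value-derived lifetimes via its Expiry interface, which is one way such an overload could be backed. A rough illustration only (ValueBasedExpiry is a made-up name, and this is not the actual patch):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration.Duration
import com.github.benmanes.caffeine.cache.{Caffeine, Expiry}

// Caffeine asks this policy how long each freshly computed entry should
// live, so the TTL can depend on the cached value itself.
final class ValueBasedExpiry[V](lifetime: V => Duration) extends Expiry[String, V] {
  override def expireAfterCreate(key: String, value: V, currentTime: Long): Long =
    lifetime(value).toNanos
  override def expireAfterUpdate(key: String, value: V, currentTime: Long, currentDuration: Long): Long =
    lifetime(value).toNanos
  override def expireAfterRead(key: String, value: V, currentTime: Long, currentDuration: Long): Long =
    currentDuration // reads do not extend an entry's lifetime
}

// Usage sketch: every entry lives for a duration derived from its value.
val cache = Caffeine.newBuilder()
  .expireAfter(new ValueBasedExpiry[String](_ => Duration(90, TimeUnit.SECONDS)))
  .build[String, String]()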

mkurz (Member) commented Oct 25, 2022

Good news: the Java API is not affected, because it has used getOrElseUpdate since #7624:

return cacheApi.getOrElseUpdate(key, () -> delegate.call(req), duration);

I added reproducer-java, which confirms that it works as intended.

demming (Author) commented Oct 26, 2022

Oh wow, thank you. Let me know if I can assist you with this.

mkurz (Member) commented Oct 26, 2022

@demming I published a Play SNAPSHOT version with a potential fix. I applied it to the reproducer projects I host, in a branch called fixed-11506: https://github.com/mkurz/playframework-reproducers/tree/fixed-11506/playframework-11506

I assume you still have the projects you used to test this behaviour? Would it be possible for you to upgrade to that snapshot version and let me know whether it fixes the problem in your projects as well, and how it compares to the Akka one performance-wise?
You need to add resolvers to the build.sbt and project/plugins.sbt files, as in this diff: mkurz/playframework-reproducers@main...fixed-11506 As you can see, the snapshot is called 2.9.0-M2+98-0abc3bd3-SNAPSHOT. You probably also need to set the Scala version for the plugins to 2.12.17, as shown in the diff, otherwise you will likely run into dependency conflicts.

Would be nice if you let me know how that turns out. Thanks!

BTW: The patches are #11516, #11511 and #11515; they still need work.

mkurz added this to the 2.9.0 milestone Oct 27, 2022
demming (Author) commented Oct 27, 2022

@mkurz Amazing, great work, thank you. The mitigation seems to work correctly. I'm going to test it more thoroughly over the next couple of days.

I haven't had a chance yet to look at your fix. But have you considered implementing all of the standard mitigation strategies, so users get the option to choose which one to optimize for? Apparently Spring and Quarkus only do locking. (See the sketch below for one of the other standard strategies.)
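
For reference, the mitigations the Wikipedia article describes are locking, external recomputation, and probabilistic early expiration. A minimal sketch of the last one (the "XFetch" rule; all names here are illustrative, and this is not Play's fix):

import scala.util.Random

// Probabilistic early expiration ("XFetch"): each reader may volunteer to
// recompute *before* the entry expires, with a probability that rises as the
// expiry approaches, so recomputations are spread out instead of stampeding.
// `delta` is the observed recomputation cost, `expiry` and `now` are
// timestamps in the same unit, and `beta` > 1 favours earlier recomputation.
def shouldRecompute(delta: Double, expiry: Double, now: Double, beta: Double = 1.0): Boolean =
  // Random.nextDouble() may return 0.0; production code should guard the log.
  now - delta * beta * math.log(Random.nextDouble()) >= expiry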

As for performance,

Statistics        Avg      Stdev        Max
  Reqs/sec      4571.74     972.35    8405.81
  Latency       21.85ms    41.08ms      0.88s
  Latency Distribution
     50%    18.57ms
     75%    24.12ms
     90%    30.43ms
     95%    35.39ms
     99%    58.70ms
  HTTP codes:
    1xx - 0, 2xx - 45800, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     3.07GB/s

it's still (significantly) lagging behind Akka HTTP's (with only -Xmx64m):

Statistics        Avg      Stdev        Max
  Reqs/sec      7622.06    1691.19   13461.33
  Latency       13.11ms    14.30ms   736.26ms
  Latency Distribution
     50%    11.89ms
     75%    15.43ms
     90%    19.82ms
     95%    23.29ms
     99%    35.53ms
  HTTP codes:
    1xx - 0, 2xx - 76297, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     5.12GB/s

I'll see what comes up in profiling.

mkurz (Member) commented Oct 27, 2022

Did you run in prod mode?

demming (Author) commented Oct 28, 2022

> Did you run in prod mode?

Yep

demming (Author) commented Oct 28, 2022

Having now looked at the profiles, I see a lot of framework overhead occupying additional threads, but the two major relevant worker threads are not as well synchronized in Play as in my Akka implementation, which is perhaps due to Akka's Futures being wrapped in Cats Effect IO there, which orchestrates the thread pool better. In the Play profiles I captured, the worker threads are suspended for longer periods of time. This might also be caused by framework threads carrying significant background workload that is unrelated to the cached endpoint but consumes the thread pool and, especially, memory. I don't know yet how to tackle it.

In addition, for some reason the populating writes to the cache sometimes take too much time in Play; I captured one such case in the profile. I will share what I have when I'm back at work.

I reduced the workload to one connection for cache population, followed by a 5-second constant burst with one connection.
