Potential Cache Stampede with Caffeine Cached Endpoint #11506
Comments
Thanks. I added a reproducer here: https://github.com/mkurz/playframework-reproducers/tree/main/playframework-11506 |
Given the akka-http example code above: |
Thinking out loud: def getOrElseUpdate[A: ClassTag](key: String, expiration: A => Duration)(orElse: => Future[A]): Future[A], which would mean we can set the expiration based on the calculated value. In our case we would have cache.getOrElseUpdate(resultKey, result => result.attrs(Attr.EXPIRE)) // Attr.EXPIRE would be a Scala Duration |
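To make the proposed signature concrete, here is a minimal sketch of a cache whose expiration is derived from the computed value. It is not Play's cache API: `ValueTtlCache`, the injectable `now` clock and the `deadlineMillis` field are hypothetical names, the store is a plain `ConcurrentHashMap`, and the type parameter sits on the class instead of the method to keep the example short.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical in-memory cache (an assumption for illustration, not Play's AsyncCacheApi).
// `now` returns the current time in milliseconds and is injectable for testing.
final class ValueTtlCache[A](now: () => Long = () => System.currentTimeMillis())(
    implicit ec: ExecutionContext) {

  // The entry holds the Future itself; the deadline is only known once the value exists.
  private final class Entry(val value: Future[A]) {
    @volatile var deadlineMillis: Long = Long.MaxValue // "never" while still in flight
  }

  private val store = new ConcurrentHashMap[String, Entry]()

  def getOrElseUpdate(key: String, expiration: A => Duration)(orElse: => Future[A]): Future[A] = {
    val promise = Promise[A]()
    val fresh   = new Entry(promise.future)

    // Atomically keep a live entry or install ours; only one caller can install.
    val current = store.compute(
      key,
      (_: String, old: Entry) => if (old != null && now() < old.deadlineMillis) old else fresh
    )

    if (current eq fresh) {
      // We installed the entry, so we are the only caller that evaluates `orElse`.
      val computed = Future
        .fromTry(Try(orElse)) // also captures a synchronous throw from `orElse`
        .flatten
        .map { v =>
          fresh.deadlineMillis = expiration(v) match {
            case finite: FiniteDuration => now() + finite.toMillis
            case _                      => Long.MaxValue // e.g. Duration.Inf: keep forever
          }
          v
        }
        .transform { t =>
          if (t.isFailure) store.remove(key, fresh) // do not cache failures
          t
        }
      promise.completeWith(computed)
    }
    current.value
  }
}
```

With such a cache the call from the comment above would read `cache.getOrElseUpdate(resultKey, result => result.attrs(Attr.EXPIRE))(computeResult)`, where `computeResult` stands for whatever produces the `Future`.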
Good news is the Java API is not affected because it uses
I added reproducer-java, which confirms that it works as intended.
|
Oh wow, thank you. Let me know if I can assist you in this. |
@demming I published a Play SNAPSHOT version with a potential fix. I applied it to the reproducer projects I host in a branch called I assume you still have the projects available that you used to test this behaviour(?) Would it be possible for you to upgrade to that snapshot version and let me know whether it fixes the problem in your projects as well, and how it compares to the Akka one performance-wise? It would be nice if you let me know how that turned out. Thanks! BTW: The patches are #11516, #11511 and #11515 and still need work. |
@mkurz Amazing, great work, thank you. The mitigation seems to work right. I'm going to test it more thoroughly the next couple of days. I haven't had a chance yet to take a look at your fix. But have you considered implementing all of the standard means of mitigation so the user gets an option to optimize for it? Apparently Spring and Quarkus do only locking. As for performance,

Statistics        Avg      Stdev        Max
  Reqs/sec      4571.74     972.35    8405.81
  Latency       21.85ms    41.08ms      0.88s
  Latency Distribution
     50%    18.57ms
     75%    24.12ms
     90%    30.43ms
     95%    35.39ms
     99%    58.70ms
  HTTP codes:
    1xx - 0, 2xx - 45800, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     3.07GB/s

it's still (significantly) lagging behind Akka HTTP's (with only

Statistics        Avg      Stdev        Max
  Reqs/sec      7622.06    1691.19   13461.33
  Latency       13.11ms    14.30ms   736.26ms
  Latency Distribution
     50%    11.89ms
     75%    15.43ms
     90%    19.82ms
     95%    23.29ms
     99%    35.53ms
  HTTP codes:
    1xx - 0, 2xx - 76297, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     5.12GB/s

I'll see what comes up in profiling. |
Did you run in prod mode? |
Yep |
Having now looked at the profiles, I see there's a lot of framework overhead occupying additional threads, but the two major relevant worker threads are not as well synchronized in Play as in my Akka implementation, which is perhaps due to Akka's Futures being wrapped in Cats Effect IO, which should orchestrate the thread pool better. In the Play profiles I captured, the worker threads are suspended for longer periods of time. This might also be caused by framework threads with significant background workload that aren't related to the cached endpoint but consume the thread pool and especially memory. I don't know yet how to tackle it. In addition, for some reason, the populating writes to the cache sometimes take too much time in Play; I captured one such case in the profile. Will share what I have when I'm back at work. I reduced the workload to one connection for cache population and then a 5s constant burst with one connection. |
Play Version
2.8.18
API
Scala
Operating System
macOS 12.6
JDK
OpenJDK 19.0.1, target Java 1.8.
Library Dependencies
Expected Behavior
Running (with implicit connection reuse)
bombardier -c 100 -d 10s "http://localhost:9000/website?address=http://localhost:8080"
for load and soak testing the endpoint which is expected to retrieve a website's HTML and sanitize it using the OWASP HtmlSanitizer library, I expect that
-Xmx64m
max heap occupying up to 250m RSS at peak.
Note that I also encountered the same issue with "Response Caching" in ASP.NET Core 6 and 7-rc1, 7-rc2, as reported in dotnet/aspnetcore#44696 (comment), but not with the upcoming "Output Caching", where mitigations are in place, albeit apparently with a performance degradation according to my observations.
Actual Behavior
Future
appears to get evaluated twice), so I added blocking println
statements to make the order of execution clear using global counters.
LfuCache
produces up to 50% higher throughput (at lower latencies).
To prevent observations 1-3 from happening, it suffices to run a preliminary request against that endpoint to populate the cache, so that all subsequent concurrent requests are served from that cache for the given key.
Cache stampede as described on Wikipedia was also claimed to cause an outage at Facebook.
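To illustrate the behaviour described above independently of Play and Caffeine, here is a small sketch contrasting a check-then-compute cache with one that caches the `Future` itself, so concurrent misses share one in-flight computation. All names (`StampedeSketch`, `NaiveCache`, `SingleFlightCache`, `slowLoad`, `burst`) are hypothetical, and the 100 ms sleep merely stands in for fetching and sanitizing a page.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object StampedeSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Naive: look up, compute on a miss, store when done.
  // Every caller that misses before the first load finishes starts its own load.
  final class NaiveCache[A] {
    private val store = new ConcurrentHashMap[String, A]()
    def getOrElseUpdate(key: String)(load: => Future[A]): Future[A] =
      Option(store.get(key)) match {
        case Some(v) => Future.successful(v)
        case None    => load.map { v => store.put(key, v); v }
      }
  }

  // Single-flight: the Future is stored atomically, so only the caller that
  // wins putIfAbsent evaluates `load`; everyone else awaits the same Future.
  final class SingleFlightCache[A] {
    private val store = new ConcurrentHashMap[String, Future[A]]()
    def getOrElseUpdate(key: String)(load: => Future[A]): Future[A] = {
      val promise  = Promise[A]()
      val existing = store.putIfAbsent(key, promise.future)
      if (existing != null) existing
      else {
        promise.completeWith(Future.unit.flatMap(_ => load))
        // Evict failed computations so that a later request can retry.
        promise.future.failed.foreach(_ => store.remove(key, promise.future))
        promise.future
      }
    }
  }

  // Stand-in for the expensive endpoint body; counts how often it actually runs.
  def slowLoad(counter: AtomicInteger): Future[String] = Future {
    counter.incrementAndGet()
    Thread.sleep(100) // simulated work; blocking is acceptable only in a demo
    "<html>sanitized</html>"
  }

  // Fires n requests back to back and waits for all of them.
  def burst(n: Int)(call: => Future[String]): Seq[String] =
    Await.result(Future.sequence(Seq.fill(n)(call)), 30.seconds)
}
```

Sending a burst at an empty `NaiveCache` runs `slowLoad` many times, which matches the observation that the computation is evaluated more than once, while `SingleFlightCache` runs it once. Warming the naive cache with a single preliminary request avoids the repeated loads as well, as noted above.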
Reproducible Test Case
Just add this endpoint to the default
HomeController
The route is just
Perfectly working Akka HTTP cache configuration: