250 millisecond limits are too long #1735
Comments
I am not totally against lowering the limit, but I want to understand A) what you believe a valid minimum value is, and B) whether you fully understand the likelihood of false failures resulting in the ejection of all connections from the pool. Saying "typically less than 10ms" or "typically less than 25ms" does not provide much assurance. The issue is not with the mean, it is with the outliers.

The 250ms limit is what it is because of the danger of running into OS scheduler delays under load. For example, take a look at this question on Stack Overflow, where you can see scheduler delays as large as ~491ms on a Linux 4.20.0 kernel. No scheduler is perfect, and the so-called "Completely Fair Scheduler" is as susceptible as any to variations that occasionally result in a "perfect storm" of conditions causing an excessive delay. A validation threshold that is too low could ultimately result in ejecting a large number of connections from the pool at once.

Are you using metrics (Prometheus, metrics.io, or DropWizard)? Before I would make a change to this timeout, I would want to see a graph of the connection acquisition times (e.g. hikaricp_connection_acquired_nanos) in your pool over the period of a day. The maximum acquisition time recorded is what I am interested in, obviously.

Lastly, I would like to ask: what is the root cause of the "infrequent broken TCP connections" you are experiencing? As an alternative to lowering the validation timeout, you might consider trying the new
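The scheduler-delay argument above can be demonstrated with a tiny probe in the style of the jHiccup methodology mentioned later in this thread: sleep for a fixed interval and record how much longer the wakeup actually took. A minimal, self-contained sketch (class and method names are illustrative, not from any library):

```java
// Minimal jHiccup-style probe: the overshoot beyond the requested sleep
// captures scheduler/GC-induced pauses that averages hide. It is the max
// of this overshoot, not the mean, that a validation threshold must survive.
public class HiccupProbe {
    public static long maxOvershootNanos(int iterations, long sleepMillis)
            throws InterruptedException {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMillis);
            long elapsed = System.nanoTime() - start;
            // Overshoot = how late the wakeup was versus the requested sleep.
            worst = Math.max(worst, elapsed - sleepMillis * 1_000_000L);
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        long worst = maxOvershootNanos(100, 1);
        System.out.printf("worst hiccup: %.2f ms%n", worst / 1e6);
    }
}
```

On an idle machine the worst hiccup is typically well under a millisecond; under heavy load or during a GC pause it can jump by orders of magnitude, which is exactly the outlier risk described above.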
Hey, thank you for your answer, and sorry for not noticing it earlier. We currently measure all JDK safepoints in 300+ services via "GC logs" -> mtail -> Prometheus. We don't currently measure all scheduling pauses, but will probably add the jHiccup methodology somewhere (as a Java agent) soon. We also put some effort into provisioning additional buffers in our Kubernetes clusters to make high load on any worker node improbable.

Here is a connection acquire time graph with a network blip (a TCP connection getting broken) affecting a couple of connections:

Why do those network blips happen in AWS? To be honest, we don't know yet; it is quite hard to figure out. But when it happens we would like to fail fast, so that no threads hang. Our network mesh (Envoy) has retry capability and even request hedging, where on critical paths we issue multiple simultaneous requests to different downstream pods to tackle tail latency; whichever answers first, its response is used. It's OK to get bad responses from some downstreams, those will be retried or hedged. It is, however, important not to hang anywhere.

We have keepalive enabled in our Postgres and MariaDB driver configs. We use a wrapper library around Hikari for configuration. Service owners don't specify Hikari timeouts (nor most other properties) directly, but provide the following parameters instead:
And then, the wrapper library sets parameters based on those, for example Hikari.validationTimeout = longestStwPauses + connectionLatency. And of course it does some basic validation, e.g. that validationTimeout does not end up smaller than connectionAcquireTimeout (otherwise Hikari can "test" only one connection with com.zaxxer.hikari.pool.PoolBase#isConnectionAlive and surface an error if something is wrong with that connection).

Now, there are services where the 250ms Hikari limit will be fine. But then there are others, let's say mission-critical credit card processing services, where effort has been put into the codebase to keep GC pauses and any other pauses minimal. In those, "validationTimeout = longestStwPauses + connectionLatency" would end up around 50ms.

What are good defaults for Hikari's limits? I would say those 250ms are perfect, as engineers and organizations generally don't measure and/or pay attention to STW pauses. I saw some report that less than 1% of companies are using things like Shenandoah and ZGC today, while we have been on those for two years now. I was just wondering whether we could get some kind of backdoor to lower them, for engineers who know what they are doing. Maybe even something as simple as specific system properties.
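The derivation described above could be sketched as follows. This is purely illustrative: the parameter names (longestStwPauses, connectionLatency, connectionAcquireTimeout) follow the comment, not any real library API, and the assumption that the "basic validation" clamps the value upward is mine.

```java
// Hypothetical sketch of the wrapper library's timeout derivation:
// validationTimeout = longestStwPauses + connectionLatency, clamped so it
// does not end up smaller than the acquire timeout.
public class TimeoutDerivation {
    public static long validationTimeoutMs(long longestStwPausesMs,
                                           long connectionLatencyMs,
                                           long connectionAcquireTimeoutMs) {
        long derived = longestStwPausesMs + connectionLatencyMs;
        // Basic sanity check from the comment: do not let validationTimeout
        // fall below the acquire timeout, otherwise a single slow liveness
        // test could surface an error.
        return Math.max(derived, connectionAcquireTimeoutMs);
    }

    public static void main(String[] args) {
        // The mission-critical case from the comment: a ~40ms worst STW pause
        // plus ~10ms connection latency yields the ~50ms figure mentioned.
        System.out.println(validationTimeoutMs(40, 10, 30)); // prints 50
    }
}
```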
@onukristo I am publishing a new release now. It supports a new system property,

On an unrelated note, my company recently switched from ZGC to Shenandoah because we uncovered a slow leak in ZGC. Since switching, and observing memory closely, we have detected no such leaks in Shenandoah. Your mileage may vary.
Thanks 👍 We mainly use MariaDB and Postgres, and both drivers allow sub-second network timeouts. We actually "abuse" the drivers' network timeouts heavily for deadlines: every time we run a query, we set a query timeout (there are round-trip-avoiding heuristics for Postgres; MariaDB allows SET ... FOR statements) and a slightly higher network timeout (a cheap operation not needing round trips), based on how much of the call-chain-wide deadline we have left. That again avoids hangs, but it also stops any useless resource usage in the databases once the deadline is exceeded. So: fault tolerance and no retry bombs. All of that is abstracted into a library of JDBC wrappers, so the query and network timeouts are set automatically.

Good info on Shenandoah, I needed some motivation to A/B test it again against ZGC.
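The deadline propagation described above can be sketched with the standard JDBC hooks. Assumptions here are mine: the remaining call-chain deadline is known in milliseconds, a fixed 50ms margin makes the network timeout "slightly higher" than the query timeout, and the driver honors sub-second network timeouts (as the comment says MariaDB and Postgres do). Class, constant, and method names are illustrative.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Executor;

// Sketch: derive per-query JDBC timeouts from the remaining deadline, so a
// slow database cannot hold a thread past the call chain's budget.
public class DeadlineTimeouts {
    // Network timeout sits slightly above the query timeout, per the comment.
    static final long NETWORK_MARGIN_MS = 50;

    // Statement.setQueryTimeout takes whole seconds, so round the deadline up.
    public static int queryTimeoutSeconds(long remainingDeadlineMs) {
        return (int) ((remainingDeadlineMs + 999) / 1000);
    }

    public static int networkTimeoutMs(long remainingDeadlineMs) {
        return (int) (remainingDeadlineMs + NETWORK_MARGIN_MS);
    }

    // Both calls are standard JDBC; setNetworkTimeout needs an Executor for
    // the driver's internal timeout handling.
    public static void applyDeadline(Connection conn, Statement stmt,
                                     Executor executor, long remainingDeadlineMs)
            throws SQLException {
        stmt.setQueryTimeout(queryTimeoutSeconds(remainingDeadlineMs));
        conn.setNetworkTimeout(executor, networkTimeoutMs(remainingDeadlineMs));
    }
}
```

In the wrapper-library setup described above, applyDeadline would be called transparently before each query rather than by service code.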
100% speculation on my part, but we do quite a lot of JNI (JNA actually), which I suspect may be related. I tried to get heap dumps for a differential comparison, but discovered that heap dumps are currently not supported by either ZGC or Shenandoah, so that diagnostic path was thwarted. I ended up just watching the system over a one-week period and detected no leak with Shenandoah, whereas it was obvious with ZGC. I will note that the leak was "off heap", meaning not in the Java heap but in the JVM process heap. That much I could tell. Shenandoah also now has lower average pause times than ZGC, at least in my profiling.
HikariCP v4.0.3 should now be resolvable from the Maven Central repository.
We have been running with "-XX:ZCollectionInterval=60". It helps to clean up/finalize native things where some resources have not been correctly closed through a finally {} block (let's say file descriptors). It works as expected and still seems to be very lightweight on CPU; no visible overhead added by that setting.
We have been using ZGC for a year; our databases are close by, reactive, and tuned.
Our STW pauses are typically less than 10 ms, and the 99.9th percentiles of database connection and validation times are also below 10 ms.
We use modern kernels with TCP Fast Open.
Our endpoints' SLOs are typically less than 25ms.
Hikari, however, has hardcoded 250 ms limits for connection acquisition/creation and validation.
Infrequent broken TCP connections are pretty painful with thresholds as high as 250 ms.
Our current workarounds are:
Those workarounds can break when something changes in a new HikariCP version.
Please consider lowering the hardcoded 250 ms thresholds and/or making them configurable (through system properties?).
Thanks.
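The configurability being requested could be as simple as reading a system property with the current hardcoded value as the fallback. A minimal sketch; the property name here is purely hypothetical, not an actual HikariCP property:

```java
// Sketch of a system-property "backdoor" for the hardcoded 250 ms floor.
// "example.hikari.minTimeoutMs" is a made-up name for illustration only.
public class ConfigurableFloor {
    static final long DEFAULT_MIN_TIMEOUT_MS = 250;

    // Long.getLong reads and parses a system property, falling back to the
    // supplied default when it is absent or unparsable.
    public static long minTimeoutMs() {
        return Long.getLong("example.hikari.minTimeoutMs", DEFAULT_MIN_TIMEOUT_MS);
    }

    public static void main(String[] args) {
        System.out.println(minTimeoutMs());
    }
}
```

Launching with `-Dexample.hikari.minTimeoutMs=50` would then lower the floor for operators who have measured their pauses, while everyone else keeps the safe 250 ms default.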