
250 millisecond limits are too long #1735

Closed
onukristo opened this issue Feb 15, 2021 · 7 comments · Fixed by octawizard/padles-api#32, enonic/lib-sql#28, navikt/fplos#871 or navikt/fptilbake#1253

Comments

@onukristo

We have been using ZGC for a year; our databases are close by, responsive, and tuned.

Our STW pauses are typically less than 10 ms, and the 99.9th percentiles of database connection and validation times are also below 10 ms.
We use modern kernels with TCP Fast Open.

Our endpoints' SLOs are typically less than 25ms.

HikariCP, however, has hardcoded 250 ms minimums for connection acquisition/creation and validation timeouts.

Infrequent broken TCP connections are pretty painful with high thresholds like 250 ms.

Our current workarounds are:

// Workaround 1: write the timeout fields directly via reflection, bypassing
// HikariConfig's setters, which reject values below 250 ms.
Field connectionTimeoutField = HikariConfig.class.getDeclaredField("connectionTimeout");
connectionTimeoutField.setAccessible(true);
connectionTimeoutField.setLong(hikariDataSource, connectionAcquireTimeoutMs);

Field validationTimeoutField = HikariConfig.class.getDeclaredField("validationTimeout");
validationTimeoutField.setAccessible(true);
validationTimeoutField.setLong(hikariDataSource, validationTimeoutMs);

// Workaround 2: redefine HikariConfig#validateNumerics with Byte Buddy so it no
// longer clamps the values back to the hardcoded minimums.
ByteBuddyAgent.install();
new ByteBuddy()
    .redefine(HikariConfig.class)
    .method(ElementMatchers.named("validateNumerics"))
    .intercept(FixedValue.value("Overwritten"))
    .make()
    .load(HikariConfig.class.getClassLoader(), ClassReloadingStrategy.fromInstalledAgent());

These workarounds can break whenever something changes in a new HikariCP version.

Please consider lowering the hardcoded 250 ms thresholds and/or making them configurable (through system properties?).

Thanks.

@brettwooldridge
Owner

brettwooldridge commented Feb 17, 2021

I am not totally against lowering the limit, but I want to understand A) what you believe a valid minimum value is, and B) whether you fully understand the likelihood of false failures resulting in the ejection of all connections from the pool.

Saying "typically less than 10ms" or "typically less than 25ms" does not provide much assurance. The issue is not with the mean, it is with the outliers.

The 250ms limit is what it is because of the danger of running into OS scheduler delays under load. For example, take a look at this question on Stack Overflow. There you can see scheduler delays as large as ~491ms on a Linux 4.20.0 kernel. No scheduler is perfect, and the so-called "Completely Fair Scheduler" is as susceptible as any to variations that occasionally result in a "perfect storm" of conditions causing an excessive delay. A validation threshold that is too low could ultimately result in ejecting a large number of connections from the pool at once.

Are you using metrics (Prometheus, metrics.io, or Dropwizard)? Before I would make a change to this timeout, I would want to see a graph of connection acquisition times (e.g. hikaricp_connection_acquired_nanos) for your pool over the period of a day. The maximum acquisition time recorded is what I am interested in, obviously.
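
For reference, a minimal sketch of wiring that up, assuming the HikariCP Prometheus tracker and the Prometheus simpleclient are on the classpath; the pool name and JDBC URL are placeholders:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.metrics.prometheus.PrometheusMetricsTrackerFactory;

HikariConfig config = new HikariConfig();
config.setPoolName("orders-db");                             // placeholder pool name
config.setJdbcUrl("jdbc:postgresql://db.example:5432/app");  // placeholder URL
config.setMetricsTrackerFactory(new PrometheusMetricsTrackerFactory());
// hikaricp_connection_acquired_nanos is then exported per pool and can be
// graphed over a day to find the true maximum acquisition time.
HikariDataSource ds = new HikariDataSource(config);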

Lastly, I would like to ask, what is the root cause of the "infrequent broken TCP connections" you are experiencing? As an alternative to lowering the validation timeout, you might consider trying the new keepaliveTime parameter to keep a connection from being terminated by the remote server, and to proactively eject and replace connections in the background if they are found to be dead.
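
For what it's worth, a minimal sketch of what that might look like; the interval and lifetime values here are illustrative, not recommendations:

import java.util.concurrent.TimeUnit;
import com.zaxxer.hikari.HikariConfig;

HikariConfig config = new HikariConfig();
// keepaliveTime (HikariCP 4.0+) runs a keep-alive check on idle connections in the
// background and replaces dead ones proactively, instead of discovering them at
// acquire time. It must be less than maxLifetime (and at least 30 seconds in 4.x).
config.setKeepaliveTime(TimeUnit.SECONDS.toMillis(30));
config.setMaxLifetime(TimeUnit.MINUTES.toMillis(10));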

@onukristo
Author

onukristo commented Mar 2, 2021

Hey, thank you for your answer and sorry for not noticing it before.

We currently measure all jdk safepoints in 300+ services as "gc logs"->mtail->prometheus.

One high RPM service:
[safepoint pause graph]

We don't currently measure all scheduling pauses, but we will probably add the jHiccup methodology somewhere (as a Java agent) soon. We do put some effort into our Kubernetes clusters to provision additional headroom, to make high load on any worker node improbable.

Here is a connection acquire time graph with a network blip (a TCP connection getting broken) happening for a couple of connections:
[connection acquire time graphs]

Why do those network blips happen in AWS? To be honest, we don't know yet; it is quite hard to figure out. But when it happens we would like to fail fast, so that no threads are left hanging.

Our network mesh (Envoy) has retry capability and even request hedging: on critical paths we send multiple requests at the same time to different downstream pods to tackle "tail lag", and whichever answers first, its response is used. It is OK to get bad responses from some downstreams, since those will be retried or hedged. It is, however, important not to hang anywhere.

We have keepalive enabled in our Postgres and MariaDB driver configs.
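
For illustration, that is roughly the following at the JDBC URL level; the hosts and database names are placeholders:

String postgresUrl = "jdbc:postgresql://db.example:5432/app?tcpKeepAlive=true"; // pgjdbc
String mariaDbUrl = "jdbc:mariadb://db.example:3306/app?tcpKeepAlive=true";     // MariaDB Connector/J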

We use a wrapper library around Hikari for configuration. Service owners don't specify Hikari timeouts (nor most other properties) directly, but provide the following parameters instead:

  • longestStwPauses - Specifies the longest stop-the-world (STW) pauses (usually garbage-collection (GC) pauses) the application experiences.
  • connectionAcquireTimeout - Specifies how long you are willing to wait for a connection from the pool.
  • connectionLatency - Specifies how long cheap database operations, such as connecting and validating a connection, usually take.

The wrapper library then sets the Hikari parameters based on those, for example Hikari.validationTimeout = longestStwPauses + connectionLatency. It also does some basic validation, e.g. that validationTimeout does not end up larger than connectionAcquireTimeout (otherwise Hikari can "test" only one connection with com.zaxxer.hikari.pool.PoolBase#isConnectionAlive per acquisition attempt and return an error if something is wrong with that single connection).
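
Roughly, the mapping looks like this (a simplified sketch with hypothetical names, not the actual wrapper code):

import com.zaxxer.hikari.HikariConfig;

// Simplified sketch, not the real wrapper library.
static HikariConfig deriveHikariConfig(long longestStwPausesMs,
                                       long connectionAcquireTimeoutMs,
                                       long connectionLatencyMs) {
    long validationTimeoutMs = longestStwPausesMs + connectionLatencyMs;
    // Keep validation strictly below the acquire timeout, so more than one
    // connection can be tested within a single getConnection() attempt.
    if (validationTimeoutMs >= connectionAcquireTimeoutMs) {
        throw new IllegalArgumentException("validationTimeout must be below connectionAcquireTimeout");
    }
    HikariConfig config = new HikariConfig();
    config.setConnectionTimeout(connectionAcquireTimeoutMs);
    config.setValidationTimeout(validationTimeoutMs); // currently rejected below the 250 ms floor
    return config;
}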

Now, there are services where the 250 ms Hikari limit is fine. But then there are others, let's say mission-critical credit card processing services, where effort has been put into the codebase to keep GC pauses and any other pauses minimal. In those, "validationTimeout = longestStwPauses + connectionLatency" would end up around 50 ms.

What are good defaults for the Hikari limits? I would say those 250 ms values are perfect, as in general engineers and organizations don't measure and/or pay attention to STW pauses. I saw a report that less than 1% of companies are using collectors like Shenandoah and ZGC today, while we have been on them for two years now.

I was just wondering if we could get some kind of backdoor to lower them, for engineers who know what they are doing? Maybe even something as simple as specific system properties.

@brettwooldridge
Owner

brettwooldridge commented Mar 3, 2021

@onukristo I am publishing a new release now. It supports a new system property, com.zaxxer.hikari.timeoutMs.floor, that sets the minimum allowed milliseconds for connectionTimeout and validationTimeout. Keep in mind that HikariCP accomplishes these timeouts via get/setNetworkTimeout() on a Connection. Not all drivers implement those methods, in which case sub-second timeouts will have no practical effect.
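
A usage sketch (assuming the property is read when HikariConfig is first loaded, so it must be set before then; the millisecond values are only illustrative):

// Either pass -Dcom.zaxxer.hikari.timeoutMs.floor=50 on the JVM command line,
// or set it programmatically before the first HikariConfig is constructed:
System.setProperty("com.zaxxer.hikari.timeoutMs.floor", "50");

HikariConfig config = new HikariConfig();
config.setConnectionTimeout(100); // previously rejected below 250 ms
config.setValidationTimeout(60);  // keep this below connectionTimeout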

On an unrelated note, my company switched from ZGC to Shenandoah recently because we uncovered a slow leak in ZGC. Since switching, and observing memory closely, we have detected no such leaks in Shenandoah. Your mileage may vary.

@onukristo
Author

Thanks 👍

We mainly use MariaDB and Postgres, and both drivers allow sub-second network timeouts. We actually "abuse" the drivers' network timeouts heavily for deadlines: every time we run a query, we set a query timeout (there are round-trip-avoiding heuristics for Postgres; MariaDB allows SET ... FOR statements) and a slightly higher network timeout (a cheap operation requiring no round trips), based on how much of the call-chain-wide deadline we have left. That again avoids hangs, but also stops useless resource usage in the databases when the deadline is exceeded. So, fault tolerance and no retry bombs. All of that is abstracted into a JDBC-wrapper-based library, so the query and network timeouts are set automatically.
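
The pattern is roughly this (a hypothetical helper using plain JDBC calls for brevity rather than the driver-specific tricks mentioned above, not our actual wrapper; the 200 ms margin is illustrative):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Executor;

// remainingDeadlineMs is whatever is left of the call-chain-wide deadline.
void executeWithDeadline(Connection conn, Executor netTimeoutExecutor,
                         String sql, long remainingDeadlineMs) throws SQLException {
    // Query timeout in whole seconds (JDBC API granularity), at least 1 second.
    int queryTimeoutSec = (int) Math.max(1, remainingDeadlineMs / 1000);
    // Network timeout slightly above the deadline, as a safety net against hangs.
    conn.setNetworkTimeout(netTimeoutExecutor, (int) remainingDeadlineMs + 200);
    try (Statement stmt = conn.createStatement()) {
        stmt.setQueryTimeout(queryTimeoutSec);
        stmt.execute(sql);
    }
}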

Good info on Shenandoah, I needed some motivation to A/B test it again against ZGC.

@brettwooldridge
Owner

brettwooldridge commented Mar 3, 2021

100% speculation on my part, but we do quite a lot of JNI (JNA, actually), which I suspect may be related. I tried to get heap dumps for differential comparison, but discovered that heap dumps are currently not supported by either ZGC or Shenandoah, so that diagnostic path was thwarted. I ended up just watching the system over a one-week period and detected no leak with Shenandoah, whereas it was obvious with ZGC. I will note that the leak was "off heap", meaning not in the Java heap but rather in the JVM process heap. That much I could tell. Shenandoah also now has lower average pause times than ZGC, at least in my profiling.

@brettwooldridge
Owner

HikariCP v4.0.3 should be resolvable now from the maven central repository.

@onukristo
Author

We have been running with "-XX:ZCollectionInterval=60". It helps clean up/finalize native things where some resources have not been correctly closed through a finally block (file descriptors, for example). It works as expected and still seems very lightweight on CPU; no visible overhead is added by that setting.
