
Configurable MaxLifeTime Variance #2035

Open · wants to merge 2 commits into base: dev

Conversation

Avarga954

Refactoring MaxLifeTime connection variance to use a configurable setting rather than a static 2.5%.
2.5% will remain the default; however, developers can choose to override this with a more aggressive percentage if desired.

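For a rough illustration of the intended usage, here is a minimal sketch. The setter name is assumed from the new maxLifetimeVariance field in the diff below and may not match the PR's actual API surface:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Hypothetical usage sketch; the variance setter is assumed from the new
// maxLifetimeVariance field in this PR and may differ from the final API.
public class VarianceConfigExample
{
   public static void main(String[] args)
   {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb");
      config.setMaxLifetime(1_800_000);        // 30 minutes, the existing default
      // config.setMaxLifetimeVariance(30.0);  // proposed: spread retirements by up to 30% instead of 2.5%

      try (HikariDataSource ds = new HikariDataSource(config)) {
         // use the data source
      }
   }
}
```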
@Avarga954
Author

Just wanted to leave a comment linking back to the open issue where I discussed the problem in a bit more detail.

#2004

@Avarga954
Author

Here is an example of the default Hikari variance, which has a maximum of 2.5%:
[Screenshot: connection close timings with the default 2.5% maximum variance]

Here is an example with a much more aggressive maximum variance of 30%:
[Screenshot: connection close timings with a 30% maximum variance]

The large spikes from mass closing of connections are smoothed out considerably.

@@ -53,6 +53,7 @@ public class HikariConfig implements HikariConfigMXBean
private static final long SOFT_TIMEOUT_FLOOR = Long.getLong("com.zaxxer.hikari.timeoutMs.floor", 250L);
private static final long IDLE_TIMEOUT = MINUTES.toMillis(10);
private static final long MAX_LIFETIME = MINUTES.toMillis(30);
private static final double MAX_LIFETIME_VARIANCE = 2.5;
Contributor


Maybe this constant should be called DEFAULT_MAX_LIFETIME_VARIANCE instead, as it appears to be the default value?

Author


Great suggestion, went ahead and changed it.

Contributor


I don't want to give you false expectations that your PR will be merged soon. I don't have push permissions; I'm just an open-source engineer who happened to look at this PR because I want to see how responsive the maintainers are.

Contributor


And thank you for your kind response to my review. I wish more developers appreciated code review suggestions :).

@@ -66,6 +67,7 @@ public class HikariConfig implements HikariConfigMXBean
private volatile long idleTimeout;
private volatile long leakDetectionThreshold;
private volatile long maxLifetime;
private volatile double maxLifetimeVariance;
Contributor


I personally try to avoid floating-point numbers as much as I can, as they cause a lot of trouble (see "What Every Computer Scientist Should Know About Floating-Point Arithmetic").

I understand why you went for them: you need to be able to represent a value like 2.5%, which has a decimal point and so isn't representable as an integer percentage.

I have a suggestion: how about using per mille instead of percent to represent the variance? That way you could change the type to int, which is a lot friendlier than double, and you would represent 2.5% as 25‰. If more precision is ever needed, the unit could be made finer, e.g. parts per million, though IMO per mille is enough.

Please let me know your thoughts.
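As a minimal sketch of the per-mille idea — the class, field, and method names here are hypothetical, not code from this PR:

```java
// Hypothetical sketch: store the variance as an integer per-mille value
// instead of a floating-point percentage. Names are not from this PR.
final class PerMilleVarianceSketch
{
   // 25 per mille == 2.5%
   private volatile int maxLifetimeVariancePerMille = 25;

   long maxVarianceMillis(long maxLifetimeMs)
   {
      // integer arithmetic only: e.g. 30 minutes * 25 / 1000 = 45 seconds
      return maxLifetimeMs * maxLifetimeVariancePerMille / 1000;
   }
}
```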

@sksamuel

Would love to see this merged.

@sean-

sean- commented Sep 22, 2023

Can someone please bless and merge this PR? Thundering herds are problematic and can only be prevented by the clients staggering their connections.

@lfbayer
Collaborator

lfbayer commented Nov 4, 2023

I think I would prefer a solution to this problem that doesn't require the user to configure anything. I don't see any reason why we wouldn't want to avoid these timing issues for everyone.

It seems like there has to be a way to implement this that would allow the connection close timings to be as near a flat graph as possible.

If, for example, the lifetime was 10 minutes and the connection count is 10, to be flat we would want a connection closed once per minute (rather than 10 near the end of 10 minutes). If the system is up and running with a fixed-size pool this would be fairly straightforward. But even with a fixed-size pool, you don't really want a connection closed after exactly one minute of running after startup, so there would need to be some heuristic to ease into the target flat graph. In the case of a non-fixed-size pool it might be a little less straightforward.

In this scenario, even when closing one connection per minute, we would likely still want a little randomness in there, because if there were two applications both running HikariCP pointed at the same database, there is a chance they would be on identical timings, which would result in a similarly spiky graph as far as the database is concerned. I haven't actually thought this out, but my instinct tells me that would be a concern.

Any thoughts?
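To make the idea concrete, here is a rough sketch — purely illustrative, not code from this PR; the class name, method, and exact formula are assumptions — of spreading retirements evenly across a fixed-size pool with a small jitter so two pools don't land on identical timings:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: spread connection retirements evenly across a fixed-size
// pool, plus a small jitter so two pools pointed at the same database don't
// end up on identical timings. Names and the exact formula are assumptions.
final class StaggeredLifetimeSketch
{
   static long staggeredLifetimeMs(long maxLifetimeMs, int poolSize, int connectionIndex)
   {
      long spacing = maxLifetimeMs / poolSize;  // e.g. 10 min / 10 connections = one retirement per minute
      long jitter = ThreadLocalRandom.current().nextLong(Math.max(1, spacing / 4));
      return (connectionIndex + 1) * spacing - jitter;  // connection i retires around minute i+1
   }
}
```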

@Avarga954
Author

> I think I would prefer a solution to this problem that doesn't require the user to configure anything. I don't see any reason why we wouldn't want to avoid these timing issues for everyone.

My thought process for this PR was to make this configuration option transparent by default, so as not to impact any existing deployments' Hikari config settings on upgrade. However, this could be achieved by simply changing the default value I provided to something more aggressive than the 2.5% it has out of the box today.

> It seems like there has to be a way to implement this that would allow the connection close timings to be as near a flat graph as possible.
>
> If, for example, the lifetime was 10 minutes and the connection count is 10, to be flat we would want a connection closed once per minute (rather than 10 near the end of 10 minutes). If the system is up and running with a fixed-size pool this would be fairly straightforward. But even with a fixed-size pool, you don't really want a connection closed after exactly one minute of running after startup, so there would need to be some heuristic to ease into the target flat graph. In the case of a non-fixed-size pool it might be a little less straightforward.
>
> In this scenario, even when closing one connection per minute, we would likely still want a little randomness in there, because if there were two applications both running HikariCP pointed at the same database, there is a chance they would be on identical timings, which would result in a similarly spiky graph as far as the database is concerned. I haven't actually thought this out, but my instinct tells me that would be a concern.
>
> Any thoughts?

My thoughts are that with this approach it may take a very long time for larger connection pools to expire connections through existing settings like maxLifeTime. If you had 100 connections instead of 10, then at that one-connection-per-minute cadence it would take approximately 100 minutes to cycle your whole connection pool, which may not be desirable for all use cases. Hypothetically we could go with some sort of config setting that lets connections refresh simultaneously based on a percentage of total connections to speed this up; however, I still think you should expose these settings to the user to allow for flexibility across different implementations.

Irrespective of that, I feel this approach would require a substantial, but not impossible, change to the Hikari connection reuse flow. Currently, if a connection exceeds its configured maxLifeTime + variance and is no longer in use, it is immediately removed from the pool and released/renewed, whereas this approach would be batching connections for the next cycle.

The solution you proposed above would indeed deliver an even more normalized cadence than a customizable variance, but I think both solutions could work with the correct Hikari config.

@lfbayer
Collaborator

lfbayer commented Nov 5, 2023

My gut tells me that there isn't really any reason one user would want a different configuration from another user. And exposing it as a configuration parameter, rather than just fixing the algorithm or behavior to be correct™, means that many users who would want a more smoothed-out graph wouldn't get it without having to investigate.

One thought experiment: is it conceivable that if we changed the default variance calculation to use your specific desired value, it would actually not cause a problem for anyone else?

You mention a target variance of 30%. What would be the negative of just changing the current hardcoded variance to be 30% rather than 2.5%?

Also, what is the MaxLifeTime setting you are currently using? Maybe it makes sense to just change the current code to increase the variance in proportion to the MaxLifeTime.

If your change were just an unofficially supported system property, I would feel more comfortable merging it. But the addition of a configuration property means it becomes API and somewhat locks us into this implementation.
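For comparison, a minimal sketch of what the system-property approach could look like, following the same pattern as the existing SOFT_TIMEOUT_FLOOR line in the diff above; the property name and wrapper class are hypothetical, not an actual HikariCP API:

```java
// Sketch of an "unofficially supported system property", mirroring the pattern of
// SOFT_TIMEOUT_FLOOR in HikariConfig. Property name and class are hypothetical.
final class MaxLifetimeVarianceProperty
{
   static final double MAX_LIFETIME_VARIANCE =
      Double.parseDouble(System.getProperty("com.zaxxer.hikari.maxLifetimeVariance", "2.5"));
}
```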

@ssherwood

@lfbayer this PR seems stuck unfortunately.

I opened the original request to decouple the variance from the hard-coded 2.5% value. I personally am fine with a parameter to get around the hard-coded value, and I am also fine with any implementation that gets us close to a better solution. I just don't want to see progress stopped without a solid alternative plan.

FYI, this problem frequently shows up in environments where applications are deployed to k8s and all scaled up at the same time. The tight variance on a short maxLifetime often leads to a thundering herd of connections and observable spikes in latency on performance graphs.

@Avarga954
Author

My apologies for not following this PR closer.

I agree with @ssherwood's point. At this point I would just like to get some solution for this merged so that others running k8s deployments can mitigate thundering herds. I'm happy to refactor this PR in order to accomplish this.

I have been using 30% via a forked branch for my production application successfully for about a year now. If there is consensus that changing the default system value from 2.5% to 30% is the best approach, I am happy to refactor to just change this value. I understand @lfbayer's concerns about configuration locking us into an API; personally I prefer the flexibility of configuration, but I'm open to either solution as long as we can make progress here.
