
Configurable MaxLifeTime Variance #2035

Open · wants to merge 2 commits into base: dev

Conversation

Avarga954

Refactoring MaxLifeTime connection variance to use a configurable setting rather than a static 2.5%.
2.5% will remain the default; however, developers can choose to override this with a more aggressive percentage if desired.

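For a rough illustration of the intended usage, here is a minimal sketch. The setter name is assumed from the new maxLifetimeVariance field in the diff below and may not match the PR's actual API surface:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Hypothetical usage sketch; the variance setter is assumed from the new
// maxLifetimeVariance field in this PR and may differ from the final API.
public class VarianceConfigExample
{
   public static void main(String[] args)
   {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb");
      config.setMaxLifetime(1_800_000);        // 30 minutes, the existing default
      // config.setMaxLifetimeVariance(30.0);  // proposed: spread retirements by up to 30% instead of 2.5%

      try (HikariDataSource ds = new HikariDataSource(config)) {
         // use the data source
      }
   }
}
```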
@Avarga954
Author

Just wanted to leave a comment linking back to the open issue where I discussed the problem in a bit more detail.

#2004

@Avarga954
Author

Here is an example of the default Hikari variance, which has a maximum of 2.5%:
[Screenshot: connection close timings with the default 2.5% maximum variance]

Here is an example with a much more aggressive maximum variance of 30%:
[Screenshot: connection close timings with a 30% maximum variance]

The large spikes from mass closing of connections are smoothed out considerably.

@@ -53,6 +53,7 @@ public class HikariConfig implements HikariConfigMXBean
private static final long SOFT_TIMEOUT_FLOOR = Long.getLong("com.zaxxer.hikari.timeoutMs.floor", 250L);
private static final long IDLE_TIMEOUT = MINUTES.toMillis(10);
private static final long MAX_LIFETIME = MINUTES.toMillis(30);
private static final double MAX_LIFETIME_VARIANCE = 2.5;
Contributor


Maybe this constant should be called DEFAULT_MAX_LIFETIME_VARIANCE instead, as it appears to be the default value?

Author


Great suggestion, went ahead and changed it.

Contributor


I don't want to give you false expectations that your PR will be merged soon. I don't have push permissions; I'm just an open-source engineer who happened to look at this PR because I want to see how responsive the maintainers are.

Contributor


And thank you for your kind response to my review. I wish more developers appreciated code review suggestions :).

@@ -66,6 +67,7 @@ public class HikariConfig implements HikariConfigMXBean
private volatile long idleTimeout;
private volatile long leakDetectionThreshold;
private volatile long maxLifetime;
private volatile double maxLifetimeVariance;
Contributor


I personally try to avoid floating-point numbers as much as I can, as they cause a lot of trouble (see "What Every Computer Scientist Should Know About Floating-Point Arithmetic").

I understand why you went for them: you need to be able to represent a value like 2.5%, which has a decimal point and so isn't representable as an integer percentage.

I have a suggestion: how about using per mille instead of percent to represent the variance? That way you could change the type to int, which is a lot friendlier than double, and you would represent 2.5% as 25‰. If more precision is ever needed, the unit could be made finer, e.g. parts per million, though IMO per mille is enough.

Please let me know your thoughts.
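As a minimal sketch of the per-mille idea — the class, field, and method names here are hypothetical, not code from this PR:

```java
// Hypothetical sketch: store the variance as an integer per-mille value
// instead of a floating-point percentage. Names are not from this PR.
final class PerMilleVarianceSketch
{
   // 25 per mille == 2.5%
   private volatile int maxLifetimeVariancePerMille = 25;

   long maxVarianceMillis(long maxLifetimeMs)
   {
      // integer arithmetic only: e.g. 30 minutes * 25 / 1000 = 45 seconds
      return maxLifetimeMs * maxLifetimeVariancePerMille / 1000;
   }
}
```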

@sksamuel

Would love to see this merged.

@sean-

sean- commented Sep 22, 2023

Can someone please bless and merge this PR? Thundering herds are problematic and can only be prevented by the clients staggering their connections.

@lfbayer
Collaborator

lfbayer commented Nov 4, 2023

I think I would prefer a solution to this problem that doesn't require the user to configure anything. I don't see any reason why we wouldn't want to avoid these timing issues for everyone.

It seems like there has to be a way to implement this that would allow the connection close timings to be as near a flat graph as possible.

If, for example, the lifetime was 10 minutes and the connection count is 10, to be flat we would want a connection closed once per minute (rather than 10 near the end of 10 minutes). If the system is up and running with a fixed-size pool this would be fairly straightforward. But even with a fixed-size pool, you don't really want a connection closed after exactly one minute of running after startup, so there would need to be some heuristic to ease into the target flat graph. In the case of a non-fixed-size pool it might be a little less straightforward.

In this scenario, even when closing one connection per minute, we would likely still want a little randomness in there, because if there were two applications both running HikariCP pointed at the same database, there is a chance they would be on identical timings, which would result in a similarly spiky graph as far as the database is concerned. I haven't actually thought this out, but my instinct tells me that would be a concern.

Any thoughts?
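To make the idea concrete, here is a rough sketch — purely illustrative, not code from this PR; the class name, method, and exact formula are assumptions — of spreading retirements evenly across a fixed-size pool with a small jitter so two pools don't land on identical timings:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: spread connection retirements evenly across a fixed-size
// pool, plus a small jitter so two pools pointed at the same database don't
// end up on identical timings. Names and the exact formula are assumptions.
final class StaggeredLifetimeSketch
{
   static long staggeredLifetimeMs(long maxLifetimeMs, int poolSize, int connectionIndex)
   {
      long spacing = maxLifetimeMs / poolSize;  // e.g. 10 min / 10 connections = one retirement per minute
      long jitter = ThreadLocalRandom.current().nextLong(Math.max(1, spacing / 4));
      return (connectionIndex + 1) * spacing - jitter;  // connection i retires around minute i+1
   }
}
```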

@Avarga954
Author

> I think I would prefer a solution to this problem that doesn't require the user to configure anything. I don't see any reason why we wouldn't want to avoid these timing issues for everyone.

My thought process for this PR was to make this configuration option transparent by default, so as not to impact any existing deployments' Hikari config settings on upgrade. However, this could be achieved by simply changing the default value I provided to something more aggressive than the 2.5% it has out of the box today.

> It seems like there has to be a way to implement this that would allow the connection close timings to be as near a flat graph as possible.
>
> If, for example, the lifetime was 10 minutes and the connection count is 10, to be flat we would want a connection closed once per minute (rather than 10 near the end of 10 minutes). If the system is up and running with a fixed-size pool this would be fairly straightforward. But even with a fixed-size pool, you don't really want a connection closed after exactly one minute of running after startup, so there would need to be some heuristic to ease into the target flat graph. In the case of a non-fixed-size pool it might be a little less straightforward.
>
> In this scenario, even when closing one connection per minute, we would likely still want a little randomness in there, because if there were two applications both running HikariCP pointed at the same database, there is a chance they would be on identical timings, which would result in a similarly spiky graph as far as the database is concerned. I haven't actually thought this out, but my instinct tells me that would be a concern.
>
> Any thoughts?

My thoughts are that with this approach it may take a very long time for larger connection pools to expire connections through existing settings like maxLifeTime. If you had 100 connections instead of 10, then at that one-connection-per-minute cadence it would take approximately 100 minutes to cycle your whole connection pool, which may not be desirable for all use cases. Hypothetically we could go with some sort of config setting that lets connections refresh simultaneously based on a percentage of total connections to speed this up; however, I still think you should expose these settings to the user to allow for flexibility across different implementations.

Irrespective of that, I feel this approach would require a substantial, but not impossible, change to the Hikari connection reuse flow. Currently, if a connection exceeds its configured maxLifeTime + variance and is no longer in use, it is immediately removed from the pool and released/renewed, whereas this approach would be batching connections for the next cycle.

The solution you proposed above would indeed deliver an even more normalized cadence than a customizable variance, but I think both solutions could work with the correct Hikari config.

@lfbayer
Collaborator

lfbayer commented Nov 5, 2023

My gut tells me that there isn't really any reason one user would want a different configuration from another user. And exposing it as a configuration parameter, rather than just fixing the algorithm or behavior to be correct™, means that many users who would want a more smoothed-out graph wouldn't get it without having to investigate.

One thought experiment: is it conceivable that if we changed the default variance calculation to use your specific desired value, it would actually not cause a problem for anyone else?

You mention a target variance of 30%. What would be the negative of just changing the current hardcoded variance to be 30% rather than 2.5%?

Also, what is the MaxLifeTime setting you are currently using? Maybe it makes sense to just change the current code to increase the variance in proportion to the MaxLifeTime.

If your change were just an unofficially supported system property, I would feel more comfortable merging it. But the addition of a configuration property means it becomes API and somewhat locks us into this implementation.
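For comparison, a minimal sketch of what the system-property approach could look like, following the same pattern as the existing SOFT_TIMEOUT_FLOOR line in the diff above; the property name and wrapper class are hypothetical, not an actual HikariCP API:

```java
// Sketch of an "unofficially supported system property", mirroring the pattern of
// SOFT_TIMEOUT_FLOOR in HikariConfig. Property name and class are hypothetical.
final class MaxLifetimeVarianceProperty
{
   static final double MAX_LIFETIME_VARIANCE =
      Double.parseDouble(System.getProperty("com.zaxxer.hikari.maxLifetimeVariance", "2.5"));
}
```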

@ssherwood

@lfbayer this PR seems stuck unfortunately.

I opened the original request to decouple the variance from the hard-coded 2.5% value. I personally am fine with a parameter to get around the hard-coded value, and I am also fine with any implementation that gets us close to a better solution. I just don't want to see progress stopped without a solid alternative plan.

FYI, this problem frequently shows up in environments where applications are deployed to k8s and all scaled up at the same time. The tight variance on a short maxLifetime often leads to a thundering herd of connections and observable spikes in latency on performance graphs.

@Avarga954
Author

My apologies for not following this PR closer.

I agree with @ssherwood's point. At this point I would just like to get some solution for this merged so that others running k8s deployments can mitigate thundering herds. I'm happy to refactor this PR in order to accomplish this.

I have been using 30% via a forked branch for my production application successfully for about a year now. If there is consensus that changing the default system value from 2.5% to 30% is the best approach, I am happy to refactor to just change this value. I understand @lfbayer's concerns about configuration locking us into an API; personally I prefer the flexibility of configuration, but I'm open to either solution as long as we can make progress here.
