Couchbase health check fails due to timeout #14685
Are you referring to #13879? That's the opposite problem, where Couchbase is down and the health indicator would hang. In your case, if you are certain that Couchbase is up, then it's taking too long to respond and the indicator thinks it's down. You can use `management.health.couchbase.timeout` to give it longer to respond.
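For example, a minimal sketch assuming the 2.0.x `management.health.couchbase.timeout` property named above, set in `application.properties`:

```properties
# Give the Couchbase health check up to 10 seconds to respond
# (value in milliseconds).
management.health.couchbase.timeout=10000
```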
I played around with various different timeout values and I had to change it to 60000ms, and even then I still see occasional failures.
Thanks. I'd be rather concerned about the performance of your Couchbase cluster if a timeout of 60000ms still results in occasional issues. Do you see similar response times for application queries against the cluster? With regards to the management timeout, Couchbase's docs explain the reasoning behind it and its 75000ms default.
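For reference, the management timeout itself is tunable on the SDK side. A minimal sketch, assuming the Couchbase Java SDK 2.x environment builder:

```java
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

class CouchbaseEnvConfig {

    // Build an SDK environment with an explicit management timeout.
    // 75000ms mirrors the documented default; management operations can
    // legitimately be slow, so shorten it with care.
    static CouchbaseEnvironment customEnvironment() {
        return DefaultCouchbaseEnvironment.builder()
                .managementTimeout(75000) // milliseconds
                .build();
    }
}
```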
No, performance-wise the cluster is very responsive and queries usually return in under 50ms. What I have seen is that the health check timeout occurs when Couchbase is performing operations such as compacting a bucket or updating indexes.
I wonder if we shouldn't be using a different API for this check.
The problem with something that'll take a minute or more (either for a single call, or multiple calls that retry with a backed-off timeout) is that the caller of the health endpoint has to wait a minute or more for a response. I wouldn't be surprised if a load balancer gave up before a minute had elapsed and assumed that the application was down. To be useful, I really think we need to find something that gives a reasonable impression of Couchbase's health but also responds quickly.
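A minimal sketch of that idea, bounding a slow probe so the health endpoint always answers quickly; the `slowCouchbaseCheck()` helper is hypothetical, standing in for whatever call actually probes the cluster:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedHealthCheck {

    // Hypothetical helper standing in for the real probe of the cluster.
    static boolean slowCouchbaseCheck() {
        // ... talk to Couchbase; may block for a long time ...
        return true;
    }

    // Run the probe asynchronously and give up after timeoutMillis, so the
    // caller of the health endpoint never waits longer than that.
    static String health(long timeoutMillis) {
        CompletableFuture<Boolean> probe =
                CompletableFuture.supplyAsync(BoundedHealthCheck::slowCouchbaseCheck);
        try {
            return probe.get(timeoutMillis, TimeUnit.MILLISECONDS) ? "UP" : "DOWN";
        }
        catch (TimeoutException ex) {
            return "DOWN"; // respond promptly rather than hang the caller
        }
        catch (Exception ex) {
            return "DOWN";
        }
    }
}
```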
I agree.
Having installed a couple of Couchbase nodes, I've learned that we should be using the SDK's diagnostics report for this.
One downside of using the diagnostics report is that it considers the cluster as a whole, irrespective of which buckets the application is using and how they are replicated across the cluster. It could be that all of the nodes hosting the buckets that the application is using are up, yet a node elsewhere in the cluster is down, and we'd then consider Couchbase to be down unnecessarily. However, I don't think there's any way for us to determine that without doing something at the bucket level, and those calls can all block for an unacceptably long time. We now need to figure out how to get from where we are now to where we want to be. The move to using the diagnostics report is the direction I'd like to take, as sketched below.
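A minimal sketch of a diagnostics-based check, assuming the Couchbase Java SDK 2.x `Cluster.diagnostics()` API; this is an illustration, not necessarily the exact indicator that shipped:

```java
import com.couchbase.client.core.message.internal.DiagnosticsReport;
import com.couchbase.client.core.message.internal.EndpointHealth;
import com.couchbase.client.core.state.LifecycleState;
import com.couchbase.client.java.Cluster;

class DiagnosticsBasedCheck {

    // Couchbase is considered up when every endpoint in the
    // diagnostics report is in the CONNECTED state. The report is
    // assembled from already-known connection state, so this call
    // returns quickly instead of blocking on the cluster.
    static boolean isUp(Cluster cluster) {
        DiagnosticsReport report = cluster.diagnostics();
        return report.endpoints().stream()
                .map(EndpointHealth::state)
                .allMatch(state -> state == LifecycleState.CONNECTED);
    }
}
```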
Using the diagnostics report seems like the best option available to us.
Pinging @daschl for his insight here, but I think the diagnostics report is the right building block. The health check could also be made configurable, to let users decide whether the other types of services are relevant for their workload (see the `ServiceType` enum).
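A sketch of what that configurability could look like, filtering the diagnostics report by the SDK 2.x `ServiceType` enum; the method shape and parameter are assumptions for illustration:

```java
import java.util.Set;

import com.couchbase.client.core.message.internal.DiagnosticsReport;
import com.couchbase.client.core.service.ServiceType;
import com.couchbase.client.core.state.LifecycleState;
import com.couchbase.client.java.Cluster;

class FilteredDiagnosticsCheck {

    // Only endpoints for the configured service types count towards health,
    // e.g. EnumSet.of(ServiceType.BINARY, ServiceType.QUERY) for an
    // application that only uses key/value and N1QL.
    static boolean isUp(Cluster cluster, Set<ServiceType> relevantServices) {
        DiagnosticsReport report = cluster.diagnostics();
        return report.endpoints().stream()
                .filter(endpoint -> relevantServices.contains(endpoint.type()))
                .allMatch(endpoint -> endpoint.state() == LifecycleState.CONNECTED);
    }
}
```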
Thank you, @simonbasle. The RFC is an interesting read; in particular, its discussion of judging health for a user's specific workload caught my attention.
I'm not sure that we're any better placed than the SDK is to know whether or not things are healthy for a user's specific workload. Hopefully @daschl will have some input that proves me wrong.
@wilkinsona I think using the report is an accurate picture of the state of the SDK. One thing to consider is that even if it is "cluster scope", it pretty much affects every bucket in the same way, since the data is distributed evenly across the cluster. If you want to get a good aggregated state, I think the best shot is the following algorithm: for every …
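One plausible reading of that per-service aggregation (an assumption, not necessarily the exact algorithm proposed above) is to group endpoints by service type and require at least one connected endpoint per service:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import com.couchbase.client.core.message.internal.DiagnosticsReport;
import com.couchbase.client.core.message.internal.EndpointHealth;
import com.couchbase.client.core.service.ServiceType;
import com.couchbase.client.core.state.LifecycleState;

class AggregatedDiagnosticsCheck {

    // Group endpoints by service type, then require that each service the
    // report knows about has at least one CONNECTED endpoint. A single dead
    // endpoint does not mark the service down if a healthy one remains.
    static boolean isUp(DiagnosticsReport report) {
        Map<ServiceType, List<EndpointHealth>> byService = report.endpoints().stream()
                .collect(Collectors.groupingBy(EndpointHealth::type));
        return byService.values().stream()
                .allMatch(endpoints -> endpoints.stream()
                        .anyMatch(e -> e.state() == LifecycleState.CONNECTED));
    }
}
```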
Here's a response from a cluster with a single node that's up:
And a single node that's down:
Two nodes that are both up:
And two nodes where one is up and one is down:
Now we just need to decide how to move to this new model in 2.0.x. The current implementation of the above is a completely new health indicator. This is technically a breaking change (the details of the health response are different and the type of the auto-configured bean has changed), but I can't see a way to fix this out of the box without making some form of breaking change.
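In the meantime, a user on 2.0.x could opt into diagnostics-based behaviour by defining their own indicator bean, which causes the auto-configured one to back off. A sketch, reusing the hypothetical `DiagnosticsBasedCheck` helper from above:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.couchbase.client.java.Cluster;

@Configuration
class CouchbaseHealthConfig {

    // A user-defined bean named "couchbaseHealthIndicator" takes the place
    // of the auto-configured Couchbase health indicator.
    @Bean
    HealthIndicator couchbaseHealthIndicator(Cluster cluster) {
        return () -> DiagnosticsBasedCheck.isUp(cluster)
                ? Health.up().build()
                : Health.down().build();
    }
}
```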
Spring Boot 2.0.5.RELEASE
I see in the release notes that a similar issue was supposed to be resolved. However, I am seeing this quite frequently (every 30 to 60 minutes) in the logs: the service becomes unhealthy and then comes back healthy after about 30 seconds.