Join with context cancelation #291

Open
dimitarvdimitrov opened this issue Oct 18, 2023 · 0 comments

Description

The existing (*Memberlist).Join method can take a long time to complete in large clusters. The problem is worse when some of the addresses to join are non-existent IPs, because Join waits for the full TCPTimeout on each of them.

For example, we've observed in grafana/mimir that a full join initiated while most of the cluster members are restarting and changing IPs can take as long as 25 minutes. A node that is in the middle of a (*Memberlist).Join cannot be gracefully shut down until Join returns.

Proposal

Add a context.Context argument to (*Memberlist).Join and check it between the push/pull exchanges with each node, so that a cancelled context stops the join before the next node is contacted.

Alternatively, if you don't want to break existing clients, we can add a new method, JoinContext, which does the above.

I'm creating this issue to get feedback on the idea. After discussion I am happy to open a PR.
