[RAFT] A new node init does not complete (test timeout after waiting for 1 hour) #7448
Comments
First things first: before anyone suggests changing the SCT timeout for this operation, we need more information about when it started happening. I see this case is using tablets, so did you compare it to cases without tablets? Maybe it's a regression related to tablets (or something else). We need to see which build it started from and what the actual root cause is. In this case we have 100 GB of data; I don't see why adding a new node should take a whole hour.
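For triage, one way to see where the hour goes is to follow the joining node's bootstrap and streaming phases (a minimal sketch, assuming a systemd-based install; the exact log wording varies by Scylla version):

$ journalctl -u scylla-server | grep -iE 'bootstrap|stream'   # rough timeline of the join phases
$ nodetool netstats                                           # streaming progress while the node is joining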
This is the dashboard of those operations over the last 90 days; I think most of them map to those tablets runs.
Which node was it?
Thanks @fruch, some outputs:
Is it a known issue? Should I check whether something changed in SCT, or report a Scylla issue?
Cross-check whether those problematic runs have tablets enabled; and yes, you should raise a Scylla issue for it.
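As a sketch of that cross-check (assuming shell access to a cluster node, and assuming the feature is driven by the `enable_tablets` key in scylla.yaml, which may differ between versions):

$ grep enable_tablets /etc/scylla/scylla.yaml   # "enable_tablets: true" when the run has tablets on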
The issue reproduced in the exact same scenario, except without tablets.
from 18802:
@aleksbykov, please mention your SCT fix PR for this issue.
@fruch, isn't this message wrong, and misleading in ES as well:
What's misleading here, exactly?
Packages
Scylla version:
5.5.0~dev-20240515.7b41bb601c53
with build-id 411dbd445918e9235d0bd963183ec77729d09e12
Kernel Version:
5.15.0-1060-aws
Issue description
When node bootstrap is interrupted with Raft enabled, the node becomes banned and cannot join the cluster until a cleanup is run (removing its host ID).
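For reference, a minimal sketch of that cleanup using standard nodetool commands (the host ID placeholder below is hypothetical; take the real one from the first command's output, assuming the failed node still appears in the topology):

$ nodetool status                                # note the Host ID of the banned node
$ nodetool removenode <host-id-of-banned-node>   # remove it so a fresh bootstrap can join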
Impact
How frequently does it reproduce?
Installation details
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-03e1418be972f5f58
(aws: undefined_region)
Test:
longevity-100gb-4h-test
Test id:
d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3
Test name:
scylla-master/tablets/longevity-100gb-4h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3
$ hydra investigate show-logs d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3
Logs:
Jenkins job URL
Argus