
[RAFT] A new node init does not complete (test timeout after waiting for 1 hour) #7448

yarongilor opened this issue May 19, 2024 · 11 comments


yarongilor commented May 19, 2024

Packages

Scylla version: 5.5.0~dev-20240515.7b41bb601c53 with build-id 411dbd445918e9235d0bd963183ec77729d09e12

Kernel Version: 5.15.0-1060-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and the steps it took to reproduce it.
When node bootstrap is interrupted with Raft enabled, the node becomes banned and cannot join the cluster until a cleanup is run (removing its host ID).
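
For illustration, a minimal sketch of the cleanup flow described above (this is not the actual SCT code; the node objects and helper names are assumptions): the banned host ID is removed via a live node, the aborted node's data is wiped, and the bootstrap is retried.

```python
# Sketch only: assumes SCT-like node objects exposing `run_nodetool`
# and a `remoter` that can run shell commands; names are hypothetical.

def clean_banned_node_and_retry_bootstrap(live_node, new_node, banned_host_id: str):
    # Remove the banned identity from the cluster topology, as seen by a live node.
    live_node.run_nodetool(f"removenode {banned_host_id}")

    # Wipe the aborted node's data so it can bootstrap again with a fresh identity.
    new_node.remoter.run("sudo rm -rf /var/lib/scylla/data /var/lib/scylla/commitlog")

    # Restart Scylla on the new node to retry the bootstrap from scratch.
    new_node.remoter.run("sudo systemctl restart scylla-server")
```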

2024-05-16 10:28:33.282: (SoftTimeoutEvent Severity.ERROR) period_type=one-time event_id=875c02fd-91e3-4ed8-b64c-852b78b8ef47 during_nemesis=BootstrapStreamingError, source=SoftTimeout message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 423, in run
    self.disrupt()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6306, in disrupt
    self.call_next_nemesis()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1916, in call_next_nemesis
    self.execute_disrupt_method(disrupt_method=next(self.disruptions_cycle))
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1842, in execute_disrupt_method
    disrupt_method()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4973, in disrupt_bootstrap_streaming_error
    bootstrapabortmanager.clean_and_restart_bootstrap_after_abort()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/common.py", line 145, in clean_and_restart_bootstrap_after_abort
    with adaptive_timeout(operation=Operations.NEW_NODE, node=self.verification_node, timeout=3600) as bootstrap_timeout:
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/adaptive_timeouts/__init__.py", line 112, in adaptive_timeout
    SoftTimeoutEvent(operation=operation.name, soft_timeout=timeout, duration=duration).publish_or_dump()
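
For context, the SoftTimeoutEvent above is published from the adaptive_timeout context manager when the measured duration exceeds the configured soft timeout. A minimal sketch of that mechanism, assuming a simplified stand-in for the real SCT event machinery (publish_soft_timeout_event below is hypothetical), looks roughly like this:

```python
# Sketch of a soft-timeout context manager in the spirit of `adaptive_timeout`.
# Not the real SCT implementation; `publish_soft_timeout_event` is a hypothetical
# stand-in for SoftTimeoutEvent(...).publish_or_dump().
import time
from contextlib import contextmanager


def publish_soft_timeout_event(operation: str, soft_timeout: float, duration: float) -> None:
    print(f"operation '{operation}' took {duration}s and soft-timeout was set to {soft_timeout}s")


@contextmanager
def soft_timeout(operation: str, timeout: float):
    start = time.time()
    try:
        # The caller can also use the yielded value as a hard timeout.
        yield timeout
    finally:
        duration = time.time() - start
        if duration > timeout:
            # A soft timeout only reports the breach; it does not abort the operation.
            publish_soft_timeout_event(operation, timeout, duration)
```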

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-100gb-4h-master-db-node-d6d9cb1d-9 (63.32.105.88 | 10.4.8.149) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-8 (18.201.71.67 | 10.4.8.4) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-7 (52.50.86.221 | 10.4.9.156) (shards: -1)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-6 (54.217.170.24 | 10.4.10.182) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-5 (34.243.191.244 | 10.4.10.170) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-4 (52.213.250.163 | 10.4.11.144) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-3 (3.248.233.136 | 10.4.9.127) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-2 (3.252.90.13 | 10.4.11.80) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-10 (3.253.29.31 | 10.4.11.207) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-1 (34.253.40.12 | 10.4.11.45) (shards: 14)

OS / Image: ami-03e1418be972f5f58 (aws: undefined_region)

Test: longevity-100gb-4h-test
Test id: d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3
Test name: scylla-master/tablets/longevity-100gb-4h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3

Logs:

Jenkins job URL
Argus

@yarongilor yarongilor self-assigned this May 19, 2024
@yarongilor

@fruch, @bhalevy, any suggestions on how to address this?
The options are either to increase the add-node timeout by a few minutes, or some other alternative approach.


fruch commented May 19, 2024

@fruch, @bhalevy, any suggestions on how to address this? The options are either to increase the add-node timeout by a few minutes, or some other alternative approach.

First things first: before anyone can say we need to change the SCT timing for this operation, we need more information about when it's happening. I see this case is using tablets, so did you compare it to cases without tablets? Maybe it's some regression related to tablets (or something else)?

We need to see when it started and what the actual root cause is. In this case we have 100 GB of data, and I don't see why adding a new node would take a whole hour.


fruch commented May 19, 2024

this is the dashboard of those operations in the last 90 days:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/3194a650-15c9-11ef-81c7-3986d18dafd5

I think most of them map to those tablet runs.


bhalevy commented May 19, 2024

Which node was it?
What does clean_and_restart_bootstrap_after_abort do exactly?
Maybe it has pending migrations in progress that linger after the abort if the new node restarts with a new identity, and we should remove the previous identity with nodetool removenode.
Cc @tgrabiec


yarongilor commented May 21, 2024

this is the dashboard of those operations in the last 90 days: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/3194a650-15c9-11ef-81c7-3986d18dafd5

I think most of them map to those tablet runs.

Thanks @fruch, some outputs:
Basically, the statistics show that the average add-node duration for this test is 200-400 seconds.
There seems to be a significant degradation in runs with tablets during roughly the last month: 10 runs exceeded a 3600 s duration:

operation | duration | scylla_version
-- | -- | --
[past versions worst results]
NEW_NODE | 1317.537 | 5.3.0~dev-0.20230405.f5de0582c8f8 with build-id 2fb65b5e36aee672ae31718200203f2f9a78eb6b
NEW_NODE | 1212.539 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1386.438 | 2023.2.0~dev-0.20230404.85ad4482e7af with build-id c0ff5a80b9d10766ba9d6d681fc9893503c8f3f9
NEW_NODE | 1027.701 | 5.3.0~dev-0.20230405.f5de0582c8f8 with build-id 2fb65b5e36aee672ae31718200203f2f9a78eb6b
NEW_NODE | 1434.919 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1501.489 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1620.543 | 5.3.0~dev-0.20230415.1da02706ddb8 with build-id f7ac5cd90e63ace5065c583d6d1d9c381f39b5c2

[last month worst results]
NEW_NODE | 3620.362 | 5.5.0~dev-0.20240419.a5dae74aee4b with build-id 13e359437a7035b8b40bcd744a70b6abddbc491b
NEW_NODE | 3626.187 | 5.5.0~dev-0.20240321.900b56b117a1 with build-id b3a288bc50af2ce9c89155e61be692df05c17bf1
NEW_NODE | 3620.474 | 2024.2.0~dev-0.20240311.e59273035524 with build-id 9fb40ab71eca07f7202bb20528de3e0b83eaa0fa
NEW_NODE | 3620.25 | 5.5.0~dev-0.20240324.6bdb456fadfe with build-id 61f172ff915d5dd702951a0b7c0e83ac33712635
NEW_NODE | 3624.812 | 5.5.0~dev-0.20240329.885cb2af07b8 with build-id 4d1fc3fe8868b3d00a42f2b0b51f9953e9fa7346
NEW_NODE | 3624.183 | 5.5.0~dev-0.20240405.0c74c2c12fa4 with build-id ff77798ae700ea4f0d1bc44532624b330b89009a
NEW_NODE | 3617.742 | 5.5.0~dev-0.20240411.0be61e51d329 with build-id 54e018bb43eb2ae707aac127097747cd39532532
NEW_NODE | 3620.277 | 5.5.0~dev-0.20240501.af5674211dd4 with build-id aade02eeb28f730331d6170eaadf364591e0a779
NEW_NODE | 3619.354 | 5.5.0~dev-0.20240515.7b41bb601c53 with build-id 411dbd445918e9235d0bd963183ec77729d09e12
NEW_NODE | 3623.813 | 5.5.0~dev-0.20240517.db70e8dd5f6d with build-id 0a72505be9d1e39577898f683b53c462d3844ad8

Is this a known issue? Should I check whether something changed in SCT, or report a Scylla issue?


fruch commented May 21, 2024

Thanks @fruch, some outputs: basically, the statistics show that the average add-node duration for this test is 200-400 seconds. There seems to be a significant degradation in runs with tablets during roughly the last month: 10 runs exceeded a 3600 s duration.
Is this a known issue? Should I check whether something changed in SCT, or report a Scylla issue?

Cross-check whether those problematic runs have tablets enabled; if so, then yes, a Scylla issue should be raised for it.

@yarongilor

Cross-check whether those problematic runs have tablets enabled; if so, then yes, a Scylla issue should be raised for it.

The issue is reproduced in exactly the same scenario, only without tablets.
opened scylladb/scylladb#18802

@yarongilor

from 18802:

Opened documentation issue: #18850

Other than that, the SCT nemeses will have to be adjusted to account for node ban. @yarongilor @fruch

We similarly adjusted dtests, see for example: #17385, scylladb/scylla-dtest#3959

@aleksbykov, please mention your SCT fix PR for this issue.

@yarongilor

@fruch, isn't this message wrong, and misleading in ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s? Do we want to fix this as well, so that ES is not falsely reported for it?


fruch commented May 27, 2024

@fruch, isn't this message wrong, and misleading in ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s? Do we want to fix this as well, so that ES is not falsely reported for it?

What's misleading here, exactly?

@yarongilor

@fruch, isn't this message wrong, and misleading in ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s? Do we want to fix this as well, so that ES is not falsely reported for it?

What's misleading here, exactly?

operation 'NEW_NODE' took 3619.353437900543s is not really accurate (or is meaningless), since the bootstrap never finished.
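
As an illustration of the kind of reporting change being discussed (a sketch only, not the actual SoftTimeoutEvent code; the function and its completed flag are hypothetical), the message could distinguish an operation that finished slowly from one that never completed within the measured window:

```python
# Hypothetical sketch: distinguish "completed slowly" from "never finished".
def format_soft_timeout_message(operation: str, soft_timeout: float,
                                duration: float, completed: bool) -> str:
    if completed:
        return (f"operation '{operation}' took {duration:.1f}s "
                f"and soft-timeout was set to {soft_timeout}s")
    return (f"operation '{operation}' did not complete within the {soft_timeout}s "
            f"soft-timeout (measured {duration:.1f}s before giving up)")
```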

@fruch fruch changed the title SoftTimeout message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s A new node init does not complete (test timeout after waiting for 1 hour) May 27, 2024
@yarongilor yarongilor changed the title A new node init does not complete (test timeout after waiting for 1 hour) [RAFT] A new node init does not complete (test timeout after waiting for 1 hour) May 29, 2024