
[RAFT] A new node init does not complete (test timeout after waiting for 1 hour) #7448

yarongilor opened this issue May 19, 2024 · 11 comments


yarongilor commented May 19, 2024

Packages

Scylla version: 5.5.0~dev-20240515.7b41bb601c53 with build-id 411dbd445918e9235d0bd963183ec77729d09e12

Kernel Version: 5.15.0-1060-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and the steps it took to reproduce it.
When node bootstrap is interrupted with Raft enabled, the node becomes banned and cannot join the cluster until a cleanup is run (removing its host ID).
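
For illustration, a minimal sketch of the cleanup flow described above (this is not the actual SCT code; the node objects and helper names are assumptions): the banned host ID is removed via a live node, the aborted node's data is wiped, and the bootstrap is retried.

```python
# Sketch only: assumes SCT-like node objects exposing `run_nodetool`
# and a `remoter` that can run shell commands; names are hypothetical.

def clean_banned_node_and_retry_bootstrap(live_node, new_node, banned_host_id: str):
    # Remove the banned identity from the cluster topology, as seen by a live node.
    live_node.run_nodetool(f"removenode {banned_host_id}")

    # Wipe the aborted node's data so it can bootstrap again with a fresh identity.
    new_node.remoter.run("sudo rm -rf /var/lib/scylla/data /var/lib/scylla/commitlog")

    # Restart Scylla on the new node to retry the bootstrap from scratch.
    new_node.remoter.run("sudo systemctl restart scylla-server")
```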

2024-05-16 10:28:33.282: (SoftTimeoutEvent Severity.ERROR) period_type=one-time event_id=875c02fd-91e3-4ed8-b64c-852b78b8ef47 during_nemesis=BootstrapStreamingError, source=SoftTimeout message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 423, in run
    self.disrupt()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6306, in disrupt
    self.call_next_nemesis()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1916, in call_next_nemesis
    self.execute_disrupt_method(disrupt_method=next(self.disruptions_cycle))
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1842, in execute_disrupt_method
    disrupt_method()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4973, in disrupt_bootstrap_streaming_error
    bootstrapabortmanager.clean_and_restart_bootstrap_after_abort()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/common.py", line 145, in clean_and_restart_bootstrap_after_abort
    with adaptive_timeout(operation=Operations.NEW_NODE, node=self.verification_node, timeout=3600) as bootstrap_timeout:
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/adaptive_timeouts/__init__.py", line 112, in adaptive_timeout
    SoftTimeoutEvent(operation=operation.name, soft_timeout=timeout, duration=duration).publish_or_dump()
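
For context, the SoftTimeoutEvent above is published from the adaptive_timeout context manager when the measured duration exceeds the configured soft timeout. A minimal sketch of that mechanism, assuming a simplified stand-in for the real SCT event machinery (publish_soft_timeout_event below is hypothetical), looks roughly like this:

```python
# Sketch of a soft-timeout context manager in the spirit of `adaptive_timeout`.
# Not the real SCT implementation; `publish_soft_timeout_event` is a hypothetical
# stand-in for SoftTimeoutEvent(...).publish_or_dump().
import time
from contextlib import contextmanager


def publish_soft_timeout_event(operation: str, soft_timeout: float, duration: float) -> None:
    print(f"operation '{operation}' took {duration}s and soft-timeout was set to {soft_timeout}s")


@contextmanager
def soft_timeout(operation: str, timeout: float):
    start = time.time()
    try:
        # The caller can also use the yielded value as a hard timeout.
        yield timeout
    finally:
        duration = time.time() - start
        if duration > timeout:
            # A soft timeout only reports the breach; it does not abort the operation.
            publish_soft_timeout_event(operation, timeout, duration)
```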

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-100gb-4h-master-db-node-d6d9cb1d-9 (63.32.105.88 | 10.4.8.149) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-8 (18.201.71.67 | 10.4.8.4) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-7 (52.50.86.221 | 10.4.9.156) (shards: -1)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-6 (54.217.170.24 | 10.4.10.182) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-5 (34.243.191.244 | 10.4.10.170) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-4 (52.213.250.163 | 10.4.11.144) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-3 (3.248.233.136 | 10.4.9.127) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-2 (3.252.90.13 | 10.4.11.80) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-10 (3.253.29.31 | 10.4.11.207) (shards: 14)
  • longevity-100gb-4h-master-db-node-d6d9cb1d-1 (34.253.40.12 | 10.4.11.45) (shards: 14)

OS / Image: ami-03e1418be972f5f58 (aws: undefined_region)

Test: longevity-100gb-4h-test
Test id: d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3
Test name: scylla-master/tablets/longevity-100gb-4h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3

Logs:

Jenkins job URL
Argus

@yarongilor yarongilor self-assigned this May 19, 2024
@yarongilor

@fruch, @bhalevy, any suggestions on how to address this?
The options are either to increase the add-node timeout by a few minutes, or some other alternative approach.


fruch commented May 19, 2024

@fruch, @bhalevy, any suggestions on how to address this? The options are either to increase the add-node timeout by a few minutes, or some other alternative approach.

First things first: before anyone can say we need to change the SCT timing for this operation, we need more information about when it's happening. I see this case is using tablets, so did you compare it to cases without tablets? Maybe it's some regression related to tablets (or something else)?

We need to see when it started and what the actual root cause is. In this case we have 100 GB of data, and I don't see why adding a new node would take a whole hour.


fruch commented May 19, 2024

this is the dashboard of those operations in the last 90 days:
https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/3194a650-15c9-11ef-81c7-3986d18dafd5

I think most of them map to those tablet runs.


bhalevy commented May 19, 2024

Which node was it?
What does clean_and_restart_bootstrap_after_abort do exactly?
Maybe it has pending migrations in progress that linger after the abort if the new node restarts with a new identity, and we should remove the previous identity with nodetool removenode.
Cc @tgrabiec


yarongilor commented May 21, 2024

this is the dashboard of those operations in the last 90 days: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/3194a650-15c9-11ef-81c7-3986d18dafd5

I think most of them map to those tablet runs.

Thanks @fruch, some outputs:
Basically, the statistics show that the average add-node duration for this test is 200-400 seconds.
There seems to be a significant degradation in runs with tablets during roughly the last month: 10 runs exceeded a 3600 s duration:

operation | duration | scylla_version
-- | -- | --
[past versions worst results]
NEW_NODE | 1317.537 | 5.3.0~dev-0.20230405.f5de0582c8f8 with build-id 2fb65b5e36aee672ae31718200203f2f9a78eb6b
NEW_NODE | 1212.539 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1386.438 | 2023.2.0~dev-0.20230404.85ad4482e7af with build-id c0ff5a80b9d10766ba9d6d681fc9893503c8f3f9
NEW_NODE | 1027.701 | 5.3.0~dev-0.20230405.f5de0582c8f8 with build-id 2fb65b5e36aee672ae31718200203f2f9a78eb6b
NEW_NODE | 1434.919 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1501.489 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1620.543 | 5.3.0~dev-0.20230415.1da02706ddb8 with build-id f7ac5cd90e63ace5065c583d6d1d9c381f39b5c2

[last month worst results]
NEW_NODE | 3620.362 | 5.5.0~dev-0.20240419.a5dae74aee4b with build-id 13e359437a7035b8b40bcd744a70b6abddbc491b
NEW_NODE | 3626.187 | 5.5.0~dev-0.20240321.900b56b117a1 with build-id b3a288bc50af2ce9c89155e61be692df05c17bf1
NEW_NODE | 3620.474 | 2024.2.0~dev-0.20240311.e59273035524 with build-id 9fb40ab71eca07f7202bb20528de3e0b83eaa0fa
NEW_NODE | 3620.25 | 5.5.0~dev-0.20240324.6bdb456fadfe with build-id 61f172ff915d5dd702951a0b7c0e83ac33712635
NEW_NODE | 3624.812 | 5.5.0~dev-0.20240329.885cb2af07b8 with build-id 4d1fc3fe8868b3d00a42f2b0b51f9953e9fa7346
NEW_NODE | 3624.183 | 5.5.0~dev-0.20240405.0c74c2c12fa4 with build-id ff77798ae700ea4f0d1bc44532624b330b89009a
NEW_NODE | 3617.742 | 5.5.0~dev-0.20240411.0be61e51d329 with build-id 54e018bb43eb2ae707aac127097747cd39532532
NEW_NODE | 3620.277 | 5.5.0~dev-0.20240501.af5674211dd4 with build-id aade02eeb28f730331d6170eaadf364591e0a779
NEW_NODE | 3619.354 | 5.5.0~dev-0.20240515.7b41bb601c53 with build-id 411dbd445918e9235d0bd963183ec77729d09e12
NEW_NODE | 3623.813 | 5.5.0~dev-0.20240517.db70e8dd5f6d with build-id 0a72505be9d1e39577898f683b53c462d3844ad8

Is this a known issue? Should I check whether something changed in SCT, or report a Scylla issue?


fruch commented May 21, 2024

Thanks @fruch, some outputs: basically, the statistics show that the average add-node duration for this test is 200-400 seconds. There seems to be a significant degradation in runs with tablets during roughly the last month: 10 runs exceeded a 3600 s duration.
Is this a known issue? Should I check whether something changed in SCT, or report a Scylla issue?

Cross-check whether those problematic runs have tablets enabled; if so, then yes, a Scylla issue should be raised for it.

@yarongilor

Cross-check whether those problematic runs have tablets enabled; if so, then yes, a Scylla issue should be raised for it.

The issue is reproduced in exactly the same scenario, only without tablets.
opened scylladb/scylladb#18802

@yarongilor

from 18802:

Opened documentation issue: #18850

Other than that, the SCT nemeses will have to be adjusted to account for node ban. @yarongilor @fruch

We similarly adjusted dtests, see for example: #17385, scylladb/scylla-dtest#3959

@aleksbykov, please mention your SCT fix PR for this issue.

@yarongilor

@fruch, isn't this message wrong, and misleading in ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s? Do we want to fix this as well, so that ES is not falsely reported for it?


fruch commented May 27, 2024

@fruch, isn't this message wrong, and misleading in ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s? Do we want to fix this as well, so that ES is not falsely reported for it?

What's misleading here, exactly?

@yarongilor

@fruch, isn't this message wrong, and misleading in ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s? Do we want to fix this as well, so that ES is not falsely reported for it?

What's misleading here, exactly?

operation 'NEW_NODE' took 3619.353437900543s is not really accurate (or is meaningless), since the bootstrap never finished.
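
As an illustration of the kind of reporting change being discussed (a sketch only, not the actual SoftTimeoutEvent code; the function and its completed flag are hypothetical), the message could distinguish an operation that finished slowly from one that never completed within the measured window:

```python
# Hypothetical sketch: distinguish "completed slowly" from "never finished".
def format_soft_timeout_message(operation: str, soft_timeout: float,
                                duration: float, completed: bool) -> str:
    if completed:
        return (f"operation '{operation}' took {duration:.1f}s "
                f"and soft-timeout was set to {soft_timeout}s")
    return (f"operation '{operation}' did not complete within the {soft_timeout}s "
            f"soft-timeout (measured {duration:.1f}s before giving up)")
```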

@fruch fruch changed the title SoftTimeout message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s A new node init does not complete (test timeout after waiting for 1 hour) May 27, 2024
@yarongilor yarongilor changed the title A new node init does not complete (test timeout after waiting for 1 hour) [RAFT] A new node init does not complete (test timeout after waiting for 1 hour) May 29, 2024