NodeRestartWithResharding and SoftRebootNodeMonkey nemeses fail on Docker backend, being unable to successfully finish a node reboot disruption #7330

dimakr · 2024-04-10T14:22:50Z

The NodeRestartWithResharding and SoftRebootNodeMonkey nemeses fail on the Docker backend while waiting for a node to come back after a reboot/restart disruption.

The error during NodeRestartWithResharding nemesis:

2024-04-08 18:39:07.166: (DisruptionEvent Severity.ERROR) period_type=end event_id=71d16ff5-bb7e-44de-9259-6953d87cac2c duration=2m24s: nemesis_name=RestartWithResharding target_node=Node longevity-1gb-1h-nemesis-longevit-db-node-407ce057-1 [172.17.0.3 | 172.17.0.3] (seed: True) errors=Resharding has not been started (murmur3_partitioner_ignore_msb_bits=15) Check the log for the details
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 324, in inner_func
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1003, in disrupt_restart_with_resharding
    self.target_node.restart_node_with_resharding(
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2478, in restart_node_with_resharding
    raise Exception(f'Resharding has not been started '
Exception: Resharding has not been started (murmur3_partitioner_ignore_msb_bits=15) Check the log for the details

The error during SoftRebootNodeMonkey nemesis:

2024-04-09 01:35:04.396: (DisruptionEvent Severity.ERROR) period_type=end event_id=04e343e9-e83e-4a05-8f96-612bc9f84325 duration=45m10s: nemesis_name=SoftRebootNode target_node=Node longevity-1gb-1h-nemesis-longevit-db-node-a5907dd2-0 [172.17.0.2 | 172.17.0.2] (seed: True) errors=Wait for: uptime_changed: timeout - 2700 seconds - expired
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 70, in wait_for
    res = retry(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
    raise self
tenacity.RetryError: RetryError[<Future at 0x7f8cdc280940 state=finished returned bool>]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 927, in disrupt_soft_reboot_node
    self.reboot_node(target_node=self.target_node, hard=False)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 324, in inner_func
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3740, in reboot_node
    target_node.reboot(hard=hard, verify_ssh=verify_ssh)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 1011, in reboot
    wait.wait_for(func=uptime_changed, step=10, timeout=60*45, throw_exc=True)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 86, in wait_for
    raise raising_exc from ex
sdcm.exceptions.WaitForTimeoutError: Wait for: uptime_changed: timeout - 2700 seconds - expired

Installation details

SCT Version: master
Scylla version: 2024.1.2-0.20240228.2c85a811d0be
Test: longevity-5gb-1h-nemesis
Test config: configurations/nemesis/additional_configs/docker_backend_local.yaml

Logs

SoftRebootNodeMonkey sct log: SoftRebootNodeMonkey.sct.log.tar.gz
NodeRestartWithResharding sct log: NodeRestartWithResharding.sct.log.tar.gz

SoftRebootNodeMonkey Jenkins job url
NodeRestartWithResharding Jenkins job url


soyacz · 2024-04-11T14:01:38Z

I don't think this nemesis is possible with --smp 1. For this one we may need to increase it, or change the way we do resharding: change smp instead of changing murmur3_partitioner_ignore_msb_bits.

soyacz · 2024-04-18T06:21:21Z

Now I see the error: sdcm.exceptions.WaitForTimeoutError: Wait for: uptime_changed: timeout - 2700 seconds - expired. On the Docker backend, the uptime command shows the host's uptime, so we need to reimplement this method for the Docker backend to take the value from, e.g., the docker ps status.
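A minimal sketch of one way such a check could look, written as an illustration and not as the SCT implementation. Instead of parsing the human-readable docker ps status column, it reads the container start time with docker inspect and the State.StartedAt field, which is easier to compare. The helper names (parse_started_at, container_started_at, container_restarted) are hypothetical, and the code assumes the docker CLI is available on the machine running the test.

```python
import subprocess
from datetime import datetime, timezone


def parse_started_at(value: str) -> datetime:
    """Parse Docker's RFC 3339 `State.StartedAt` string into an aware UTC datetime.

    Assumes the value ends with 'Z' or a +HH:MM / -HH:MM offset. Docker may
    emit up to nanosecond precision, so the fraction is cut to microseconds.
    """
    value = value.strip()
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    main, offset = value[:-6], value[-6:]
    if "." in main:
        base, frac = main.split(".", 1)
        # datetime.fromisoformat accepts at most 6 fractional digits.
        main = f"{base}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(main + offset).astimezone(timezone.utc)


def container_started_at(container: str) -> datetime:
    """Return the container's start time (hypothetical helper, needs the docker CLI)."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.StartedAt}}", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_started_at(result.stdout)


def container_restarted(before: datetime, after: datetime) -> bool:
    """A restart moves StartedAt forward, unlike the host uptime seen inside the container."""
    return after > before
```

A reboot check built on this would record container_started_at(name) before the disruption and poll until container_restarted(before, container_started_at(name)) becomes true, in place of comparing uptime output.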

github-actions bot assigned dimakr Apr 10, 2024

dimakr removed their assignment Apr 10, 2024

roydahan added the docker-backend label May 27, 2024
