NodeRestartWithResharding and SoftRebootNodeMonkey nemeses fail on Docker backend, being unable to successfully finish a node reboot disruption #7330

Open
dimakr opened this issue Apr 10, 2024 · 2 comments

Comments

@dimakr
Contributor

dimakr commented Apr 10, 2024

The NodeRestartWithResharding and SoftRebootNodeMonkey nemeses fail on the Docker backend while waiting for a node to come back up after a reboot/restart disruption.

The error during NodeRestartWithResharding nemesis:

2024-04-08 18:39:07.166: (DisruptionEvent Severity.ERROR) period_type=end event_id=71d16ff5-bb7e-44de-9259-6953d87cac2c duration=2m24s: nemesis_name=RestartWithResharding target_node=Node longevity-1gb-1h-nemesis-longevit-db-node-407ce057-1 [172.17.0.3 | 172.17.0.3] (seed: True) errors=Resharding has not been started (murmur3_partitioner_ignore_msb_bits=15) Check the log for the details
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 324, in inner_func
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1003, in disrupt_restart_with_resharding
    self.target_node.restart_node_with_resharding(
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2478, in restart_node_with_resharding
    raise Exception(f'Resharding has not been started '
Exception: Resharding has not been started (murmur3_partitioner_ignore_msb_bits=15) Check the log for the details

The error during SoftRebootNodeMonkey nemesis:

2024-04-09 01:35:04.396: (DisruptionEvent Severity.ERROR) period_type=end event_id=04e343e9-e83e-4a05-8f96-612bc9f84325 duration=45m10s: nemesis_name=SoftRebootNode target_node=Node longevity-1gb-1h-nemesis-longevit-db-node-a5907dd2-0 [172.17.0.2 | 172.17.0.2] (seed: True) errors=Wait for: uptime_changed: timeout - 2700 seconds - expired
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 70, in wait_for
    res = retry(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
    raise self
tenacity.RetryError: RetryError[<Future at 0x7f8cdc280940 state=finished returned bool>]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 927, in disrupt_soft_reboot_node
    self.reboot_node(target_node=self.target_node, hard=False)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 324, in inner_func
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3740, in reboot_node
    target_node.reboot(hard=hard, verify_ssh=verify_ssh)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 1011, in reboot
    wait.wait_for(func=uptime_changed, step=10, timeout=60*45, throw_exc=True)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 86, in wait_for
    raise raising_exc from ex
sdcm.exceptions.WaitForTimeoutError: Wait for: uptime_changed: timeout - 2700 seconds - expired

Installation details

SCT Version: master
Scylla version: 2024.1.2-0.20240228.2c85a811d0be
Test: longevity-5gb-1h-nemesis
Test config: configurations/nemesis/additional_configs/docker_backend_local.yaml

Logs

SoftRebootNodeMonkey Jenkins job url
NodeRestartWithResharding Jenkins job url

@dimakr dimakr removed their assignment Apr 10, 2024
@soyacz
Contributor

soyacz commented Apr 11, 2024

I don't think this nemesis is possible with --smp 1. For this one we probably need to increase the smp value (or change the way we trigger resharding: change smp instead of changing murmur3_partitioner_ignore_msb_bits).
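
To illustrate the point about --smp 1, here is a simplified Python model of token-to-shard mapping. It is only a sketch, assuming the commonly described scheme (bias the token to unsigned, shift out the ignored most-significant bits, scale by the shard count); the function name and constants are illustrative and are not taken from SCT or from the Scylla sources. Under that assumption, a single shard owns every token regardless of murmur3_partitioner_ignore_msb_bits, so changing the setting leaves nothing to redistribute.

```python
MASK64 = (1 << 64) - 1


def shard_of(token: int, shard_count: int, ignore_msb_bits: int) -> int:
    """Simplified model: map a signed 64-bit token to a shard index."""
    # Bias the signed token into the unsigned 64-bit range.
    biased = (token + (1 << 63)) & MASK64
    # Drop the most-significant bits that the partitioner is told to ignore.
    shifted = (biased << ignore_msb_bits) & MASK64
    # Scale the remaining 64-bit value into the range [0, shard_count).
    return (shifted * shard_count) >> 64


tokens = [-(1 << 63), -12345, 0, 987654321, (1 << 63) - 1]

# With one shard, the result is 0 for any ignore_msb_bits value.
single_shard = {shard_of(t, 1, bits) for t in tokens for bits in (0, 12, 15)}

# With several shards, the same token can land on a different shard
# once ignore_msb_bits changes.
before = shard_of(0, 4, 0)
after = shard_of(0, 4, 12)
```

In this model, `single_shard` only ever contains shard 0, while `before` and `after` differ for four shards, which matches the suggestion above to raise smp (or to change smp itself) when testing resharding.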

@soyacz
Contributor

soyacz commented Apr 18, 2024

Now I see the error sdcm.exceptions.WaitForTimeoutError: Wait for: uptime_changed: timeout - 2700 seconds - expired. On the Docker backend, the uptime command inside the container reports the host's uptime, so the value never changes when the container restarts. We need to reimplement this method for the Docker backend so that it takes the value from, e.g., the docker ps status.
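
A minimal sketch of what such a check could look like, assuming the container's start time is a good stand-in for "uptime" on the Docker backend. It reads State.StartedAt via docker inspect rather than parsing the human-readable docker ps status column. The helper names (parse_started_at, container_started_at, container_restarted) are hypothetical and do not exist in SCT.

```python
import re
import subprocess
from datetime import datetime, timezone


def parse_started_at(value: str) -> datetime:
    """Parse a Docker timestamp such as 2024-04-09T00:50:01.123456789Z.

    Docker reports nanoseconds; Python datetimes keep microseconds,
    so the fractional part is truncated to six digits.
    """
    match = re.fullmatch(
        r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", value.strip()
    )
    if match is None:
        raise ValueError(f"unexpected timestamp: {value!r}")
    base, fraction = match.groups()
    micros = int((fraction or "0")[:6].ljust(6, "0"))
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def container_started_at(container: str) -> datetime:
    """Return the container's last start time (hypothetical helper)."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.StartedAt}}", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_started_at(result.stdout)


def container_restarted(before: datetime, after: datetime) -> bool:
    """True when the start time moved forward, i.e. the container restarted."""
    return after > before
```

The idea would be to record container_started_at(name) before the reboot and poll until container_restarted(before, container_started_at(name)) is true, in place of comparing uptime output. Whether this fits the existing reboot() flow depends on how the Docker backend actually restarts the node.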
