
Handle connection resets in the status server more gracefully #5881

Open · wants to merge 1 commit into master
Conversation

@pevogam (Contributor) commented Mar 14, 2024

Connection resets due to network instability can lead to the status server not catching a test status, to an asyncio error like

2024-03-08 03:08:44,053 asyncio base_events      L1744 ERROR| Task exception was never retrieved
future: <Task finished name='Task-2038' coro=<StatusServer.cb() done, defined at /usr/lib/python3.10/site-packages/avocado/core/status/server.py:51> exception=ConnectionResetError(104, 'Connection reset by peer')>
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/avocado/core/status/server.py", line 53, in cb
    raw_message = await reader.readline()
  File "/usr/lib64/python3.10/asyncio/streams.py", line 525, in readline
    line = await self.readuntil(sep)
  File "/usr/lib64/python3.10/asyncio/streams.py", line 617, in readuntil
    await self._wait_for_data('readuntil')
  File "/usr/lib64/python3.10/asyncio/streams.py", line 502, in _wait_for_data
    await self._waiter
  File "/usr/lib64/python3.10/asyncio/selector_events.py", line 854, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

and, worse yet, to test tasks hanging indefinitely without the job ever completing properly. This was mostly observed with LXC and remote spawner isolation, where the isolated task process completes but the task on the side of the task machine remains unfinished.


Signed-off-by: Plamen Dimitrov <plamen.dimitrov@intra2net.com>
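
The one-commit diff itself is not reproduced on this page, so purely as an illustration and not the actual patch: a minimal sketch of one way to handle a reset in a callback like StatusServer.cb(), assuming it reads newline-delimited messages from an asyncio StreamReader (as the traceback above suggests); the logger and loop structure are illustrative.

import asyncio
import logging

LOG = logging.getLogger(__name__)


async def cb(reader, writer):
    """Sketch of a status server callback that tolerates peer resets."""
    while True:
        try:
            raw_message = await reader.readline()
        except ConnectionResetError as details:
            # A reset connection is logged and treated as end-of-stream
            # instead of surfacing as an unretrieved asyncio task exception.
            LOG.warning("Status connection reset by peer: %s", details)
            break
        if not raw_message:
            break
        # ... decode and forward the status message as usual ...
    writer.close()

With handling of this kind, a reset shows up as a warning in the job log rather than as an "exception was never retrieved" error.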
@pevogam (Contributor, Author) commented Mar 14, 2024

@richtja @clebergnu Some connection resets might also be caused by overloaded IO (e.g. the on-host network and UNIX socket cases of status server communication). I have also observed some tests spending 10-20 minutes logging a single message, which means they are using regular blocking logging via Python's FileHandler. I was wondering if buffering could help here with something like

diff --git a/avocado/core/output.py b/avocado/core/output.py
index da61df1d8..8dda0c565 100644
--- a/avocado/core/output.py
+++ b/avocado/core/output.py
@@ -705,6 +705,12 @@ def add_log_handler(
     if handler_filter:
         handler.addFilter(handler_filter)
 
+    if klass == logging.FileHandler:
+        from logging import handlers
+        buffered_wrapper = handlers.MemoryHandler(capacity=100)
+        buffered_wrapper.setTarget(handler)
+        handler = buffered_wrapper
+
     logger.addHandler(handler)
     save_handler(logger.name, handler)
     return handler

but I am not sure this change is effective at all in the current avocado logging logic, since it also stores handler configuration and performs other additional operations. Do you think that some buffering for the logging could be useful when implemented appropriately? Or do you have any other ideas about what might cause such blocking in the tests? I have observed multiple VT tests waiting for extensive periods, which correlated with the host system being under heavy load.
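
For comparison (an illustrative sketch rather than a proposed avocado change), the standard-library way to keep FileHandler writes from blocking the emitting thread is the QueueHandler/QueueListener pair, which hands the actual file I/O to a background thread; the logger name and log path below are placeholders.

import logging
import queue
from logging.handlers import QueueHandler, QueueListener

# Records are only enqueued by the caller; the listener thread writes them.
log_queue = queue.Queue(-1)  # unbounded queue, so emit() never blocks
file_handler = logging.FileHandler("/tmp/full.log")  # placeholder path
listener = QueueListener(log_queue, file_handler, respect_handler_level=True)
listener.start()

logger = logging.getLogger("avocado.test")  # placeholder logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(QueueHandler(log_queue))

logger.debug("enqueued instantly; written to disk by the listener thread")

listener.stop()  # flushes any remaining records at shutdown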

@pevogam (Contributor, Author) commented Mar 21, 2024

Some interesting further findings: it seems the above will need more handling on the client side, since (right now just suspected but not fully confirmed) I get cases where the VT test completes and the parent task process just keeps waiting for the child task process:

# strace -p 1709063
strace: Process 1709063 attached
wait4(1709064, ^Cstrace: Process 1709063 detached
 <detached ...>

with the test debug log indicating completion, while at the same time the child task process does not fully complete and hangs on a write to a pipe:

# strace -p 1709064
strace: Process 1709064 attached
write(5, "\0\0\1\30\200\4\225\r\1\0\0\0\0\0\0}\224(\214\10log_name\224\214\16a"..., 284^Cstrace: Process 1709064 detached
 <detached ...>

with the anonymous pipe used for logging:

# ls -l /proc/1709064/fd
total 0
lr-x------ 1 root root 64 Mar 21 13:14 0 -> /dev/null
lrwx------ 1 root root 64 Mar 21 13:14 1 -> /tmp/tmpx1w7evnv
...
l-wx------ 1 root root 64 Mar 21 13:14 5 -> 'pipe:[246227689]'

Worse yet, if the task hangs like this, in some cases another task will be started and deadlock on some shared resources like

# ls -l /proc/1805227/fd
total 0
lr-x------ 1 root root 64 Mar 21 12:57 0 -> /dev/null
lrwx------ 1 root root 64 Mar 21 12:57 1 -> /tmp/tmp317_53el
lrwx------ 1 root root 64 Mar 21 12:57 10 -> /tmp/avocado-vt-vm-create.lock

with

# lsof /tmp/avocado-vt-vm-create.lock
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python3 1735751 root   91uW  REG  0,168        0   39 /tmp/avocado-vt-vm-create.lock
python3 1805227 root   10u   REG  0,168        0   39 /tmp/avocado-vt-vm-create.lock

so we really need to tell the client side to retry sending the log message or, better yet, understand what could be causing such connection resets.
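
As a rough illustration of the client-side retry idea (hypothetical code, not avocado's actual runner messaging; the function name, address handling and retry constants are made up):

import json
import socket
import time

RETRIES = 3
RETRY_DELAY = 1.0  # seconds between reconnection attempts


def send_status(address, message):
    """Send one newline-terminated JSON status message, retrying on resets."""
    payload = (json.dumps(message) + "\n").encode()
    for attempt in range(RETRIES):
        try:
            with socket.create_connection(address) as sock:
                sock.sendall(payload)
            return
        except (ConnectionResetError, BrokenPipeError):
            if attempt == RETRIES - 1:
                raise
            time.sleep(RETRY_DELAY)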

@clebergnu self-assigned this Mar 21, 2024
@clebergnu (Contributor) commented

Hi @pevogam, thanks for sharing these findings. The first thing we need is a reliable reproducer for this issue. I know it can be tricky, but without one, we can't really be sure we're actually adding reliability here.

Now, there have been talks and ideas about making most (or all) of the avocado-runner-* runners capable of using a more reliable way of caching and flushing the messages they need to send to the task server. Something along the lines of the "journal" plugin, which keeps track of the messages it has sent to the ancient https://github.com/avocado-framework/avocado-server .

Is that something you think would help?

@clebergnu (Contributor) commented

Oh, and about the specific changes you're proposing here, they look to be a reliability improvement indeed.

I was verifying whether other kinds of exceptions should be treated the same way here, but asyncio.selector_events._SelectorSocketTransport._read_ready__data_received will capture other exceptions, silence them and simply return from the function, so I don't have (at this time) other suggestions.

@pevogam (Contributor, Author) commented Mar 22, 2024

Hi @pevogam, thanks for sharing these findings. The first thing we need is a reliable reproducer for this issue. I know it can be tricky, but without one, we can't really be sure we're actually adding reliability here.

I guess the best we can do to get there is to have lots of VT tests log in parallel and increase the logging throughput so much that some logging inevitably starts hanging as shown above. You are right that this would be pretty hard and require a lot of hardware (which we do use, as we run avocado VT tests in a cluster nowadays). Perhaps I could first gather some hints on good directions to investigate, since I can reproduce this quite easily there, and write back with the outcomes here?

In particular, what are your thoughts on causes that could explain why a task would hang or sometimes spend 20+ minutes logging a message? I assume it has to be the blocking nature of logging in Python; do you second my suggestion here? If so, doesn't blocking logging in avocado defeat some of the point of parallelism?

Now, there have been talks and ideas about making most (or all) of the avocado-runner-* runners capable of using a more reliable way of caching and flushing the messages they need to send to the task server. Something along the lines of the "journal" plugin, which keeps track of the messages it has sent to the ancient https://github.com/avocado-framework/avocado-server .

Is that something you think would help?

That would definitely help. However, one problem with my attempt to add at least some quick and dirty buffering in #5881 (comment) was that it didn't really speed up the runs much, indicating that logging might not be the bottleneck. Nevertheless, it might remedy the situation with the connection resets.

Oh, and about the specific changes you're proposing here, they look to be a reliability improvement indeed.

I was verifying whether other kinds of exceptions should be treated the same way here, but asyncio.selector_events._SelectorSocketTransport._read_ready__data_received will capture other exceptions, silence them and simply return from the function, so I don't have (at this time) other suggestions.

The problem from my last post above is that this only fixes half of the situation, since it seems the tasks that are logging at that moment need some sort of reset or retry too. Otherwise I expect they might hang at the very end of the task as described in #5881 (comment).

pevogam added a commit to pevogam/avocado that referenced this pull request Apr 5, 2024
Details on hanging tasks deadlocking over the same vm lock resources
are described in:

avocado-framework#5881 (comment)

Signed-off-by: Plamen Dimitrov <plamen.dimitrov@intra2net.com>
@pevogam (Contributor, Author) commented Apr 5, 2024

Some interesting further findings: it seems the above will need more handling on the client side

The temporary solution I use right now for both LXC container and distributed runs is:

diff --git a/avocado/plugins/spawners/lxc.py b/avocado/plugins/spawners/lxc.py
index 13bc05b0e..ac9b6a8b5 100644
--- a/avocado/plugins/spawners/lxc.py
+++ b/avocado/plugins/spawners/lxc.py
@@ -253,6 +253,11 @@ class LXCSpawner(Spawner, SpawnerMixin):
         LOG.info(f"Container state: {container.state}")
         LOG.info(f"Container ID: {container_id} PID: {container.init_pid}")
 
+        exitcode, output, err = LXCSpawner.run_container_cmd(
+            container, ["pkill", "-f", "task-run"]
+        )
+        if exitcode == 0:
+            LOG.warning(output)
         exitcode, output, err = await LXCSpawner.run_container_cmd_async(
             container, entry_point_args
         )
diff --git a/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py b/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py
index 89a72d8b1..0cb60a419 100644
--- a/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py
+++ b/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py
@@ -170,6 +170,9 @@ class RemoteSpawner(Spawner, SpawnerMixin):
                 )
                 return False
 
+        status, output = session.cmd_status_output("pkill -f task-run")
+        if status == 0:
+            LOG.warning(output)
         cmd = shlex.join(entry_point_args) + " > /dev/null"
         timeout = self.config.get("spawner.remote.test_timeout")
         status, output = await RemoteSpawner.run_remote_cmd_async(session, cmd, timeout)

since the only cases of hanging tasks are right at the end, when they are already complete and have even reported PASS status to the status server. Another small finding is that this can also happen in less logging/IO intense test jobs, but on rarer occasions.
