Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition leading to the main loop dying? #374

Closed
oakkitten opened this issue Apr 11, 2022 · 17 comments · Fixed by #377
Closed

Possible race condition leading to the main loop dying? #374

oakkitten opened this issue Apr 11, 2022 · 17 comments · Fixed by #377
Labels

Comments

@oakkitten
Copy link

I just might be wrong because if this indeed a race condition it should be breaking more things. Anyway, I got this exception (line numbers might be wrong due to debug statements):

Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/.../lib/python3.8/site-packages/aqt/mediasrv.py", line 89, in run
    self.server.run()
  File "/.../lib/python3.8/site-packages/waitress/server.py", line 323, in run
    self.asyncore.loop(
  File "/.../lib/python3.8/site-packages/waitress/wasyncore.py", line 285, in loop
    poll_fun(timeout, map)
  File "/.../lib/python3.8/site-packages/waitress/wasyncore.py", line 211, in poll
    r, w, e = select.select(r, w, e, timeout)
OSError: [Errno 9] Bad file descriptor

This error was extremely rare but since I was getting it while running tests I could just run a lot of them until one failed, which I did, and I think the problem is a follows.

  1. First, thread Thread-1 that the app I'm testing is launching, one that runs waitress server, assembles the descriptor lists for select:

    r = []
    w = []
    e = []
    for fd, obj in list(map.items()): # list() call FBO py3
    is_r = obj.readable()
    is_w = obj.writable()
    if is_r:
    r.append(fd)
    # accepting sockets should not be writable
    if is_w and not obj.accepting:
    w.append(fd)
    if is_r or is_w:
    e.append(fd)

  2. Then, thread waitress-0 deletes one of the channels, in my case it was <waitress.channel.HTTPChannel 127.0.0.1:54044 at 0x7f10ec052400>, and immediately closes the socket:

    def close(self):
    self.connected = False
    self.accepting = False
    self.connecting = False
    self.del_channel()
    if self.socket is not None:
    try:
    self.socket.close()
    except OSError as why:
    if why.args[0] not in (ENOTCONN, EBADF):
    raise

    Stack of waitiress-0 at the moment:

      File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
      File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      File "/.../lib/python3.8/site-packages/waitress/task.py", line 84, in handler_thread
        task.service()
      File "/.../lib/python3.8/site-packages/waitress/channel.py", line 426, in service
        task.service()
      File "/.../lib/python3.8/site-packages/waitress/task.py", line 168, in service
        self.execute()
      File "/.../lib/python3.8/site-packages/waitress/task.py", line 451, in execute
        self.channel.write_soon(app_iter)
      File "/.../lib/python3.8/site-packages/waitress/channel.py", line 377, in write_soon
        (flushed, exception) = self._flush_exception(self._flush_some)
      File "/.../lib/python3.8/site-packages/waitress/channel.py", line 132, in _flush_exception
        return (flush(), False)
      File "/.../lib/python3.8/site-packages/waitress/channel.py", line 270, in _flush_some
        num_sent = self.send(chunk)
      File "/.../lib/python3.8/site-packages/waitress/wasyncore.py", line 479, in send
        self.handle_close()
      File "/.../lib/python3.8/site-packages/waitress/channel.py", line 317, in handle_close
        wasyncore.dispatcher.close(self)
      File "/.../lib/python3.8/site-packages/waitress/wasyncore.py", line 519, in close
        traceback.print_stack()
  3. Then, thread Thread-1 is trying to see if the file descriptor of the socked closed above is writable, which leads to the the exception above:

    try:
    r, w, e = select.select(r, w, e, timeout)
    except OSError as err:
    if err.args[0] != EINTR:
    raise
    else:
    return

Python 3.8.10, waitress 2.1.1, Ubuntu 20.04 LTS focal @ WSL2

@digitalresistor
Copy link
Member

@oakkitten could you try this patch: #377 and see if it stops the main thread from dying? I have been unable to come up with a good way to test/validate this fixes the issue.

It's not the best solution, but short of rewriting asyncore it hopefully provides relief as we just try again.

@oakkitten
Copy link
Author

I'm not familiar with the codebase, but from what I've seen while trying to catch the issue—

Perhaps, instead of calling del_channel, which mutates map, and then closing the socket in a worker thread, it would be more simple to just flag the channel for closing? Like, you set a flag on a channel, such as must_be_closed_in_the_main_loop, or perhaps have a separate list of objects that must be closed (lists are said to be thread-safe), and then the main loop checks if anything should be closed before doing select(). This way the code would be a bit easier to reason about I think.

Also, you are doing

if fds != map.keys():
    fds = map.keys()

dict.keys() returns not a copy of keys, but a special dict_keys object that is a view of dictionary keys that changes along with the dictionary. So if read this right, fds == map.keys() will always be true.

Also, there's the issue with the modification of map itself. While it never blew up in my face during tests, it probably can, since when you modify a dictionary while iterating you get this:

>>> d = {x: x for x in range(5)}
>>> for x in d.keys():
...     print(x)
...     if x == 3:
...         del d[x]
...
0
1
2
3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

I suppose even doing list(map.keys()) is dangerous.

@digitalresistor
Copy link
Member

Good point on the dict.keys() I forgot about that change in Py3.

We never iterate the map and delete from it though in the code change I made.

I am trying to avoid changing too much of asyncore as that was wholesale lifted from Py 3.5.

If you were to use poll() instead of select() I would imagine the problem would be masked because poll() will simply return that there is an error on the socket, whereas select() fails and raises OSError. So I was attempting to make select() act more like poll() in this situation.

This only really happens if the remote closes the connection before having read the full response, which is likely why it is not often seen in the wild, as most reverse proxies will have already read the full response.

The problem with attempting to add yet another list, is that there is no single object, the map is passed down from the serve function, and is passed around, but to add another list that would mean adding another variable that is passed around to the various functions as the loop() function does not currently have that.

Trying to set a flag inside the thread and trying to make sure that can be read throughout the asyncore functions so that it doesn't close the sockets was something I looked at, but it is difficult as there are various places where asyncore expects to be able to close the socket.

It's going to require more intrusive changes, which makes sense as asyncore was never really meant to be thread safe. This is a consequence of trying to send data faster without waking the main thread up.

@digitalresistor
Copy link
Member

Using list(map.keys()) should be fine. Since the only reason select() can fail is if it has a file descriptor that is no longer valid (and thus no longer in the map). Comparing the original map we used for select() against the new map.keys() should show that there is a difference, at that point we just start the select() call again with the new list of file descriptors.

Eventually a call to select() will succeed.

I am going to see if I can write a test that causes the race.

You mentioned you are seeing this in tests, are you running waitress in a thread or a separate process?

@oakkitten
Copy link
Author

You don't iterate the map and delete from it in the code change you made, but I think you might be doing that in the already existing code?

So the question is, is list() doing or calling anything that touches GIL in some kind of a guaranteed way. Well, I don't know, really. I asked on #python if it will call some magic thread-safe method for converting keys to a list and no-one was aware of such a method. And I don't see the documentation mention anything like that. Even if this is safe and indeed not an implementation detail, I know I wouldn't want to rely on this behavior, anyway. This is too complicated!
image

I am testing an addon for a Qt app that uses waitress internally. It launches it in a thread. This is a patch I made to fix the issue in a dumb way, which should be ok for tests. If you are willing, you can try grabbing the commit before that and running the tests (see tox.ini for instructions; the tests are slow and require a lot of dependencies so it won't be fun). The problem is rare, but you can force it by inserting some kind of a delay before the select call, then it manifests right away.

@digitalresistor
Copy link
Member

I updated #377 (comment) with the new changes. Give this a shot please.

@digitalresistor
Copy link
Member

It removes any races, only the main thread can close the socket...

@oakkitten
Copy link
Author

I pulled 4f6789b and 4800 tests and 108 minutes later there were no crash. Thanks! 🎉

wesleybl added a commit to collective/collective.cover that referenced this issue Jun 2, 2022
waitress = 2.1.2 fix the error:

   OSError: [Errno 9] Bad file descriptor

See: Pylons/waitress#374

This error causes the server to "freeze" on robot tests and the tests
never finish.
@lamby
Copy link

lamby commented Jun 21, 2022

I spot that the description for CVE-2022-31015 mentions that this affects "versions 2.1.0 and 2.1.1". However, a quick glance at the code suggests that this might be because it affects the wasyncore module which was vendored in version 1.2. Does this issue, then, affect these older versions as well? Thanks. :)

@mmerickel
Copy link
Member

No. It’s not a bug in wasyncore but rather waitress began trying to invoke wasyncore methods like close() from other threads that caused the issue.

@lamby
Copy link

lamby commented Jun 21, 2022

Getcha. However, the fix essentially requires the vendored version, no? Otherwise the do_close thing can't be passed around? :)

@mmerickel
Copy link
Member

It enables a performance optimization where waitress can write to the socket safely from a thread.

@mmerickel mmerickel reopened this Jun 21, 2022
@digitalresistor
Copy link
Member

@lamby the vendored version is always used, even on Python versions that have asyncore support built-in. Waitress does not attempt to use the non-vendored version and then fall back to the vendored version.

@lamby
Copy link

lamby commented Jun 22, 2022

That makes sense. However, someone using a very very old version of waitress (prior to the module being vendored in, that is...) would be vulnerable to this issue?

@mmerickel
Copy link
Member

Again no because the bug was only due to a change in how we USED asyncore in the specified versions of waitress. Waitress used it differently and safely before then. And it does again after the cve fix. We documented the affected versions correctly in the cve.

@lamby
Copy link

lamby commented Jun 22, 2022

Thanks, really appreciate it. :)

@lettow-humain
Copy link

I am facing this as well:

import waitress
from flask import Flask

app = Flask(__name__)

wsgi = waitress.create_server(app, host='0.0.0.0', port='1234')
wsgi.run()

While the main loop is running, I am calling wsgi.close() from elsewhere, which yields:

Traceback (most recent call last):
  File "/myPath/./main.py", line 1368, in main
    self.wsgi.run()
  File "/myPath/lib/python3.11/site-packages/waitress/server.py", line 322, in run
    self.asyncore.loop(
  File "/myPath/lib/python3.11/site-packages/waitress/wasyncore.py", line 245, in loop
    poll_fun(timeout, map)
  File "/myPath/lib/python3.11/site-packages/waitress/wasyncore.py", line 172, in poll
    r, w, e = select.select(r, w, e, timeout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 9] Bad file descriptor

Is there a recommended way of preventing this or handling it in a more clean way? Thanks!
I'm using

python3.11
waitress2.1.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants