gevent.hub.LoopExit: This operation would block forever #2010

Open
F3ngdw opened this issue Jan 8, 2024 · 3 comments
Labels
Type: Question (User support and/or waiting for responses)

Comments


F3ngdw commented Jan 8, 2024

  • gevent version: 1.0.2
  • Python version: 2.7.5
  • Operating System: centos7

Description:

Hi, @jamadden
My project is based on https://github.com/ceph/calamari. Recently, the following error has been appearing in the cthulhu logs in our production environment:

Traceback (most recent call last):
  File "/opt/calamari/venv/bin/cthulhu-manager", line 9, in <module>
    load_entry_point('calamari-cthulhu==0.1', 'console_scripts', 'cthulhu-manager')()
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/manager.py", line 1156, in main
    complete.wait(timeout=1)
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/event.py", line 77, in wait
    result = self.hub.switch()
  File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 338, in switch
    return greenlet.switch(self)
gevent.hub.LoopExit: This operation would block forever

The error comes from https://github.com/ceph/calamari/blob/master/cthulhu/cthulhu/manager/manager.py, in the program's main entry point, the main() function.
Apart from the error above, the logs contain no abnormal information. I have been researching this for a long time and have searched through all the issues, but I have not found an answer.

GDB debug info:
Abnormal GDB output from the production environment:

(gdb) info threads
  Id   Target Id         Frame 
  19   Thread 0x7f3b5ec01740 (LWP 25334) 0x00007f3b5e417afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
  18   Thread 0x7f3b434c1700 (LWP 17974) 0x00007f3b5e417afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
  17   Thread 0x7f3b03fff700 (LWP 28483) 0x00007f3b5e417afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
  16   Thread 0x7f3b01ffb700 (LWP 27307) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  15   Thread 0x7f3b027fc700 (LWP 27306) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  14   Thread 0x7f3b02ffd700 (LWP 27305) 0x00007f3b5da31e83 in epoll_wait () from /lib64/libc.so.6
  13   Thread 0x7f3b20ff9700 (LWP 27296) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  12   Thread 0x7f3b21ffb700 (LWP 27225) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  11   Thread 0x7f3b227fc700 (LWP 27224) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  10   Thread 0x7f3b22ffd700 (LWP 27222) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  9    Thread 0x7f3b237fe700 (LWP 27220) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  8    Thread 0x7f3b23fff700 (LWP 27219) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  7    Thread 0x7f3b40ffd700 (LWP 27218) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  6    Thread 0x7f3b417fe700 (LWP 27217) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  5    Thread 0x7f3b41fff700 (LWP 27215) 0x00007f3b5da26c0d in poll () from /lib64/libc.so.6
  4    Thread 0x7f3b42800700 (LWP 27214) 0x00007f3b5da26c0d in poll () from /lib64/libc.so.6
  3    Thread 0x7f3b43e42700 (LWP 27213) 0x00007f3b5da28973 in select () from /lib64/libc.so.6
  2    Thread 0x7f3b44703700 (LWP 27041) 0x00007f3b5da31e83 in epoll_wait () from /lib64/libc.so.6
* 1    Thread 0x7f3b44f04700 (LWP 27040) 0x00007f3b5da31e83 in epoll_wait () from /lib64/libc.so.6

Normal GDB output from the test environment:

(gdb) info threads
  Id   Target Id         Frame 
* 19   Thread 0x7f337dbca700 (LWP 13619) "cthulhu-manager" 0x00007f33977bce83 in epoll_wait () from /lib64/libc.so.6
  18   Thread 0x7f337d3c9700 (LWP 13620) "cthulhu-manager" 0x00007f33977bce83 in epoll_wait () from /lib64/libc.so.6
  17   Thread 0x7f337cb08700 (LWP 15077) "cthulhu-manager" 0x00007f33977b1c0d in poll () from /lib64/libc.so.6
  16   Thread 0x7f3377fff700 (LWP 15079) "cthulhu-manager" 0x00007f33977b1c0d in poll () from /lib64/libc.so.6
  15   Thread 0x7f33777fe700 (LWP 15211) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  14   Thread 0x7f3376ffd700 (LWP 15212) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  13   Thread 0x7f33767fc700 (LWP 15213) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  12   Thread 0x7f3375ffb700 (LWP 15214) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  11   Thread 0x7f33757fa700 (LWP 15218) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  10   Thread 0x7f3374ff9700 (LWP 15219) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  9    Thread 0x7f3357fff700 (LWP 15220) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  8    Thread 0x7f33577fe700 (LWP 15306) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  7    Thread 0x7f3356ffd700 (LWP 15337) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  6    Thread 0x7f33557fa700 (LWP 15582) "cthulhu-manager" 0x00007f33977bce83 in epoll_wait () from /lib64/libc.so.6
  5    Thread 0x7f3354ff9700 (LWP 15584) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  4    Thread 0x7f3333fff700 (LWP 15587) "cthulhu-manager" 0x00007f33977b3973 in select () from /lib64/libc.so.6
  3    Thread 0x7f33567fc700 (LWP 21437) "cthulhu-manager" 0x00007f33981a2afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
  2    Thread 0x7f3355ffb700 (LWP 9427) "cthulhu-manager" 0x00007f33981a2afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
  1    Thread 0x7f339898c740 (LWP 3741) "cthulhu-manager" 0x00007f33977bce83 in epoll_wait () from /lib64/libc.so.6

Comparing the stack information, thread 19 in the production environment should correspond to thread 1 in the test environment.
Thread 19 backtrace (production environment):

(gdb) thread 19
[Switching to thread 19 (Thread 0x7f3b5ec01740 (LWP 25334))]
#0  0x00007f3b5e417afb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
(gdb) py-list
 334            waiter.acquire()
 335            self.__waiters.append(waiter)
 336            saved_state = self._release_save()
 337            try:    # restore state no matter what (e.g., KeyboardInterrupt)
 338                if timeout is None:
>339                    waiter.acquire()
 340                    if __debug__:
 341                        self._note("%s.wait(): got it", self)
 342                else:
 343                    # Balancing act:  We can't afford a pure busy loop, so we
 344                    # have to sleep; but if we sleep the whole timeout time,
(gdb) py-bt
#3 Waiting for a lock (e.g. GIL)
#4 Waiting for a lock (e.g. GIL)
#6 Frame 0x348e730, for file /usr/lib64/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f3b46bc0d70>, acquire=<built-in method acquire of thread.lock object at remote 0x7f3b46bc0d70>, _Condition__waiters=[<thread.lock at remote 0x7f3b429e35b0>], release=<built-in method release of thread.lock object at remote 0x7f3b46bc0d70>) at remote 0x7f3b435bbd90>, timeout=None, balancing=True, waiter=<thread.lock at remote 0x7f3b429e35b0>, saved_state=None)
    waiter.acquire()
#10 Frame 0x7f3b242bd840, for file /usr/lib64/python2.7/threading.py, line 951, in join (self=<Ticker(_run_on_start=True, _Thread__ident=139891958597376, _callback=<instancemethod at remote 0x7f3b435b6230>, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f3b46bc0d70>, acquire=<built-in method acquire of thread.lock object at remote 0x7f3b46bc0d70>, _Condition__waiters=[<thread.lock at remote 0x7f3b429e35b0>], release=<built-in method release of thread.lock object at remote 0x7f3b46bc0d70>) at remote 0x7f3b435bbd90>, _Thread__name='Thread-10', _Thread__daemonic=False, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f3b46bc0af0>, acquire=<built-in method acquire of thread.lock object at remote 0x7f3b46bc0af0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f3b46bc0af0>) at remote 0x7f3b435bbe50>) at remote 0x7f3b...(truncated)
    self.__block.wait()
#14 Frame 0x7f3af8029a80, for file /usr/lib64/python2.7/threading.py, line 1109, in _exitfunc (self=<_MainThread(_Thread__ident=139892969445184, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f3b5eb94b70>, acquire=<built-in method acquire of thread.lock object at remote 0x7f3b5eb94b70>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f3b5eb94b70>) at remote 0x7f3b525169d0>, _Thread__name='MainThread', _Thread__daemonic=False, _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f3b5eb94c30>, acquire=<built-in method acquire of thread.lock object at remote 0x7f3b5eb94c30>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f3b5eb94c30>) at remote 0x7f3b52516910>) at remote 0x7f3b52516850>, _Thread__stderr=<file at remote 0x7f3b5ebe31e0>, _Thread__target=None, _Thread__kwargs={...(truncated)
    t.join()

Thread 1 backtrace (test environment):

(gdb) py-list
 366            assert self is getcurrent(), 'Do not call Hub.run() directly'
 367            while True:
 368                loop = self.loop
 369                loop.error_handler = self
 370                try:
>371                    loop.run()
 372                finally:
 373                    loop.error_handler = None  # break the refcount cycle
 374                self.parent.throw(LoopExit('This operation would block forever'))
 375            # this function must never return, as it will cause switch() in the parent greenlet
 376            # to return an unexpected value
(gdb) py-bt
#6 Frame 0x7f337dbe2b60, for file /opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py, line 371, in run (self=<Hub(_resolver=None, format_context=<function at remote 0x7f33830baaa0>, _threadpool=None, loop=<gevent.core.loop at remote 0x7f337fd46ef0>) at remote 0x7f337fd47f50>, loop=<gevent.core.loop at remote 0x7f337fd46ef0>)
    loop.run()

As can be seen, thread 1 in the test environment is running gevent's loop normally, but thread 19 in the production environment has exited it abnormally.
I have read many issues that mention monkey patching, and I have checked calamari's code: it indeed does not apply the monkey patch, so I suspect that is the reason for the error. But because I cannot reproduce this error in my test environment, I would like to ask whether you could give me some ideas on this issue. Thank you!
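
For reference, the LoopExit itself is easy to reproduce in isolation. The snippet below is only a generic sketch of this class of failure (it is not the calamari code path): the main greenlet blocks on a gevent Event that nothing will ever set, so the hub has nothing left to run.

import gevent.event

complete = gevent.event.Event()
# No greenlet, timer or I/O watcher exists that could ever set this event,
# so the hub's loop has nothing to do and wait() raises
# "gevent.hub.LoopExit: This operation would block forever".
complete.wait()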


F3ngdw commented Jan 8, 2024

Just to add:
A Ticker class appears in the py-bt of thread 19. Ticker is actually a utility thread class used to implement timers:

import threading
import time
import traceback

# "log" is the module-level logger defined elsewhere in the project.

class Ticker(threading.Thread):
    def __init__(self, period, callback, *args, **kwargs):
        self._run_on_start = kwargs.pop("run_on_start", True)
        super(Ticker, self).__init__(*args, **kwargs)
        self._period = period
        self._callback = callback
        self._hand = True  # loop flag; cleared by stop()

    def stop(self):
        self._hand = False

    def run(self):
        while self._hand:
            if not self._run_on_start:
                time.sleep(self._period)

            try:
                self._callback()
            except BaseException:
                log.error(traceback.format_exc())

            if self._run_on_start:
                time.sleep(self._period)

Could the cause be that these Ticker threads run without the monkey patch applied?

jamadden added the Type: Question (User support and/or waiting for responses) label on Jan 8, 2024

jamadden commented Jan 8, 2024

Yes, a LoopExit can mean there is no other greenlet to switch to, and that in turn can be caused by only partly monkey-patching (e.g., not patching thread, but patching socket) or by using gevent objects directly but neglecting to use greenlets.
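
A rough sketch of the second case (generic code, not from any particular project): the wait below completes only because a greenlet that the hub can schedule will eventually set the event; a plain, unpatched OS thread calling set() is not a substitute, since gevent's synchronization objects are not thread-safe.

import gevent
import gevent.event

complete = gevent.event.Event()

def worker():
    gevent.sleep(1)
    complete.set()   # runs in a greenlet the hub knows about

gevent.spawn(worker)
complete.wait()      # returns after ~1 second instead of raising LoopExit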

gevent version: 1.0.2
Python version: 2.7.5

That is a truly ancient version of gevent, almost 9 years old. I don't even remember enough about how it worked to hazard a guess beyond what is already documented (i.e., you have no other greenlet to switch to). I strongly encourage upgrading.

Your Python version is also not supported by recent gevent releases. The last version that supported Python 2 was from 2022, but it needs at least Python 2.7.9.


F3ngdw commented Jan 9, 2024

@jamadden Thank you for your reply!

and that in turn can be caused by only partly monkey-patching (e.g., not patching thread, but patching socket)

No, my project doesn't patch anything.

by using gevent objects directly but neglecting to use greenlets.

Sorry, I don't understand what this means. In my project, the coroutines are implemented by subclassing gevent.greenlet.Greenlet, e.g.:

import gevent.event
import gevent.greenlet
import gevent.queue

# Session, DeferredCall and log are defined elsewhere in the project.

class Persister(gevent.greenlet.Greenlet):
    def __init__(self):
        super(Persister, self).__init__()

        self._queue = gevent.queue.Queue()
        self._complete = gevent.event.Event()

        self._session = Session()

    def __getattribute__(self, item):
        """
        Wrap functions with logging
        """
        if item.startswith('_'):
            return object.__getattribute__(self, item)
        else:
            try:
                return object.__getattribute__(self, item)
            except AttributeError:
                try:
                    attr = object.__getattribute__(self, "_%s" % item)
                    if callable(attr):
                        def defer(*args, **kwargs):
                            dc = DeferredCall(attr, args, kwargs)
                            self._queue.put(dc)

                        return defer
                    else:
                        return object.__getattribute__(self, item)
                except AttributeError:
                    return object.__getattribute__(self, item)
  
    def _run(self):
        log.info("Persister listening")

        while not self._complete.is_set():
            try:
                data = self._queue.get(block=True, timeout=1)
            except gevent.queue.Empty:
                continue
            else:
                try:
                    data.fn(*data.args, **data.kwargs)
                    self._session.commit()
                except Exception:
                    # Catch-all because all kinds of things can go wrong and
                    # our behaviour is the same: log the exception, the data
                    # that caused it, then try to go back to functioning.
                    log.exception(
                        "Persister exception persisting data: %s" %
                        (data.fn,))

                    self._session.rollback()

    def stop(self):
        self._complete.set()

And I have another question: is it necessary to apply the monkey patch when gevent and native threads coexist (something like the sketch below)?
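
If I understand correctly, the usual pattern would be something like this generic sketch (not calamari code), with the patch applied before anything else is imported:

# These two lines must run before any other module is imported,
# ideally as the very first statements of the program's entry point.
from gevent import monkey
monkey.patch_all()   # patches thread/threading, time, socket, select, ...

import threading
import time

# After patch_all(), a threading.Thread such as Ticker is backed by a
# greenlet, and time.sleep() yields to the hub instead of blocking it.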
Thank you again; I look forward to your reply.
