
Can't log from within tpool.execute #432

Open
smerritt opened this issue Aug 18, 2017 · 32 comments

@smerritt
Contributor

If you try to log from within a function called by tpool.execute, there is a chance that the tpool thread never returns, and you see a stack trace like this one:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 458, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/semaphore.py", line 147, in _do_acquire
    waiter.switch()
error: cannot switch to a different thread

This is because many logging handlers have mutexes that are threading._RLock objects, and Eventlet's replacement for thread.allocate_lock returns an eventlet.semaphore.Semaphore object, which does not work across different hubs in different pthreads.

Here's a small script to reproduce the issue:

#!/usr/bin/env python
#
# Demonstrates the crash with logging across pthreads
import eventlet.patcher
import eventlet.tpool
import logging
import random
import sys
import time

eventlet.patcher.monkey_patch()

logger = logging.getLogger("logger-test")
# This handler's .lock is a threading._RLock
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)


def log_n_times(me, n):
    for x in range(n):
        logger.info("%s %d", me, x)
        time.sleep(random.random() * 0.01)


logger.info("starting")

greenthread = eventlet.spawn(log_n_times, 'greenthread', 50)
eventlet.tpool.execute(log_n_times, 'pthread', 50)
greenthread.wait()

logger.info("done")

The bug was originally found in Openstack Swift, and there's a better writeup at https://bugs.launchpad.net/swift/+bug/1710328 and a commit at openstack/swift@6d16079 that fixes the problem, but only for Swift, and not in a general way.

I'd like to figure out how to fix this in general, but I'm not sure how to proceed. Using a pipe-based mutex for all _RLock objects would work, but would be a very expensive fix. Perhaps just the locks in logging handlers, because those are global?
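For concreteness, here is a minimal sketch of the pipe-based mutex idea (roughly along the lines of the Swift fix linked above, not its actual code); the class name and details are illustrative and error handling is omitted:

import fcntl
import os

import eventlet.debug
import eventlet.hubs

# Several greenthreads may end up waiting on the same read fd, which
# eventlet normally forbids; the Swift fix disables that check.
eventlet.debug.hub_prevent_multiple_readers(False)


class PipeMutex(object):
    """Mutex backed by a pipe: one byte in the pipe means "unlocked"."""

    def __init__(self):
        self.rfd, self.wfd = os.pipe()
        # Non-blocking reads so a failed acquire can yield to the hub
        # instead of blocking the whole OS thread.
        flags = fcntl.fcntl(self.rfd, fcntl.F_GETFL)
        fcntl.fcntl(self.rfd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
        os.write(self.wfd, b'-')  # start out unlocked

    def acquire(self):
        while True:
            try:
                os.read(self.rfd, 1)  # grab the token
                return
            except OSError:  # EAGAIN: someone else holds the token
                pass
            # Wait for the read end to become readable; in a greenthread
            # this yields to the hub, so other greenthreads keep running.
            eventlet.hubs.trampoline(self.rfd, read=True)

    def release(self):
        os.write(self.wfd, b'-')  # put the token back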

@temoto
Member

temoto commented Aug 18, 2017

@smerritt thank you for this sad information.

Please try replacing your custom PipeMutex with eventlet.patcher.original('threading').Lock. If that works, I think we could include a custom green version of logging until the actual problem is fixed.

The actual problem is, of course, green locks not working across OS threads.
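For reference, a minimal sketch of what that swap might look like for a single handler; RLock is used here (rather than Lock) only to match what Handler.createLock() normally builds:

import logging
import sys

import eventlet.patcher

# The real, unpatched threading module.
original_threading = eventlet.patcher.original('threading')

handler = logging.StreamHandler(sys.stdout)
# Swap the green RLock created by Handler.createLock() for a real OS-level
# lock, which can be acquired and released from any pthread.
handler.lock = original_threading.RLock()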

@smerritt
Contributor Author

@temoto That produces correct results, but the performance is not good. My usual process is a WSGI server, so a bunch of greenthreads on the main OS thread, and a few other OS threads hidden inside eventlet.tpool. When a tpool thread has the lock and a main-OS-thread greenthread tries to get it, the whole main OS thread is blocked. The pipe gives me a file descriptor to wait on, so in this case, other greenthreads in the main OS thread can keep working.

@temoto
Member

temoto commented Aug 18, 2017

Sorry, I didn't think enough before replying. Of course it will block all other greenthreads.

You know, tpool itself used a pipe for some time and then switched to a local socket connection. I think with a little ugly creativity you could leverage that synchronisation:

original_threading = eventlet.patcher.original('threading')
lock = eventlet.tpool.Proxy(original_threading.Lock())

It's still as slow as a pipe, but at least you don't have to add a crutch PipeMutex to the code base.

@smerritt
Contributor Author

The proxied lock only works as long as you never run out of tpool threads.

Imagine you've got only two tpool threads, A and B, plus the main thread.

Thread A calls .acquire(), which goes into the tpool, and thread B picks it up, locks the lock, and goes back in the pool to do more work.

The next work item also needs the lock, so thread B calls .acquire(), the work item goes in the queue, and B blocks.

Thread A finishes with the lock, calls .release(), the work item goes in the queue, and A blocks.

@temoto
Member

temoto commented Aug 22, 2017

Yes it's wasting tpool threads, but they're cheap and easy to increase.

@smerritt
Contributor Author

True, but you can exhaust any finite tpool.

Thread A has the lock. B wants it, calls acquire. Now A has it, B waits on C, C waits on the lock. Then D wants it: A has it, B waits on C, C waits on the lock, D waits on E, E waits on the lock.

You can fill up the tpool with these pretty quickly. All it takes is one thread to hold the lock for a long time.

@temoto
Member

temoto commented Aug 22, 2017

@smerritt dear Sam, I'm confused; I would imagine A eventually releases the lock and the whole thing continues. It seems only harmful as starvation against other uses of the tpool. But considering that the tpool-proxied lock was a dirty workaround in the first place, this hardly deserves our time?

The real solution is to make eventlet work across OS threads. Today I was thinking about how to implement it; general greenlet multiplexing on OS threads is probably a bit too complex for now, but special treatment of synchronisation primitives seems doable.

@smerritt
Contributor Author

If all the tpool threads are occupied, then the proxied call to release() will never happen, or at least that's what I thought. Perhaps if acquire() is proxied but release() is not, then it would all work.

You are, of course, correct that the answer is to make eventlet semaphores work across OS threads. I'm afraid I don't currently have any useful ideas to offer up in that domain.

openstack-gerrit pushed a commit to openstack/cinder that referenced this issue Jan 24, 2018
Since change I1f1d9c0d6e3f04f1ecd5ef7c5d813005ee116409 we are running
parts of the backups on native threads, which due to an eventlet bug [1]
have bad interactions with greenthreads, so we have to avoid any logging
when executing code in a native thread.

This patch removes the MD5 logging on the SwiftObjectWriter close
method and adds comments and docstring referring to this limitation.

[1] eventlet/eventlet#432

Closes-Bug: #1745168
Change-Id: I0857cecd7d8ab0ee7e3e9bd6e15f4987ede4d653
openstack-gerrit pushed a commit to openstack/cinder that referenced this issue Feb 3, 2018
Since change I1f1d9c0d6e3f04f1ecd5ef7c5d813005ee116409 we are running
parts of the backups on native threads, which due to an eventlet bug [1]
have bad interactions with greenthreads, so we have to avoid any logging
when executing code in a native thread.

This patch removes the MD5 logging on the SwiftObjectWriter close
method and adds comments and docstring referring to this limitation.

[1] eventlet/eventlet#432

Closes-Bug: #1745168
Change-Id: I0857cecd7d8ab0ee7e3e9bd6e15f4987ede4d653
(cherry picked from commit c6cb84b)
amito pushed a commit to Infinidat/cinder that referenced this issue Mar 8, 2018
Since change I1f1d9c0d6e3f04f1ecd5ef7c5d813005ee116409 we are running
parts of the backups on native threads, which due to an eventlet bug [1]
have bad interactions with greenthreads, so we have to avoid any logging
when executing code in a native thread.

This patch removes the MD5 logging on the SwiftObjectWriter close
method and adds comments and docstring referring to this limitation.

[1] eventlet/eventlet#432

Closes-Bug: #1745168
Change-Id: I0857cecd7d8ab0ee7e3e9bd6e15f4987ede4d653
gozer-gerrit pushed a commit to ArdanaCLM/cinder-ansible that referenced this issue May 25, 2018
This patch sets the log level for cinder backup process to
WARNING because of a bug in eventlet as described here:
eventlet/eventlet#432

Cinder volume doesn't have this problem, because it uses tooz locks
everywhere.

Change-Id: I96c1e61c442d9fd3ff2e016ede1b3b19ab4ba171
openstack-gerrit pushed a commit to openstack-archive/glare that referenced this issue Oct 24, 2018
As of now there no solution to the issue where thread is getting
stuck in eventlet.
Few other similar incidents and without proper solution:
https://bugs.launchpad.net/cinder/+bug/1694509
eventlet/eventlet#432
eventlet/eventlet#492
eventlet/eventlet#395

Change-Id: Ib278780ccb20b9cbef50f54ba1a1ad33761c8002
closes-bug: #1742729
openstack-gerrit pushed a commit to openstack-archive/glare that referenced this issue Mar 10, 2019
As of now there no solution to the issue where thread is getting
stuck in eventlet.
Few other similar incidents and without proper solution:
https://bugs.launchpad.net/cinder/+bug/1694509
eventlet/eventlet#432
eventlet/eventlet#492
eventlet/eventlet#395

Originally was taken from: https://review.openstack.org/#/c/613023/1


Change-Id: Ic924f0ef0cb632b2439dfb7d1092bebf54adb863
closes-bug: #1742729
@hemna

hemna commented Mar 18, 2021

This is still an open issue?

@temoto
Member

temoto commented Mar 18, 2021

Reproduction script still fails, yes.

openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Aug 15, 2022
* Update oslo.log from branch 'master'
  to 94b9dc32ec1f52a582adbd97fe2847f7c87d6c17
  - Fix logging in eventlet native threads
    
    There is a bug in eventlet where logging within a native thread can lead
    to a deadlock situation: eventlet/eventlet#432
    
    When encountered with this issue some projects in OpenStack using
    oslo.log, eg. Cinder, resolve them by removing any logging withing
    native threads.
    
    There is actually a better approach. The Swift team came up with a
    solution a long time ago [1], and in this patch that fix is included as
    part of the setup method, but will only be run if the eventlet library
    has already been loaded.
    
    This patch adds the eventlet library as a testing dependency for the
    PipeMutext unit tests.
    
    [1]: https://opendev.org/openstack/swift/commit/69c715c505cf9e5df29dc1dff2fa1a4847471cb6
    
    Closes-Bug: #1983863
    Change-Id: Iac1b0891ae584ce4b95964e6cdc0ff2483a4e57d
openstack-mirroring pushed a commit to openstack/oslo.log that referenced this issue Aug 15, 2022
There is a bug in eventlet where logging within a native thread can lead
to a deadlock situation: eventlet/eventlet#432

When encountered with this issue some projects in OpenStack using
oslo.log, eg. Cinder, resolve them by removing any logging withing
native threads.

There is actually a better approach. The Swift team came up with a
solution a long time ago [1], and in this patch that fix is included as
part of the setup method, but will only be run if the eventlet library
has already been loaded.

This patch adds the eventlet library as a testing dependency for the
PipeMutext unit tests.

[1]: https://opendev.org/openstack/swift/commit/69c715c505cf9e5df29dc1dff2fa1a4847471cb6

Closes-Bug: #1983863
Change-Id: Iac1b0891ae584ce4b95964e6cdc0ff2483a4e57d
Akrog added a commit to Akrog/cinder-operator that referenced this issue Sep 9, 2022
Cinder services as deployed by the operator just hang and will enter an
unending loop of kill and restart due to the Liveness probes.

What we see in the container logs differ from the cinder-api to the
other containers.

In the cinder-api we just see that it stops responding to requests, and
on the other services we see this exception:

  Traceback (most recent call last):
    File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
      timer()
    File "/usr/lib/python3.9/site-packages/eventlet/hubs/timer.py", line 59, in __call__
      cb(*args, **kw)
    File "/usr/lib/python3.9/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
      waiter.switch()
  greenlet.error: cannot switch to a different thread

In both cases the issue is the same, there is some logging happening on
a native thread and this is creating problems for eventlet, to the point
where it hangs.

This is a known bug in eventlet [1], one which I recently fixed in
Oslo-Log [2].

Since this is not fixed in all OpenStack releases, certainly not the one
this operator is currently using, we need to be careful with what we
actually enable for logging.

The logging we currently have enables debugging for EVERYTHING (rabbit,
sqlalchemy, oslo libraries, etc.), regardless of what we set in the
`debug` option and `default_log_levels` in `cinder.conf`.

This logging override is done via the `logging.conf` file and creates
the problem of the native thread logging.

Using the `logging.conf` file diverges from the approach we want for the
Cinder Operator, where we try to make the configuration of the Cinder
services with the operator be as close as possible to a manual Cinder
service configuration.

This patch removes the usage of the `logging.conf` file by the operator
and uses the `cinder.conf` template to set the right logging
configuration options.

We set `log_file = /dev/stdout` in `cinder.conf` instead of the usual
`log_file =` because then Cinder services would log to `stderr` and make
`httpd` on the cinder-api container treat all Cinder-API logs as errors,
prepending additional information to every single cinder log message,
like this:

  Thu Sep 08 08:21:36.404638 2022] [wsgi:error] [pid 15:tid 69] (sqlalchemy.orm.mapper.Mapper): 2022-09-08 08:21:36,404 INFO

[1]: eventlet/eventlet#432
[2]: https://review.opendev.org/c/openstack/oslo.log/+/852443
Akrog added a commit to Akrog/cinder-operator that referenced this issue Sep 9, 2022
Cinder services as deployed by the operator just hang and will enter an
unending loop of kill and restart due to the Liveness probes.

What we see in the container logs differ from the cinder-api to the
other containers.

In the cinder-api we just see that it stops responding to requests, and
on the other services we see this exception:

  Traceback (most recent call last):
    File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
      timer()
    File "/usr/lib/python3.9/site-packages/eventlet/hubs/timer.py", line 59, in __call__
      cb(*args, **kw)
    File "/usr/lib/python3.9/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
      waiter.switch()
  greenlet.error: cannot switch to a different thread

In both cases the issue is the same, there is some logging happening on
a native thread and this is creating problems for eventlet, to the point
where it hangs.

This is a known bug in eventlet [1], one which I recently fixed in
Oslo-Log [2].

Since this is not fixed in all OpenStack releases, certainly not the one
this operator is currently using, we need to be careful with what we
actually enable for logging.

The logging we currently have enables debugging for EVERYTHING (rabbit,
sqlalchemy, oslo libraries, etc.), regardless of what we set in the
`debug` option and `default_log_levels` in `cinder.conf`.

This logging override is done via the `logging.conf` file and creates
the problem of the native thread logging.

Using the `logging.conf` file diverges from the approach we want for the
Cinder Operator, where we try to make the configuration of the Cinder
services with the operator be as close as possible to a manual Cinder
service configuration.

This patch removes the usage of the `logging.conf` file by the operator
and uses the `cinder.conf` template to set the right logging
configuration options.

We set `log_file = /dev/stdout` in `cinder.conf` instead of the usual
`log_file =` because then Cinder services would log to `stderr` and make
`httpd` on the cinder-api container treat all Cinder-API logs as errors,
prepending additional information to every single cinder log message,
like this:

  Thu Sep 08 08:21:36.404638 2022] [wsgi:error] [pid 15:tid 69] (sqlalchemy.orm.mapper.Mapper): 2022-09-08 08:21:36,404 INFO

References to the `logging.conf` file have been removed from CRD
descriptions and other code locations.

[1]: eventlet/eventlet#432
[2]: https://review.opendev.org/c/openstack/oslo.log/+/852443
@ebolam

ebolam commented Sep 21, 2022

Still an issue. Running across this issue with loguru when running flask in eventlet.

Carthaca added a commit to sapcc/cinder that referenced this issue Dec 14, 2022
fixes hanging thread due to
eventlet/eventlet#432
which may get fixed for oslo.log in 5.0.1
with openstack/oslo.log@94b9dc3
(at the time of writing master in antelope cycle is constraint to 5.0.0)
Carthaca added a commit to sapcc/cinder that referenced this issue Jan 2, 2023
fixes hanging thread due to
eventlet/eventlet#432
which may get fixed for oslo.log in 5.0.1
with openstack/oslo.log@94b9dc3
(at the time of writing master in antelope cycle is constraint to 5.0.0)
hemna pushed a commit to sapcc/cinder that referenced this issue Jan 23, 2023
fixes hanging thread due to
eventlet/eventlet#432
which may get fixed for oslo.log in 5.0.1
with openstack/oslo.log@94b9dc3
(at the time of writing master in antelope cycle is constraint to 5.0.0)
hemna pushed a commit to sapcc/cinder that referenced this issue Aug 29, 2023
fixes hanging thread due to
eventlet/eventlet#432
which may get fixed for oslo.log in 5.0.1
with openstack/oslo.log@94b9dc3
(at the time of writing master in antelope cycle is constraint to 5.0.0)
hemna pushed a commit to sapcc/cinder that referenced this issue Aug 30, 2023
fixes hanging thread due to
eventlet/eventlet#432
which may get fixed for oslo.log in 5.0.1
with openstack/oslo.log@94b9dc3
(at the time of writing master in antelope cycle is constraint to 5.0.0)
hemna pushed a commit to sapcc/cinder that referenced this issue Sep 13, 2023
fixes hanging thread due to
eventlet/eventlet#432
which may get fixed for oslo.log in 5.0.1
with openstack/oslo.log@94b9dc3
(at the time of writing master in antelope cycle is constraint to 5.0.0)
@frittentheke

frittentheke commented Mar 7, 2024

This is still an open issue?

Reproduction script still fails, yes.

We are regularly running into this issue with different OpenStack components:

nova-compute[6888]: Traceback (most recent call last):
nova-compute[6888]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
nova-compute[6888]:     timer()
nova-compute[6888]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
nova-compute[6888]:     cb(*args, **kw)
nova-compute[6888]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
nova-compute[6888]:     waiter.switch()
nova-compute[6888]: greenlet.error: cannot switch to a different thread

cinder-backup[3262663]: /usr/lib/python3/dist-packages/cinder/db/sqlalchemy/models.py:152: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar sub>
cinder-backup[3262663]:   last_heartbeat = column_property(
cinder-backup[3262663]: /usr/lib/python3/dist-packages/cinder/db/sqlalchemy/models.py:160: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar sub>
cinder-backup[3262663]:   num_hosts = column_property(
cinder-backup[3262663]: /usr/lib/python3/dist-packages/cinder/db/sqlalchemy/models.py:169: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar sub>
cinder-backup[3262663]:   num_down_hosts = column_property(
cinder-backup[3262663]: Traceback (most recent call last):
cinder-backup[3262663]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
cinder-backup[3262663]:     timer()
cinder-backup[3262663]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
cinder-backup[3262663]:     cb(*args, **kw)
cinder-backup[3262663]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
cinder-backup[3262663]:     waiter.switch()
cinder-backup[3262663]: greenlet.error: cannot switch to a different thread

I understand that a fix was made, at least in oslo.log (https://review.opendev.org/c/openstack/oslo.log/+/852443), which is available with >5.3.x, i.e. the Bobcat release, and avoids running into this issue when logging.

Apart from fixing the root cause in eventlet (if that is even possible or attempted) or OpenStack migrating to asyncio (https://review.opendev.org/c/openstack/governance/+/902585), I was simply wondering:

a) Can the Python process be made to crash / exit properly? Currently a process that runs into this issue becomes somewhat of a zombie. This makes recognizing this condition and triggering a restart (systemd, container runtime, ...) much more difficult.

b) Could any more details / logs be produced with this traceback to allow finding and fixing the calls that lead to the greenlet.error in the first place? I suppose there are more reasons this can happen than the one in oslo.log?

@4383
Member

4383 commented Mar 7, 2024

Hello @frittentheke,

Concerning "b", from a short-term perspective, I think one way to retrieve this kind of detail would be to use eventlet's debug module (https://eventlet.readthedocs.io/en/latest/modules/debug.html) and, maybe, once my patch is merged and released (#926), to start an eventlet interactive backdoor on the process in trouble and see what happens in the hub.

As this problem feels like a race condition, another short-term option would be to use the ebpf/bcc deadlock tool to identify where the deadlock is located: https://github.com/iovisor/bcc/blob/master/tools/deadlock_example.txt

Concerning "a", I have no answer for now. I'll come back later if I have anything to share with you on that point.

@frittentheke

Concerning "b", in short term perspective, I think a way to retrieve this kind of details could be to use the debug module of eventlet (https://eventlet.readthedocs.io/en/latest/modules/debug.html) and even maybe, when my patch will be merged and released (#926), to start an eventlet interactive backdoor on the process in trouble, and see what happens in the hub.
As this problem feels like a race condition issue, another short term option would be to use the ebpf/bcc deadlock module to identify where the deadlock is located. https://github.com/iovisor/bcc/blob/master/tools/deadlock_example.txt

Thanks @4383 for those ideas. While I understand the goal is to replace eventlet, it's still somewhat important to have it produce more debug information out of the box when reading this "deadlock" state. How else would someone be able to fix certain usage pattern if there is no indication which code paths caused it.

Concerning "a" for now I've no response. I'll back later if I've things to share with you concerning that point.

That would be awesome. Having processes or whole components not fail cleanly is the worst in distributed systems :-)

@4383
Member

4383 commented Mar 8, 2024

See if you can start an eventlet backdoor: https://eventlet.readthedocs.io/en/latest/modules/backdoor.html

It would require a process restart, and unfortunately I think you will lose the context of the bug, but you can wait until you reproduce it and then jump into that backdoor for further investigation.
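A minimal example of embedding the backdoor, following the eventlet docs (the port is arbitrary):

import eventlet
from eventlet import backdoor

# Serve a Python prompt on localhost:3000; once the process wedges,
# `telnet localhost 3000` and inspect greenthreads and the hub from there.
eventlet.spawn(backdoor.backdoor_server, eventlet.listen(('localhost', 3000)))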

Our new maintenance policy is not against adding some debug capabilities. If you find useful info and new debug opportunities, do not hesitate to propose a patch to share it with the community. We will be happy to review it and to propose it "out of the box".

On my side I'll try to find some spare time to play with the initial reproducer and see if it is possible to increase the debug details to help developers catch this kind of bug. But I must admit that's not my top priority for now.

openstack-mirroring pushed a commit to openstack/oslo.log that referenced this issue Mar 8, 2024
There is a bug in eventlet where logging within a native thread can lead
to a deadlock situation: eventlet/eventlet#432

When encountered with this issue some projects in OpenStack using
oslo.log, eg. Cinder, resolve them by removing any logging withing
native threads.

There is actually a better approach. The Swift team came up with a
solution a long time ago [1], and in this patch that fix is included as
part of the setup method, but will only be run if the eventlet library
has already been loaded.

This patch adds the eventlet library as a testing dependency for the
PipeMutext unit tests.

[1]: https://opendev.org/openstack/swift/commit/69c715c505cf9e5df29dc1dff2fa1a4847471cb6

Closes-Bug: #1983863
Change-Id: Iac1b0891ae584ce4b95964e6cdc0ff2483a4e57d
(cherry picked from commit 94b9dc3)
@frittentheke

@4383 thanks again for your time and help!

Our new maintenance policy is not against adding some debug capabilities. If you find useful info and new debug opportunities, do not hesitate to propose a patch to share it with the community. We will be happy to review it and to propose it "out of the box".

I would if I knew more about how eventlet works.

On my side I'll try to find some spare time to play with the initial reproducer and see if it is possible to increase the debug details to help developers catch this kind of bug. But I must admit that's not my top priority for now.

That would be awesome. It would be great if you could have a look at my comment https://bugs.launchpad.net/octavia/+bug/2039346/comments/14 about all the OpenStack daemons we have throwing these greenlet.error: cannot switch to a different thread errors.

That Launchpad bug is about an issue in oslo.log which apparently should not even exist in our Yoga release installation,
so we are again clueless about what the cause could be.

@4383
Member

4383 commented Mar 27, 2024

@frittentheke: oh ok. So I didn't make the link between this oslo.log problem and your gthread problem. I don't know why oslo.log is not fixed on zed.

Just to be sure, you observed this behavior on yoga, correct?

@4383
Member

4383 commented Mar 27, 2024

Well, I think OpenStack has several problems here:

  1. oslo.log lacks the original fix on zed and yoga (https://opendev.org/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17), which leads you to observe the cannot switch nonsense.
  2. If this fix is applied to these stable branches, it should be followed by another backport of https://review.opendev.org/c/openstack/oslo.log/+/914190 to these stable branches too.

In other words, you are actually suffering from incomplete backports.

@4383
Member

4383 commented Mar 27, 2024

@frittentheke: I'd suggest reaching out to Daniel (damani) or Takashi (tkajinam) on the OpenStack oslo channel. I think they would be happy to help you finalize these incomplete backports.

@frittentheke

Just to be sure, you observed this behavior on yoga, correct?

Yes @4383, we run Yoga using Ubuntu Cloud Archive packages on 22.04 LTS.

But according to Takashi in https://bugs.launchpad.net/octavia/+bug/2039346/comments/10 the issue should not exist on Zed? Or is he mistaken and these fixes actually have to be backported further?

@4383
Member

4383 commented Mar 27, 2024

I think the problem is that zed and yoga do not contain (at least) https://opendev.org/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17

The other patch fixes another issue introduced by that same commit. But in any case, IMO we need this fix and its follow-up fixes.

@SeanMooney

SeanMooney commented Mar 27, 2024

While that will work on the epolls hub, it won't work on the asyncio hub, as this is not supported: https://review.opendev.org/c/openstack/oslo.log/+/852443/1/oslo_log/pipe_mutex.py#60

Calling eventlet.debug.hub_prevent_multiple_readers(False) raises

RuntimeError("Multiple readers are not yet supported by asyncio hub")

So while a backport could be useful for older releases like zed and yoga,

the PipeMutex implementation in oslo.log is going to need to be updated for the new asyncio hub.

You can see an example of the failure message you will get if you try to use both together:

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_130/914108/5/check/tempest-full-py3/130576a/controller/logs/screen-n-api.txt

@4383
Member

4383 commented Mar 28, 2024

From an eventlet perspective we won't support that "multiple readers" nonsense (in the asyncio hub):

#874

OpenStack deliverables may have to consider using dup() on the file descriptor (https://www.man7.org/linux/man-pages/man2/dup.2.html).

@4383
Member

4383 commented Mar 28, 2024

IMO this "multiple readers" feature comes from a bad design.
If Openstack deliverables are migrated to asyncio, the design will differ and I don't think we will have to rely on such functionality.

If oslo.log is migrated is async design would be refactored:

Surely would allowing us to bypass this "multiple readers" things.

@4383
Member

4383 commented Mar 28, 2024

I'd rather suggest that we rely on socket.fromfd and os.dup to move away from the "multiple readers" usage in OpenStack, or something like that.
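If I read the suggestion right, the rough idea would be something like the sketch below (purely illustrative, not an actual oslo.log proposal): each waiter dups its own descriptor, so no two greenthreads ever register a reader on the same fd.

import os

rfd, wfd = os.pipe()


def private_read_fd():
    # os.dup() returns a new fd number that refers to the same pipe, so each
    # greenthread can wait on its own descriptor and the hub never sees two
    # readers registered on a single fd.
    return os.dup(rfd)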

@SeanMooney

Just keep in mind that the migration to asyncio will take 3-4 releases, and during that time we will need to support running with either hub. There has not been a community agreement to move to explicit async yet either.

If using dup and socket.fromfd can be hidden within oslo.log, that is fine, but if it would require changes to the projects that use oslo.log, that's problematic.

@frittentheke

There has not been a community agreement to move to explicit async yet either.

Discussion about this is at https://review.opendev.org/c/openstack/governance/+/902585

@SeanMooney

Yep, but that has not been approved and it may be rejected. There is a lot of work that needs to be done to socialise that proposal and get buy-in from all the projects that currently use eventlet. It's unlikely that projects like nova will invest time in adopting explicit async in 2024.2 until we have had time to consider the detailed implementation aspects for our project. I may do some PoCs, but one of the themes for this cycle's PTG is likely to be completing ongoing work from last cycle and focusing on maintenance and tech debt. Changing the threading model does not fit with that theme.
With that said, I'm looking forward to discussing this at the PTG.

@4383
Member

4383 commented Mar 28, 2024

Just keep in mind that the migration to asyncio will take 3-4 releases, and during that time we will need to support running with either hub. There has not been a community agreement to move to explicit async yet either.

If using dup and socket.fromfd can be hidden within oslo.log, that is fine, but if it would require changes to the projects that use oslo.log, that's problematic.

I think we are all aware that the migration will take a couple of OpenStack releases, possibly even more than 4...

If the "multiple readers" hack comes from libraries, like oslo.log, then I think it should also be possible to remove that hack at the library level. It would also be possible to implement a kind of log feeder thread, as Dan suggested on IRC, hence allowing the top layers to enable the asyncio hub. This is the oslo.log use case.

Else, if the "multiple readers" hack is located at the service level, then that service could remain on the epolls hub, giving time to solve the problem at the service level. This is the Swift use case.

@4383
Member

4383 commented Mar 28, 2024

Concerning the PTG, I won't be around during this period, so if a discussion happens I won't join it. Feel free to start one; I can follow it asynchronously...

Concerning myself, I'm not convinced that a face-to-face discussion allows for a better and more efficient exchange than one made through write-ups and proposals. Written exchanges leave more room for understanding and thinking. That's my point of view.

@4383
Member

4383 commented Mar 28, 2024

The writings remain, the words fly away...
