Can't log from within tpool.execute #432
Comments
@smerritt thank you for this sad information. Please try to replace your custom PipeMutex with a lock proxied through tpool. The actual problem is, of course, green locks not working across OS threads.
@temoto That produces correct results, but the performance is not good. My usual process is a WSGI server, so a bunch of greenthreads on the main OS thread, and a few other OS threads hidden inside tpool.
Sorry, I didn't think enough before saying that. Of course it will block all other greenthreads. You know, tpool itself used a pipe for some time and then switched to a local socket connection. I think with a little ugly creativity you could leverage that synchronisation.
It's still as slow as a pipe, but at least you don't have to add a crutch PipeMutex to the code base.
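For reference, the tpool-proxied lock being discussed looks roughly like the sketch below. This is an illustration under assumptions, not code from this thread: it uses `eventlet.patcher.original()` to get the unpatched `threading` module and wraps a native RLock in `tpool.Proxy`, so each acquire/release runs on a native tpool worker.

```python
import eventlet
eventlet.monkey_patch()

import logging

from eventlet import patcher, tpool

# Use the *native* RLock class, not the green one installed by monkey_patch().
native_threading = patcher.original('threading')

handler = logging.StreamHandler()
# Every acquire()/release() on the proxy is shipped to a tpool worker thread,
# so greenthreads and native threads serialize on the same OS-level lock.
handler.lock = tpool.Proxy(native_threading.RLock())
logging.getLogger().addHandler(handler)
```

As the surrounding comments note, this gives correct results but is slow, and it can deadlock once the tpool runs out of worker threads.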
The proxied lock only works as long as you never run out of tpool threads. Imagine you've got only two tpool threads, A and B, plus the main thread. Thread A calls acquire() and takes the lock. The next work item also needs the lock, so thread B calls acquire() and blocks. Thread A finishes with the lock and calls release(), but that call also has to go through the tpool, and with both tpool threads occupied it can never run, so nothing ever releases the lock.
Yes, it's wasting tpool threads, but they're cheap and easy to increase.
True, but you can exhaust any finite tpool. Thread A has the lock. B wants it and calls acquire, which is proxied to tpool thread C: now A has the lock, B waits on C, and C waits on the lock. Then D wants it: A has the lock, B waits on C, C waits on the lock, D waits on E, and E waits on the lock. You can fill up the tpool with these pretty quickly. All it takes is one thread holding the lock for a long time.
@smerritt dear Sam, I'm confused; I would imagine A eventually releases the lock and the whole thing continues. It seems only harmful as starvation against other usages of the tpool. But considering that the tpool-proxied lock was a dirty workaround in the first place, this hardly deserves our time? The real solution is to make eventlet work across OS threads. Today I was thinking about how to implement that, and general greenlet multiplexing across OS threads is probably a bit too complex for now, but special treatment of the synchronisation primitives seems doable.
If all the tpool threads are occupied, then the proxied call to release() will never happen, or at least that's what I thought. Perhaps if acquire() is proxied but release() is not, then it would all work. You are, of course, correct that the answer is to make eventlet semaphores work across OS threads. I'm afraid I don't currently have any useful ideas to offer up in that domain.
Since change I1f1d9c0d6e3f04f1ecd5ef7c5d813005ee116409 we are running parts of the backups on native threads, which due to an eventlet bug [1] have bad interactions with greenthreads, so we have to avoid any logging when executing code in a native thread. This patch removes the MD5 logging in the SwiftObjectWriter close method and adds comments and a docstring referring to this limitation. [1] eventlet/eventlet#432 Closes-Bug: #1745168 Change-Id: I0857cecd7d8ab0ee7e3e9bd6e15f4987ede4d653
This patch sets the log level for the cinder backup process to WARNING because of a bug in eventlet, as described here: eventlet/eventlet#432. Cinder volume doesn't have this problem because it uses tooz locks everywhere. Change-Id: I96c1e61c442d9fd3ff2e016ede1b3b19ab4ba171
As of now there is no solution to the issue where a thread gets stuck in eventlet. A few other similar incidents, also without a proper solution: https://bugs.launchpad.net/cinder/+bug/1694509 eventlet/eventlet#432 eventlet/eventlet#492 eventlet/eventlet#395 Change-Id: Ib278780ccb20b9cbef50f54ba1a1ad33761c8002 closes-bug: #1742729
As of now there is no solution to the issue where a thread gets stuck in eventlet. A few other similar incidents, also without a proper solution: https://bugs.launchpad.net/cinder/+bug/1694509 eventlet/eventlet#432 eventlet/eventlet#492 eventlet/eventlet#395 Originally taken from: https://review.openstack.org/#/c/613023/1 Change-Id: Ic924f0ef0cb632b2439dfb7d1092bebf54adb863 closes-bug: #1742729
Is this still an open issue?
The reproduction script still fails, yes.
* Update oslo.log from branch 'master' to 94b9dc32ec1f52a582adbd97fe2847f7c87d6c17 - Fix logging in eventlet native threads: There is a bug in eventlet where logging within a native thread can lead to a deadlock situation: eventlet/eventlet#432. When faced with this issue, some projects in OpenStack using oslo.log, e.g. Cinder, resolve it by removing any logging within native threads. There is actually a better approach. The Swift team came up with a solution a long time ago [1], and in this patch that fix is included as part of the setup method, but it will only be run if the eventlet library has already been loaded. This patch adds the eventlet library as a testing dependency for the PipeMutex unit tests. [1]: https://opendev.org/openstack/swift/commit/69c715c505cf9e5df29dc1dff2fa1a4847471cb6 Closes-Bug: #1983863 Change-Id: Iac1b0891ae584ce4b95964e6cdc0ff2483a4e57d
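For readers who haven't opened the Swift commit: the idea is a mutex whose token lives in a kernel pipe, so greenthreads and native threads can contend for it without sharing an eventlet semaphore. The class below is only a rough sketch of that idea (it is neither Swift's nor oslo.log's actual code) and assumes an eventlet-patched process; note the `hub_prevent_multiple_readers(False)` call, which becomes relevant to the asyncio-hub discussion further down this thread.

```python
import errno
import fcntl
import os

import eventlet.debug
import eventlet.hubs


class PipeMutexSketch:
    """Rough sketch of a pipe-backed mutex in the spirit of Swift's PipeMutex."""

    def __init__(self):
        self.rfd, self.wfd = os.pipe()
        # Non-blocking read end, so a waiting greenthread can yield to the hub
        # instead of blocking its whole OS thread.
        flags = fcntl.fcntl(self.rfd, fcntl.F_GETFL)
        fcntl.fcntl(self.rfd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
        # Several greenthreads may end up waiting on the same read fd.
        eventlet.debug.hub_prevent_multiple_readers(False)
        os.write(self.wfd, b'-')  # one byte in the pipe == unlocked

    def acquire(self, blocking=True):
        while True:
            try:
                os.read(self.rfd, 1)  # whoever reads the byte owns the lock
                return True
            except OSError as err:
                if err.errno != errno.EAGAIN:
                    raise
                if not blocking:
                    return False
                # Wait until the pipe is readable again; each OS thread gets
                # its own hub, so native threads can park here as well.
                eventlet.hubs.trampoline(self.rfd, read=True)

    def release(self):
        os.write(self.wfd, b'-')  # put the byte back


# logging.Handler.acquire()/release() only ever need these two methods.
```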
Cinder services as deployed by the operator just hang and enter an unending loop of kill and restart due to the Liveness probes. What we see in the container logs differs between cinder-api and the other containers: in cinder-api we just see that it stops responding to requests, while in the other services we see this exception:

    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
        timer()
      File "/usr/lib/python3.9/site-packages/eventlet/hubs/timer.py", line 59, in __call__
        cb(*args, **kw)
      File "/usr/lib/python3.9/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
        waiter.switch()
    greenlet.error: cannot switch to a different thread

In both cases the issue is the same: there is some logging happening on a native thread and this is creating problems for eventlet, to the point where it hangs. This is a known bug in eventlet [1], one which I recently fixed in oslo.log [2]. Since this is not fixed in all OpenStack releases, certainly not the one this operator is currently using, we need to be careful with what we actually enable for logging.

The logging we currently have enables debugging for EVERYTHING (rabbit, sqlalchemy, oslo libraries, etc.), regardless of what we set in the `debug` option and `default_log_levels` in `cinder.conf`. This logging override is done via the `logging.conf` file and creates the problem of logging in native threads.

Using the `logging.conf` file also diverges from the approach we want for the Cinder Operator, where we try to keep the configuration of the Cinder services as close as possible to a manual Cinder service configuration. This patch removes the usage of the `logging.conf` file by the operator and uses the `cinder.conf` template to set the right logging configuration options.

We set `log_file = /dev/stdout` in `cinder.conf` instead of the usual empty `log_file =` because otherwise Cinder services would log to `stderr`, making `httpd` on the cinder-api container treat all Cinder-API logs as errors and prepend additional information to every single cinder log message, like this:

    [Thu Sep 08 08:21:36.404638 2022] [wsgi:error] [pid 15:tid 69] (sqlalchemy.orm.mapper.Mapper): 2022-09-08 08:21:36,404 INFO

References to the `logging.conf` file have been removed from CRD descriptions and other code locations.

[1]: eventlet/eventlet#432
[2]: https://review.opendev.org/c/openstack/oslo.log/+/852443
Still an issue. I'm running into this with loguru when running Flask under eventlet.
Fixes a hanging thread due to eventlet/eventlet#432, which may get fixed for oslo.log in 5.0.1 with openstack/oslo.log@94b9dc3 (at the time of writing, master in the antelope cycle is constrained to 5.0.0).
We are regularly running into this issue with different OpenStack components:
I understand that there was a fix done, at least to oslo.log. Apart from fixing the root cause (if that is even possible or attempted) in eventlet, or OpenStack migrating to asyncio (https://review.opendev.org/c/openstack/governance/+/902585), I was simply wondering:
a) Can the Python process be made to crash / exit properly? Currently a process that runs into this issue becomes somewhat of a zombie. This makes recognizing this condition and triggering a restart (systemd, container runtime, ...) much more difficult.
b) Are there any more details / logs that could be produced with this traceback to allow finding and fixing the calls that lead to the greenlet.error in the first place? I suppose there are more reasons this can happen than the one in oslo.log?
Hello @frittentheke. Concerning "b": in the short term, I think this kind of detail could be retrieved with the right tooling. As this problem feels like a race-condition issue, another short-term option would be to use ebpf/bcc. Concerning "a", for now I have no answer; I'll come back later if I have anything to share with you on that point.
Thanks @4383 for those ideas. While I understand the goal is to replace eventlet, it's still somewhat important to have it produce more debug information out of the box when it ends up in this "deadlock" state. How else would someone be able to fix a particular usage pattern if there is no indication of which code paths caused it?
That would be awesome. Having processes or whole components not fail cleanly is the worst in distributed systems :-)
See if you can start an eventlet backdoor: https://eventlet.readthedocs.io/en/latest/modules/backdoor.html. It would require a process restart, and unfortunately I think you will lose the context of the bug, but you can wait to see if you reproduce it and then jump into that backdoor for further investigation. Our new maintenance policy is not against adding some debug capabilities. If you find useful info and new debug opportunities, do not hesitate to propose a patch to share it with the community. We will be happy to review it and to provide it "out of the box". On my side, I'll try to find some spare time to play with the initial reproducer and see if it is possible to increase the debug details to help developers catch this kind of bug. But I should admit that's not my top priority for now.
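For anyone who hasn't used the backdoor module, enabling it looks roughly like this (host and port are just example choices, following the linked docs):

```python
import eventlet
from eventlet import backdoor

# Serve a Python REPL on a local TCP port from inside the running process;
# connect later with e.g. `telnet localhost 3000` to poke at hub state,
# greenthreads, and module globals.
eventlet.spawn(backdoor.backdoor_server, eventlet.listen(('localhost', 3000)))
```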
@4383 thanks again for your time and help!
I would if I knew more about how eventlet works.
That would be awesome. If you could kindly have a look at my comment https://bugs.launchpad.net/octavia/+bug/2039346/comments/14 about all the OpenStack daemons we have throwing these errors. That launchpad bug is about an issue in oslo.log which apparently should not even exist in our Yoga release installation.
@frittentheke: oh ok. So I didn't make the link between this oslo.log problem and your gthread problem. I don't know why oslo.log is not fixed on zed. Just to be sure, you observed this behavior on yoga, correct?
Well, I think OpenStack has several problems here:
In other words, you actually suffer from incomplete backports.
@frittentheke: I'd suggest you reach out to Daniel (damani) or Takashi (tkajinam) on the OpenStack oslo channel. I think they would be happy to help you finalize these incomplete backports.
Yes @4383, we run Yoga using Ubuntu Cloud Archive packages on 22.04 LTS. But according to Takashi in https://bugs.launchpad.net/octavia/+bug/2039346/comments/10 the issue should not exist on Zed? Or is he mistaken, and do these fixes actually have to be backported further?
I think the problem is that zed and yoga do not contain (at least) https://opendev.org/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17. The other patch is the fix for another issue introduced by https://opendev.org/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17 (the same patch). But in any case, IMO we need this fix and its follow-up fixes (the children).
So, in order, I think these patches should be backported to zed and yoga:
While that will work on the epolls hub, it won't work on the asyncio hub, as calling eventlet.debug.hub_prevent_multiple_readers(False) raises RuntimeError("Multiple readers are not yet supported by asyncio hub"). So while a backport could be useful for older releases like zed and yoga, the PipeMutex implementation in oslo.log is going to need to be updated for the new asyncio hub. You can see an example of the failure message you will get if you try to use both together.
From an eventlet perspective we won't support that "multiple readers" nonsense in the asyncio hub; OpenStack deliverables may have to consider using a different mechanism.
IMO this "multiple readers" feature comes from a bad design. If oslo.log is migrated is async design would be refactored:
Surely would allowing us to bypass this "multiple readers" things. |
I'd rather suggest that we rely on dup and socket.fromfd:
Or something like that.
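Purely as a hypothetical illustration of that direction (the helper names below are made up, and this is not a proposal from the thread): each waiter gets its own duplicated descriptor via `socket.fromfd()`, so no two readers ever register the same fd with the hub.

```python
import socket

# Token passed over a socketpair instead of a pipe. socket.fromfd() dup()s
# the fd, so every waiter watches its own copy of the receiving end.
_recv_sock, _send_sock = socket.socketpair()
_send_sock.send(b'-')  # one token in flight == unlocked


def acquire():
    # Each call waits on its own dup of the receiving end; under eventlet
    # monkey-patching this recv() is green and yields to the hub.
    waiter = socket.fromfd(_recv_sock.fileno(), socket.AF_UNIX,
                           socket.SOCK_STREAM)
    try:
        waiter.recv(1)  # whoever reads the token owns the lock
    finally:
        waiter.close()


def release():
    _send_sock.send(b'-')  # hand the token back
```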
Just keep in mind that the migration to asyncio will take 3-4 releases, and during that time we will need to support running with either hub. There has not been a community agreement to move to explicit async yet either. If using dup and socket.fromfd can be hidden within oslo.log, that is fine, but if it would require changes to the projects that use oslo.log, that's problematic.
Discussion about this is at https://review.opendev.org/c/openstack/governance/+/902585
Yep, but that has not been approved and it may be rejected. There is a lot of work that needs to be done to socialise that proposal and get buy-in from all the projects that currently use eventlet. It's unlikely that projects like nova will invest time in adopting explicit async in 2024.2 until we have time to consider the detailed implementation aspects for our project. I may do some PoCs, but one of the themes for this cycle's PTG is likely to be completing ongoing work from last cycle and focusing on maintenance and tech debt. Changing the threading model does not fit with that theme.
I think we are all aware that the migration will take a couple of OpenStack releases, possibly even more than 4. If the "multiple readers" hack comes from libs like oslo.log, then I think it should also be possible to remove that hack at the lib level. It would also be possible to implement a kind of log-feeder thread, as Dan suggested on IRC, hence allowing the top layers to enable the asyncio hub. This is the oslo.log use case. Otherwise, if the "multiple readers" hack is located at the service level, then this service could keep using the epolls hub, giving time to solve the problem at the service level. This is the swift use case.
Concerning the PTG, I won't be around during that period, so if a discussion happens I won't be able to join it. Feel free to trigger one; I could follow it asynchronously. Concerning myself, I'm not convinced that a face-to-face discussion would be better or more efficient than one made through write-ups and proposals. Written exchanges leave more room for better understanding and thinking. That's my point of view.
The writings remain, the words fly away...
If you try to log from within a function called by `tpool.execute`, there is a chance that the tpool thread never returns, and you see a stack trace ending in `greenlet.error: cannot switch to a different thread`. This is because many logging handlers have mutexes that are `threading._RLock` objects, and Eventlet's replacement for `thread.allocate_lock` returns an `eventlet.semaphore.Semaphore` object, which does not work across different hubs in different pthreads.
Here's a small script to reproduce the issue:
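The original script isn't reproduced here; the following is a minimal sketch in the same spirit, assuming a monkey-patched process in which greenthreads and `tpool.execute` workers both log through the same handler:

```python
import eventlet
eventlet.monkey_patch()

import logging

from eventlet import tpool

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("repro")


def work(i):
    # Handler.emit() takes the handler lock; after monkey-patching that lock
    # is an eventlet Semaphore, which is not safe to use from this native
    # tpool thread.
    log.info("working on item %d", i)
    return i


def main():
    pile = eventlet.GreenPile()
    for i in range(1000):
        pile.spawn(tpool.execute, work, i)
        log.info("spawned %d", i)  # log from the green side as well
    for _ in pile:
        pass


if __name__ == "__main__":
    main()
```

On affected versions this tends to hang, or to raise `greenlet.error: cannot switch to a different thread`.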
The bug was originally found in OpenStack Swift; there's a better writeup at https://bugs.launchpad.net/swift/+bug/1710328 and a commit at openstack/swift@6d16079 that fixes the problem, but only for Swift, and not in a general way.
I'd like to figure out how to fix this in general, but I'm not sure how to proceed. Using a pipe-based mutex for all `_RLock` objects would work, but would be a very expensive fix. Perhaps just the locks in logging handlers, because those are global?
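Sketching that last idea (patch only the logging handler locks): the hook below is illustrative rather than the eventual Swift/oslo.log code, and the mutex factory you pass in is assumed to be something cross-thread safe, such as the pipe-based mutex sketched earlier in this thread.

```python
import logging


def patch_logging_handler_locks(mutex_factory):
    """Make every logging handler use mutex_factory() instead of an RLock.

    mutex_factory must return an object with acquire()/release() that is
    safe across greenthreads and native threads.
    """
    def createLock(self):
        self.lock = mutex_factory()

    logging.Handler.createLock = createLock

    # Handlers created before the patch still hold green locks; swap them
    # too. (logging._handlerList is a private list of weakrefs to handlers.)
    for handler_ref in logging._handlerList:
        handler = handler_ref()
        if handler is not None:
            handler.createLock()
```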