
Make sure handler.flush() doesn't deadlock. #1112

Merged · 2 commits · Mar 29, 2022

Conversation

gukoff
Contributor

@gukoff gukoff commented Mar 15, 2022

Currently, flush() deadlocks during process termination if there are any unsent messages in the queue.

This is because atexit first calls handler.close() and then logging.shutdown(), which in turn calls handler.flush() without arguments. That is, handler.close() kills the worker, and then handler.flush() waits forever for the dead worker to send the messages from the queue.

Stacktrace dump by py-spy of the application in a deadlock:

Thread 8900 (idle): "MainThread"
    wait (threading.py:296)
    wait (threading.py:552)
    wait (opencensus\common\schedule\__init__.py:75)
    flush (opencensus\common\schedule\__init__.py:127)
    flush (opencensus\ext\azure\log_exporter\__init__.py:109)
    shutdown (logging\__init__.py:2036)

After this change, the deadlock is still possible if another thread concurrently closes the handler during the flush. However, this scenario is much less likely.
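The essence of the fix can be sketched with a toy handler (the names here are illustrative, not the actual opencensus API): flush() bails out early when the worker is already gone or the queue is empty, instead of waiting on a queue that nobody will ever drain.

```python
import queue

class ToyHandler:
    """Illustrative stand-in for the exporter handler (not the real API)."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker_alive = True

    def emit(self, record):
        self._queue.put(record)

    def close(self):
        # Mimics close(): the background worker stops draining the queue.
        self._worker_alive = False

    def flush(self, timeout=None):
        # Guard: if the worker is dead (or nothing is queued), waiting
        # would block forever -- return immediately instead.
        if not self._worker_alive or self._queue.empty():
            return
        # ... otherwise wait up to `timeout` for the worker to drain.

handler = ToyHandler()
handler.emit("unsent message")
handler.close()
handler.flush()  # returns immediately instead of deadlocking
```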

Member

@aabmass aabmass left a comment


LGTM, I'll let @lzchen merge this since it's azure related. Thanks for the PR and description 🙂

@lzchen
Contributor

lzchen commented Mar 21, 2022

This is because atexit first calls handler.close() and then logging.shutdown(), that in turn calls handler.flush() without arguments

I might be missing something, but where does logging.shutdown() get called after handler.close()?

@gukoff
Contributor Author

gukoff commented Mar 21, 2022

@lzchen here:
https://github.com/python/cpython/blob/main/Lib/logging/__init__.py#L2201-L2203

This code runs when logging is imported, which registers logging's shutdown hook first. atexit calls hooks in reverse registration order, so the handler's own hook (registered later, in its constructor) runs before logging's shutdown hook.
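The reverse ordering is easy to verify in isolation (a generic Python demo, unrelated to the opencensus code itself):

```python
import subprocess
import sys

# atexit runs hooks in reverse registration order. logging registers its
# shutdown hook at import time, so a hook registered later (e.g. in a
# handler's constructor) fires first at interpreter exit.
code = """
import atexit
atexit.register(lambda: print("registered first (e.g. logging.shutdown)"))
atexit.register(lambda: print("registered second (e.g. handler.close)"))
"""
out = subprocess.run([sys.executable, "-c", code],
                     capture_output=True, text=True).stdout
print(out)
```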

@lzchen
Contributor

lzchen commented Mar 21, 2022

@gukoff
I see. This is a great find. It looks like the logic in the BaseLogHandler was written under the assumption that it would be responsible for the shutdown logic. However, the logging library itself has logic to do that, just without a timeout. This might be a bit unrelated to your change, but should we simply remove the atexit hook in the Azure handler and leave the responsibility to logging? logging already calls handler.close(), so in the current code close() is actually called twice, it seems.

@gukoff
Contributor Author

gukoff commented Mar 21, 2022

Because shutdown() calls close() without arguments, while the custom atexit hook registered in the constructor calls it with the grace period, this would mean either:

  1. not using grace_period on close, or
  2. changing the behavior of close() without arguments to wait only for grace_period instead of indefinitely.

I thought about this option too, but didn't want to introduce a breaking change to close(timeout=None).

Now that I think about it, we could make such a change non-breaking with the sentinel pattern:

_sentinel = object()

...

def close(self, timeout=_sentinel):
    if timeout is _sentinel:  # no argument passed -> close with the default grace_period
        timeout = self.options.grace_period
    ...

What do you think? I don't have a preference.
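The sentinel pattern above can be shown working end to end with a self-contained toy (the grace_period attribute stands in for the real self.options.grace_period): only a true no-argument call falls back to the default, while an explicit close(timeout=None) still means "wait indefinitely".

```python
_SENTINEL = object()

class ToyHandler:
    """Toy illustration of the sentinel pattern; not the real handler."""

    grace_period = 5.0   # stand-in for self.options.grace_period
    last_timeout = "unset"

    def close(self, timeout=_SENTINEL):
        # Distinguish "no argument passed" from an explicit timeout=None:
        # `timeout is _SENTINEL` is True only for the bare close() call.
        if timeout is _SENTINEL:
            timeout = self.grace_period
        self.last_timeout = timeout

h = ToyHandler()
h.close()              # bare call -> falls back to grace_period (5.0)
print(h.last_timeout)
h.close(timeout=None)  # explicit None preserved -> wait indefinitely
print(h.last_timeout)
```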

@lzchen
Contributor

lzchen commented Mar 21, 2022

@gukoff
I am fine with not making a change to the current close() for now. The guard against the empty queue should be sufficient for this use case.

@lzchen
Contributor

lzchen commented Mar 29, 2022

@gukoff
Could you fix the build error so we can get this merged in? :)

@gukoff
Contributor Author

gukoff commented Mar 29, 2022

@gukoff Could you fix the build error so we can get this merged in? :)

If I'm reading the CI log correctly, py39-bandit failed on a piece of code unrelated to this PR, which was fixed in this commit.

Try rerunning CI checks? ;)

@lzchen lzchen closed this Mar 29, 2022
@lzchen lzchen reopened this Mar 29, 2022
@lzchen lzchen merged commit 9ffa48a into census-instrumentation:master Mar 29, 2022
@gukoff
Contributor Author

gukoff commented Apr 12, 2022

@lzchen would it be possible to release a new version with this fix in it?

@lzchen
Contributor

lzchen commented Apr 12, 2022

@gukoff
Will be releasing some time this week :)

Labels
azure Microsoft Azure