
Make sure handler.flush() doesn't deadlock. #1112

Merged · 2 commits · Mar 29, 2022

Conversation

gukoff
Contributor

@gukoff gukoff commented Mar 15, 2022

Currently, flush() deadlocks during process termination if there are any unsent messages in the queue.

This is because atexit first calls handler.close() and then logging.shutdown(), which in turn calls handler.flush() without arguments. That is, handler.close() kills the worker, and then handler.flush() waits forever for the dead worker to send the messages from the queue.

Stacktrace dump by py-spy of the application in a deadlock:

Thread 8900 (idle): "MainThread"
    wait (threading.py:296)
    wait (threading.py:552)
    wait (opencensus\common\schedule\__init__.py:75)
    flush (opencensus\common\schedule\__init__.py:127)
    flush (opencensus\ext\azure\log_exporter\__init__.py:109)
    shutdown (logging\__init__.py:2036)

After this change, the deadlock is still possible if another thread concurrently closes the handler during the flush. However, this scenario is much less likely.
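The essence of the fix can be sketched with a toy handler (the names here are illustrative, not the actual opencensus API): flush() bails out early when the worker is already gone or the queue is empty, instead of waiting on a queue that nobody will ever drain.

```python
import queue

class ToyHandler:
    """Illustrative stand-in for the exporter handler (not the real API)."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker_alive = True

    def emit(self, record):
        self._queue.put(record)

    def close(self):
        # Mimics close(): the background worker stops draining the queue.
        self._worker_alive = False

    def flush(self, timeout=None):
        # Guard: if the worker is dead (or nothing is queued), waiting
        # would block forever -- return immediately instead.
        if not self._worker_alive or self._queue.empty():
            return
        # ... otherwise wait up to `timeout` for the worker to drain.

handler = ToyHandler()
handler.emit("unsent message")
handler.close()
handler.flush()  # returns immediately instead of deadlocking
```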

Member

@aabmass aabmass left a comment


LGTM, I'll let @lzchen merge this since it's azure related. Thanks for the PR and description 🙂

@lzchen
Contributor

lzchen commented Mar 21, 2022

This is because atexit first calls handler.close() and then logging.shutdown(), that in turn calls handler.flush() without arguments

I might be missing something, but where does logging.shutdown() get called after handler.close()?

@gukoff
Contributor Author

gukoff commented Mar 21, 2022

@lzchen here:
https://github.com/python/cpython/blob/main/Lib/logging/__init__.py#L2201-L2203

This code runs when logging is imported, which registers logging's shutdown hook first. atexit calls hooks in reverse registration order, so the handler's own hook (registered later, in its constructor) runs before logging's shutdown hook.
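The reverse ordering is easy to verify in isolation (a generic Python demo, unrelated to the opencensus code itself):

```python
import subprocess
import sys

# atexit runs hooks in reverse registration order. logging registers its
# shutdown hook at import time, so a hook registered later (e.g. in a
# handler's constructor) fires first at interpreter exit.
code = """
import atexit
atexit.register(lambda: print("registered first (e.g. logging.shutdown)"))
atexit.register(lambda: print("registered second (e.g. handler.close)"))
"""
out = subprocess.run([sys.executable, "-c", code],
                     capture_output=True, text=True).stdout
print(out)
```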

@lzchen
Contributor

lzchen commented Mar 21, 2022

@gukoff
I see. This is a great find. It looks like the logic in the BaseLogHandler was written under the assumption that it would be responsible for the shutdown logic. However, the logging library itself has logic to do that, just without a timeout. This might be a bit unrelated to your change, but should we simply remove the atexit hook in the Azure handler and leave the responsibility to logging? logging already calls handler.close(), so in the current code close() is actually called twice, it seems.

@gukoff
Contributor Author

gukoff commented Mar 21, 2022

Because shutdown() calls close() without arguments, while the custom atexit hook registered in the constructor calls it with the grace period, this would mean either:

  1. not using grace_period on close, or
  2. changing the behavior of close() without arguments to wait only for grace_period instead of indefinitely.

I thought about this option too, but didn't want to introduce a breaking change to close(timeout=None).

Now that I think about it, we could make such a change non-breaking with the sentinel pattern:

_sentinel = object()

...

def close(self, timeout=_sentinel):
    if timeout is _sentinel:  # no argument passed -> close with the default grace_period
        timeout = self.options.grace_period
    ...

What do you think? I don't have a preference.
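The sentinel pattern above can be shown working end to end with a self-contained toy (the grace_period attribute stands in for the real self.options.grace_period): only a true no-argument call falls back to the default, while an explicit close(timeout=None) still means "wait indefinitely".

```python
_SENTINEL = object()

class ToyHandler:
    """Toy illustration of the sentinel pattern; not the real handler."""

    grace_period = 5.0   # stand-in for self.options.grace_period
    last_timeout = "unset"

    def close(self, timeout=_SENTINEL):
        # Distinguish "no argument passed" from an explicit timeout=None:
        # `timeout is _SENTINEL` is True only for the bare close() call.
        if timeout is _SENTINEL:
            timeout = self.grace_period
        self.last_timeout = timeout

h = ToyHandler()
h.close()              # bare call -> falls back to grace_period (5.0)
print(h.last_timeout)
h.close(timeout=None)  # explicit None preserved -> wait indefinitely
print(h.last_timeout)
```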

@lzchen
Contributor

lzchen commented Mar 21, 2022

@gukoff
I am fine with not making a change to the current close() for now. The guard against the empty queue should be sufficient for this use case.

@lzchen
Contributor

lzchen commented Mar 29, 2022

@gukoff
Could you fix the build error so we can get this merged in? :)

@gukoff
Contributor Author

gukoff commented Mar 29, 2022

@gukoff Could you fix the build error so we can get this merged in? :)

If I'm reading the CI log correctly, py39-bandit failed on a piece of code unrelated to this PR, which was fixed in this commit.

Try rerunning CI checks? ;)

@lzchen lzchen closed this Mar 29, 2022
@lzchen lzchen reopened this Mar 29, 2022
@lzchen lzchen merged commit 9ffa48a into census-instrumentation:master Mar 29, 2022
@gukoff
Contributor Author

gukoff commented Apr 12, 2022

@lzchen would it be possible to release a new version with this fix in it?

@lzchen
Contributor

lzchen commented Apr 12, 2022

@gukoff
Will be releasing some time this week :)

Labels
azure Microsoft Azure