Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close eventfd shutdown/wakeup race by closely tracking epoll edges #9586

Merged
merged 3 commits into from Sep 23, 2019

Conversation

normanmaurer
Copy link
Member

Motivation

This is another iteration of #9476.

Modifications

Instead of maintaining a count of all writes performed and then using
reads during shutdown to ensure all are accounted for, just set a flag
after each write and don't reset it until the corresponding event has
been returned from epoll_wait.

This requires that while a write is still pending we don't reset
wakenUp, i.e. continue to block writes from the wakeup() method.

Result

Race condition eliminated. Fixes #9362

Co-authored-by: Norman Maurer norman_maurer@apple.com

Motivation

This is another iteration of #9476.

Modifications

Instead of maintaining a count of all writes performed and then using
reads during shutdown to ensure all are accounted for, just set a flag
after each write and don't reset it until the corresponding event has
been returned from epoll_wait.

This requires that while a write is still pending we don't reset
wakenUp, i.e. continue to block writes from the wakeup() method.

Result

Race condition eliminated. Fixes #9362

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
@normanmaurer
Copy link
Member Author

@johnou FYI...

"re-reverted" another one...

@normanmaurer normanmaurer added this to the 4.1.42.Final milestone Sep 20, 2019

// Ordered store is sufficient here since the only access outside this
// thread is a getAndSet in the wakeup() method
wakenUp.lazySet(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was lazySet safe here based on what @njhill said the other day?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnou good question. To distil the problem, say we have fields

final AtomicInteger x = new AtomicInteger(1), y = new AtomicInteger(0);

Thread 1 program order:

x.lazySet(0);
if (y.get() == 0) { print("sleep"); }

Thread 2 program order:

y.compareAndSet(0, 1); // (always succeeds)
if (x.compareAndSet(0, 1)) { print("wake"); }

Is it possible for "sleep" to be printed without "wake" also being printed? I now think the answer is yes, even more so after reading this. Which means that lazySet isn't safe, but would be good to get confirmation from others... @franz1981?? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've said right: it isn't safe and set would be a better choice for a total ordered operation (among the others) :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@franz1981 please review the rest as well

@njhill
Copy link
Member

njhill commented Sep 20, 2019

I pushed a commit to change the lazySet to set (can revert based on other feedback if necessary). Also added a warning log message in the case we timeout when expecting to receive an imminent eventfd EPOLLIN event (suggested by @ejona86 the first time around but I forgot to add it then).

@njhill
Copy link
Member

njhill commented Sep 20, 2019

Just pushed another commit 8523b5a:

  • It's better not to touch the timerfd when doing the "timeboxed" wait and instead rely on epoll's own timeout (this is what the original version did but that was mingled with changes from the reverted timerfd-decoupling PRs).
  • Simplified the preexisting native epollWait method which resets timerfd - there's no need to read from it after it fires in ET mode

@normanmaurer would be good to sanity test again with these updates when you have a chance!

@normanmaurer
Copy link
Member Author

@netty-bot test this please

@normanmaurer
Copy link
Member Author

@njhill all works... just merged

@njhill
Copy link
Member

njhill commented Sep 24, 2019

@normanmaurer would you like me to port this and #9397 to master?

@normanmaurer
Copy link
Member Author

@njhill yep please

@normanmaurer normanmaurer deleted the epoll_eventfd_race branch September 24, 2019 16:04
njhill added a commit to njhill/netty that referenced this pull request Sep 25, 2019
(netty#9586)

Motivation

This is another iteration of netty#9476.

Modifications

Instead of maintaining a count of all writes performed and then using
reads during shutdown to ensure all are accounted for, just set a flag
after each write and don't reset it until the corresponding event has
been returned from epoll_wait.

This requires that while a write is still pending we don't reset
wakenUp, i.e. continue to block writes from the wakeup() method.

Result

Race condition eliminated. Fixes netty#9362


Co-authored-by: Norman Maurer <norman_maurer@apple.com>
@njhill
Copy link
Member

njhill commented Sep 25, 2019

@normanmaurer here is master-based commit for this one: 8ed2222, I see you had actually already done #9397.

@njhill
Copy link
Member

njhill commented Sep 26, 2019

@normanmaurer should I open a PR from the above or will you cherrypick?

@normanmaurer
Copy link
Member Author

@njhill pr please... thanks

njhill added a commit to njhill/netty that referenced this pull request Sep 27, 2019
Motivation

This is a vestige that was removed in the original PR netty#9535 before it
was reverted, but we missed it when re-applying in netty#9586.

It means there is a possible race condition because a wakeup event could
be missed while shutting down, but the consequences aren't serious since
there's a 1 second safeguard timeout when waiting for it.

Modification

Remove call to epollWaitNow() in EpollEventLoop#closeAll()

Result

Cleanup redundant code, avoid shutdown delay race condition
normanmaurer added a commit that referenced this pull request Sep 27, 2019
…9586) (#9612)


Motivation

This is another iteration of #9476.

Modifications

Instead of maintaining a count of all writes performed and then using
reads during shutdown to ensure all are accounted for, just set a flag
after each write and don't reset it until the corresponding event has
been returned from epoll_wait.

This requires that while a write is still pending we don't reset
wakenUp, i.e. continue to block writes from the wakeup() method.

Result

Race condition eliminated. Fixes #9362


Co-authored-by: Norman Maurer <norman_maurer@apple.com>
normanmaurer pushed a commit that referenced this pull request Oct 7, 2019
)

Motivation

This is a vestige that was removed in the original PR #9535 before it
was reverted, but we missed it when re-applying in #9586.

It means there is a possible race condition because a wakeup event could
be missed while shutting down, but the consequences aren't serious since
there's a 1 second safeguard timeout when waiting for it.

Modification

Remove call to epollWaitNow() in EpollEventLoop#closeAll()

Result

Cleanup redundant code, avoid shutdown delay race condition
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shutdown race in EpollEventLoop
4 participants