Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinstate decoupled timerfd logic in EpollEventLoop #9590

Closed
wants to merge 1 commit into from

Conversation

njhill
Copy link
Member

@njhill njhill commented Sep 21, 2019

Motivation

This reinstates the parts of 1fa7a5e and a22d4ba that were reverted in 7f39142 (applies to EpollEventLoop class only), with some minor changes.

It sounds like this still exhibits some testsuite failures even with the lazySet fixes, but I thought it would be good to have a PR with the "latest" version to continue to test/debug.

Modification

Changes are the same as before apart a few added comments and the following in EpollEventLoop#checkScheduleTaskQueueForNewDelay:

  • Use set instead of lazySet to avoid race condition
  • Avoid arming timer and entering epoll wait if a task has already expired
  • Switch order of calling setTimerFd and updating nextDeadlineNanos within the sync block

The latter two are perf refinements and should not have contributed to the issues seen.

Result

Performance benefits from having timerfd updates decoupled from the event loop, as originally introduced by @Scottmitch

@netty-bot
Copy link

Can one of the admins verify this patch?

@njhill
Copy link
Member Author

njhill commented Sep 22, 2019

@normanmaurer I've attempted to stress this a bit and haven't managed to break it. Would be good if you could re-test with your own testsuite when you have time.

@normanmaurer
Copy link
Member

@njhill can you please rebase ?

@normanmaurer
Copy link
Member

@njhill also just re-run the test suite and have the same issue as before... I hope I will have some time this week to look into it but I wouldn't hold my breath for it

Motivation

This reinstates the parts of 1fa7a5e
and a22d4ba that were reverted in
7f39142 (applies to EpollEventLoop
class only), with some minor changes.

It sounds like this still exhibits some testsuite failures even with the
lazySet fixes, but I thought it would be good to have a PR with the
"latest" version to continue to test/debug.

Modification

Changes are the same as before apart a few added comments and the
following in EpollEventLoop#checkScheduleTaskQueueForNewDelay:
- Use set instead of lazySet to avoid race condition
- Avoid arming timer and entering epoll wait if a task has already
expired
- Switch order of calling setTimerFd and updating the nextDeadlineNanos
AtomicLong within the sync block

The latter two are perf refinements and should not have contributed to
the issues seen.

Result

Performance benefits from having timerfd updates decoupled from the
event loop, as originally introduced by @Scottmitch
@njhill njhill changed the base branch from epoll_eventfd_race to 4.1 September 23, 2019 17:51
@njhill
Copy link
Member Author

njhill commented Sep 23, 2019

@normanmaurer thanks for re-running the test, now rebased on 4.1, understand re the time to debug. Good that we've narrowed it at least.

I wonder if you could provide any other clues as to the nature of the failures/tests... do I understand correctly that it's essentially delayed tasks scheduled on the EL timing-out? Might it involve channel/EL shutdown by any chance?

@normanmaurer
Copy link
Member

@njhill is this still a thing after we merged all the other stuff ?

@njhill
Copy link
Member Author

njhill commented Oct 15, 2019

@normanmaurer yes there's some remaining stuff which was originally done in #7834:

  1. Execute expired scheduled tasks directly rather than first dumping them into the main task queue
  2. Get rid of IO ratio
  3. Make required timerfd updates directly instead of waking event loop, for scheduled tasks submitted while the EL is waiting

Maybe it would be better to deal with these individually, and open separate PRs for those we think still worthwhile?

I think we should at least do (1) ... the logic is actually already there (wasn't reverted) just not currently called. (2) would also be a nice simplification but I'm unsure whether there are any possible negative implications. We were going ahead with it before (until reverted for different reasons), so I guess it should be ok. There will still be a cap on how many times the task queue is processed before checking IO again.

The relative value of (3) I think is lower now that #9605 is merged, but it might still be beneficial since some syscalls should be saved in some specific circumstances. I'm not sure if those would now be too narrow to be worth it though. The downside is some additional complexity and probably some more expensive atomic operations on the hot paths (to support the added coordination needed).

Would be good to get thoughts from you and @Scottmitch and anyone else on these decisions...

@normanmaurer
Copy link
Member

@njhill lets go for 1) and let us just "dismiss" the rest for now.

@njhill njhill mentioned this pull request Nov 25, 2019
@netty-bot
Copy link

Can one of the admins verify this patch?

@normanmaurer
Copy link
Member

I will just close this for now ... If we want to pick this up again lets reopen a new pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants