-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing: shutdown chanrouter correctly. #8497
base: master
Are you sure you want to change the base?
Conversation
Important Auto Review SkippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
But this is just part of the fix why the other node is not able to sync the graph to the chain, we definitely need to retry the blockfetch and not fail immediately if we cannot get the block from the first peer. This issue is already tracked in this issue: |
801e50f
to
3223097
Compare
3223097
to
85a52aa
Compare
Swapped the order when we add the cleanup |
routing/router.go
Outdated
@@ -539,6 +539,13 @@ func (r *ChannelRouter) Start() error { | |||
return nil | |||
} | |||
|
|||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this given the change of orders in the following commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree we can remove this now.
if err := s.createNewHiddenService(); err != nil { | ||
startErr = err | ||
return | ||
} | ||
cleanup = cleanup.add(s.torController.Stop) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's also the connMsgr
cleanup below
@@ -1847,173 +1847,174 @@ func (s *server) Start() error { | |||
cleanup := cleaner{} | |||
|
|||
s.start.Do(func() { | |||
cleanup = cleanup.add(s.customMessageServer.Stop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change is interesting...need to take a second look to see if this can be done safely, as there's some stop pattern that requires a certain state to be reached before it can stop, such as,
Lines 201 to 205 in acc5951
func (m *PingManager) Stop() error { | |
if m.pingTicker == nil { | |
return errors.New("PingManager cannot be stopped because it " + | |
"isn't running") | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine, because we just call the stop function, and if we cannot stop it we just print a warning and continue with the cleanup of the other subsystems. We cannot know at which state the startup might fail so therefore I think we should precautiously call the stop function.
85a52aa
to
643f56f
Compare
Make sure that each subsystem only starts and stop once. This makes sure we don't close e.g. quit channels twice.
This commit does two things. It starts up the server in a way that it can be interrupted and shutdown gracefully. Moreover it makes sure that subsystems clean themselves up when they fail to start. This makes sure that depending subsytems can shutdown gracefully as well and the shutdown process is not stuck.
f21e62c
to
8c831e5
Compare
@yyforyongyu while adding the interruptibility to the startup of the server, I figured out that we need to make sure that each stop call is atomic (only happens once) otherwise we first call it in the But I think when the tests pass the switch of the cleanup order should have no side effects and can prevent some cases where subsystems depend on each other and therefore cannot shutdown correctly in case on of them does not close the |
@ziggie1984, remember to re-request review from reviewers when ready |
Fixes #8489EDIT: Fixes #8721
So in the above linked issue, the channel graph could not be synced correctly so the ChanRouter:
...
so the 34 query failed and therefore the startup of the chanrouter failed as well.
We fail here and never call the
Stop
function of the channel router.https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L628
When cleaning up all the other subsystems we get stuck however:
because we don't close the quit channel of the channel router and therefore the
Authenticated Gossiper
cannot stop as well so the cleanup process is stuck holding up the shutdown of all subsystems, causing some sideeffects because other subsystems are still running.Goroutine Dump:
So we need to think how to prevent those situations, because I think we don't close the
quit channel
for almost all subsystems when the start fails.