SpvClient loses the connection chain and stops providing blocks #223

Open
mariocynicys opened this issue Jun 26, 2023 · 7 comments · May be fixed by #234

@mariocynicys (Collaborator)

This is a hard-to-reproduce issue, but what happens is that bitcoind prunes old blocks that haven't yet been delivered to the tower. As a result, the tower stops connecting blocks (i.e., stops watching).

Repro:

  • Keep your node shut down for a couple of days/weeks (yeah, really)
  • Start the node and the tower together (so that, at startup, the tower isn't already lagging by more than the prune depth); note that pruning must be enabled, see the config sketch below
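
For reference, a minimal bitcoind.conf sketch for the repro; pruning has to be enabled for bitcoind to discard old blocks at all (550 MiB is the minimum prune target bitcoind accepts):

    # Assumed repro config: enable pruning so old blocks get discarded.
    # 550 is the minimum prune target bitcoind accepts, in MiB.
    prune=550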

What will happen:

  • The tower will pass the 100-block check while bootstrapping, since those blocks haven't been pruned yet
  • bitcoind will become very slow to respond to RPC calls after some time, probably because it's busy validating blocks
  • The tower will hang waiting on RPC responses from bitcoind, while bitcoind moves ahead quickly enough that it prunes blocks the tower hasn't yet received

At this point spv_client.poll_best_tip will stop connecting blocks (blocks are connected sequentially; if one is missing, the later ones can't be connected), which is indicated by the boolean it returns.

    /// Polls for the best tip and updates the chain listener with any connected or disconnected
    /// blocks accordingly.
    ///
    /// Returns the best polled chain tip relative to the previous best known tip and whether any
    /// blocks were indeed connected or disconnected.
    pub async fn poll_best_tip(&mut self) -> BlockSourceResult<(ChainTip, bool)> { ... }

The tower will not get any blocks after this point, nor will it report errors.

Such an issue could also be triggered by losing internet connectivity for a long time, so it might be worth resolving it automatically rather than requiring manual intervention.
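
For illustration, here's a minimal sketch of how a caller could detect this stall from the values poll_best_tip returns. This is not the tower's actual code: the polling loop, STALL_THRESHOLD, and the log macros are assumptions.

    // Minimal stall-detection sketch. `spv_client` is an SpvClient as above;
    // STALL_THRESHOLD and the polling interval are made-up illustration values.
    use lightning_block_sync::poll::ChainTip;

    const STALL_THRESHOLD: u32 = 10;

    let mut stalled_polls = 0u32;
    loop {
        match spv_client.poll_best_tip().await {
            // The backend's tip advanced but nothing got connected: the blocks
            // in between may already be pruned and will never arrive.
            Ok((ChainTip::Better(_), false)) => {
                stalled_polls += 1;
                if stalled_polls >= STALL_THRESHOLD {
                    log::warn!("best tip keeps advancing but no blocks are being connected");
                }
            }
            // Any other outcome (blocks connected/disconnected, or no tip
            // change) means we are not silently stuck.
            Ok(_) => stalled_polls = 0,
            Err(e) => log::error!("polling error: {}", e.into_inner()),
        }
        tokio::time::sleep(std::time::Duration::from_secs(60)).await;
    }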

@mariocynicys (Collaborator, Author)

[screenshot: tower log output]

It's all "Updating best tip" logs after this point.
Note: this tower isn't the master one.

@mariocynicys mariocynicys added the good first issue Good for newcomers label Jun 26, 2023
@anipaul2 (Contributor)

Are there any existing error-handling mechanisms in place within the SpvClient or related components to handle scenarios where blocks are not delivered due to pruning?

@mariocynicys (Collaborator, Author)

> Are there any existing error-handling mechanisms in place within the SpvClient or related components to handle scenarios where blocks are not delivered due to pruning?

They don't treat it as an error per se, but they do report it back to the caller via the boolean in BlockSourceResult<(ChainTip, bool)>. So we should be able to recover from, or react to, that.

@anipaul2 (Contributor)

If the boolean value indicates that blocks were disconnected, can we retry fetching and connecting the missing blocks?

@mariocynicys (Collaborator, Author)

> If the boolean value indicates that blocks were disconnected, can we retry fetching and connecting the missing blocks?

The expected outcome here is blocks getting connected or disconnected; if either of those happens, the boolean will be true. The bad case is when the best tip is fetched but no blocks are connected or disconnected.
Retrying will probably do nothing since the blocks are already pruned. We can either report the issue to the user or automatically move the SPV client's tip forward to a non-pruned block, at the risk of not connecting all the blocks in between (see the sketch below).
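
A rough sketch of that second option, assuming the lightning-block-sync API quoted above (exact signatures vary a bit between versions); block_source, cache, and listener stand in for the tower's actual types:

    // Hypothetical recovery: abandon the stuck tip and latch onto the
    // backend's current best header. `block_source` is assumed to be the
    // tower's BlockSource impl and `listener` its chain::Listen impl.
    use bitcoin::Network;
    use lightning_block_sync::init::validate_best_block_header;
    use lightning_block_sync::poll::ChainPoller;
    use lightning_block_sync::{SpvClient, UnboundedCache};

    let new_tip = validate_best_block_header(&block_source).await?;
    let poller = ChainPoller::new(&block_source, Network::Bitcoin);
    let mut cache = UnboundedCache::new();
    // Blocks between the old tip and `new_tip` are skipped entirely: the
    // listener never sees them, which is exactly the risk mentioned above.
    let mut spv_client = SpvClient::new(new_tip, poller, &mut cache, &listener);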

@sr-gi (Member) commented Jun 27, 2023

We may be able to fix this by checking whether we are in IBD or not. bitcoind defaults to reporting that it is in IBD if the node is started while the chain tip is lagging more than 24h behind (i.e., the backend tip has stalled for over a day). This is only checked at bootstrap: once the flag latches to false, it won't flip back to true while running, even if all peers disconnect from us and we receive no data for longer than a day. That shouldn't be an issue for us, though.

We could either refuse to run in that case or wait until the backend catches up. IBD status is reported by getblockchaininfo, which we already happen to call when starting the tower in order to check which chain we're running on. We may need to update the wrapper to return both the chain and whether we are in IBD. Furthermore, we could add special handling for regtest or similar, given that the check may not be as relevant there and could trigger more often than not.
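
For illustration, the check could look something like the sketch below. It uses the bitcoincore-rpc crate for brevity; rust-teos has its own bitcoind wrapper, so the names here are assumptions, not the tower's actual API:

    // Sketch of an IBD gate at startup: refuse to poll (or just wait) while
    // bitcoind reports initialblockdownload = true in getblockchaininfo.
    use bitcoincore_rpc::{Client, RpcApi};

    fn wait_until_out_of_ibd(rpc: &Client) -> bitcoincore_rpc::Result<()> {
        loop {
            let info = rpc.get_blockchain_info()?;
            if !info.initial_block_download {
                return Ok(());
            }
            // Still in IBD: polling now risks racing against pruning.
            println!(
                "bitcoind is in IBD ({}/{} blocks validated), waiting...",
                info.blocks, info.headers
            );
            std::thread::sleep(std::time::Duration::from_secs(30));
        }
    }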

@sr-gi (Member) commented Jun 28, 2023

Here's a PoC for this: https://github.com/sr-gi/rust-teos/tree/ibd-abort. @mariocynicys if you still have a copy of the chain that was triggering this error, would you mind testing it out (assuming you're ok with the approach)?

@sr-gi linked a pull request (#234) on Jul 27, 2023 that will close this issue