-
Notifications
You must be signed in to change notification settings - Fork 879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with Bootstrappers Lagging Behind During Badger Compaction #3300
Comments
Are you sure there aren't also the same compactor logs when celestia-node is running normally? The compactor is usually running hard all the time. If the lag was compactor related, I would expect to see the log "L0 was stalled", but I can't see any here. |
For another badger compaction event, I can't see a lag. |
So if consensus nodes are not reachable, the bridges lag behind, meaning there is no issue in node. @smuu, lets close it then |
Sorry for the confusion. I was adding another report where I thought it is related, but it's not. I deleted this comments. It does not change my original report. When consensus is not reachable, all bridge nodes lagg behind the. The issue I was reporting is about only one bootstrapper is lagging behind at a time. |
I reviewed the log you attached, and it shows that the node was slow during the fetch+store operation at the time of lagging. I examined the Store.Put metrics for that node and found that Put() times were over 10 seconds on average during that period, suggesting even higher peak values due to it being a moving average metric. The logs indicate that Badger compaction was running in the background, which likely affected the Put() performance. Given that compaction typically doesn't cause major lag and resolves after completion, I suggest: Increasing alert values to a >7 block delay for 2 minutes. Also, add instructions to the alert to examine Put() times as they could indicate a Badger-related issue. As for badger issue there are 2 upcoming things that will help with it:
|
Celestia Node version
v0.13.2
OS
docker/kubernetes
Install tools
No response
Others
No response
Steps to reproduce it
Expected result
Bootstrappers should remain within a close range of the current block height, not lagging behind by more than one or two blocks, even during periods of high load or maintenance activities such as badger compaction.
Actual result
Multiple bootstrappers experienced significant lag, falling behind by more than two blocks for several minutes. This issue was observed numerous times per hour for different bootstrappers, but it seems that only one bootstrapper is affected at a time. The lag coincided with periods when the badger database was undergoing compaction processes.
Please take a look at the attached screenshot and logs.
Relevant log output
Notes
The text was updated successfully, but these errors were encountered: