
Issue with Bootstrappers Lagging Behind During Badger Compaction #3300

Open
smuu opened this issue Apr 8, 2024 · 5 comments
Labels: bug (Something isn't working), external (Issues created by non node team members), wait_for_shwap

@smuu
Member

smuu commented Apr 8, 2024

Celestia Node version

v0.13.2

OS

docker/kubernetes

Install tools

No response

Others

No response

Steps to reproduce it

  1. Set up monitoring and alerting systems to detect when bootstrappers lag (a rough sketch of the lag check is included after this list).
  2. Observed warnings/alerts indicating some bootstrappers were lagging behind more than two blocks for periods up to ~5 minutes.
  3. Checked for lag incidents and found one specific instance occurring in the morning.
  4. Reviewed logs around the time of the incident and identified badger compaction processes running concurrently with the lag.
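
For illustration, here is a rough sketch of the lag check our alerting encodes, using the thresholds from this report (more than two blocks behind for up to ~5 minutes). The `localHead` and `networkHead` helpers are hypothetical placeholders for however the monitoring stack obtains the two heights (node RPC, metrics endpoint, etc.); this is not the actual alerting code.

```go
package main

import (
	"log"
	"time"
)

// localHead and networkHead are hypothetical placeholders for however the
// monitoring system obtains the bootstrapper's own height and the network head.
func localHead() uint64   { return 0 }
func networkHead() uint64 { return 0 }

func main() {
	const maxLag = 2               // blocks a bootstrapper may trail the network
	const window = 5 * time.Minute // how long the lag must persist before alerting
	var laggingSince time.Time

	for range time.Tick(15 * time.Second) {
		lag := int64(networkHead()) - int64(localHead())
		if lag <= maxLag {
			laggingSince = time.Time{} // caught up again, reset the timer
			continue
		}
		if laggingSince.IsZero() {
			laggingSince = time.Now()
		}
		if time.Since(laggingSince) >= window {
			log.Printf("bootstrapper lagging by %d blocks for %s", lag, time.Since(laggingSince))
		}
	}
}
```

In practice this condition feeds a metrics/alerting pipeline rather than log lines, but the check is the same.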

Expected result

Bootstrappers should remain within a close range of the current block height, not lagging behind by more than one or two blocks, even during periods of high load or maintenance activities such as badger compaction.

Actual result

Multiple bootstrappers experienced significant lag, falling behind by more than two blocks for several minutes. This issue was observed numerous times per hour for different bootstrappers, but it seems that only one bootstrapper is affected at a time. The lag coincided with periods when the badger database was undergoing compaction processes.

Please take a look at the attached screenshot and logs.

Relevant log output

https://pastebin.com/kgCELKfC

Notes

Screenshot from 2024-04-08 11-43-06

@smuu smuu added the bug label Apr 8, 2024
@github-actions github-actions bot added the external label Apr 8, 2024
@musalbas
Member

musalbas commented Apr 8, 2024

Are you sure there aren't also the same compactor logs when celestia-node is running normally? The compactor is usually running hard all the time.

If the lag was compactor related, I would expect to see the log "L0 was stalled", but I can't see any here.

@smuu
Member Author

smuu commented Apr 8, 2024

For another badger compaction event, I can't see a lag.

https://pastebin.com/MKKNvYqV

Screenshot from 2024-04-08 12-54-33

@Wondertan
Member

EDIT: THIS IS UNRELATED TO THIS ISSUE. DURING THAT TIME, THE CONSENSUS NODES WERE NOT REACHABLE.

So if consensus nodes are not reachable, the bridges lag behind, meaning there is no issue in the node. @smuu, let's close it then.

@smuu
Member Author

smuu commented Apr 9, 2024

EDIT: THIS IS UNRELATED TO THIS ISSUE. DURING THAT TIME, THE CONSENSUS NODES WERE NOT REACHABLE.

So if consensus nodes are not reachable, the bridges lag behind, meaning there is no issue in the node. @smuu, let's close it then.

Sorry for the confusion. I was adding another report that I thought was related, but it's not, so I deleted those comments. It does not change my original report. When consensus is not reachable, all bridge nodes lag behind. The issue I was reporting is that only one bootstrapper lags behind at a time.

@walldiss
Member

I reviewed the log you attached, and it shows that the node was slow during the fetch+store operation at the time of lagging. I examined the Store.Put metrics for that node and found that Put() times were over 10 seconds on average during that period, suggesting even higher peak values due to it being a moving average metric. The logs indicate that Badger compaction was running in the background, which likely affected the Put() performance.

(Screenshot: Store.Put metrics for the affected node)
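
For reference, a minimal standalone sketch (not the node's actual write path) that times raw Badger writes of ~2 MB values, which is one way to observe compaction-related Put latency spikes like the ones above. The database path, key scheme, and 1-second threshold are arbitrary choices for the sketch.

```go
package main

import (
	"crypto/rand"
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Arbitrary local path for the sketch; not the node's datastore.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-put-bench"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	val := make([]byte, 2<<20) // ~2 MB payload, mirroring the block size mentioned above
	if _, err := rand.Read(val); err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 1000; i++ {
		key := []byte{byte(i >> 8), byte(i)}
		start := time.Now()
		err := db.Update(func(txn *badger.Txn) error {
			return txn.Set(key, val)
		})
		if err != nil {
			log.Fatal(err)
		}
		// Writes that take seconds rather than milliseconds point at compaction pressure.
		if elapsed := time.Since(start); elapsed > time.Second {
			log.Printf("slow Put: i=%d took %s", i, elapsed)
		}
	}
}
```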

Given that compaction typically doesn't cause major lag and resolves after completion, I suggest:

Increase the alert threshold to a >7 block delay sustained for 2 minutes. Also, add instructions to the alert to examine Put() times, as they could indicate a Badger-related issue.

As for the Badger issue, there are two upcoming things that will help with it:

  • Pruning will significantly decrease the storage size, reducing the impact of compaction on Put() performance overall.
  • We will ship a new storage engine with Shwap that will phase out inverted_index, which currently occupies 99% of Badger storage. This will eliminate all compaction-related issues and improve Put() times from the current 3-10 seconds to under 100 milliseconds, with benchmarks showing approximately 20 milliseconds for 2 MB blocks (a ~100x improvement).

@walldiss walldiss self-assigned this Apr 26, 2024