fix: Inaccurate Bucket Intervals for block_interval_seconds Metric #1307 #1308

jevonearth · 2024-04-16T00:56:42Z

Description

This fix sets appropriate histogram buckets for the celestia_consensus_block_interval_seconds_bucket Prometheus metric.

It addresses issue #1307.

I also took a shot at improving the HELP description of the metric. I don't know whether the metric records the interval based on when this node sees each block, or whether it measures the time between blocks based on block timestamps. If it is the latter, I should adjust the HELP description to reflect that.

Here's a screenshot of what a Grafana graph looks like before and after updating this metric, based on mochanet data.

Metrics before the change look like this:

# HELP celestia_consensus_block_interval_seconds Time between this and the last block.
# TYPE celestia_consensus_block_interval_seconds histogram
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.005"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.01"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.025"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.05"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.1"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.25"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.5"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="1"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="2.5"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="5"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="10"} 1028
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="+Inf"} 29770

Metrics after the fix look like this:

# HELP celestia_consensus_block_interval_seconds Histogram of time intervals in seconds between consecutive blocks, capturing the distribution of block times as observed by this node.
# TYPE celestia_consensus_block_interval_seconds histogram
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="10"} 9
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="11"} 12
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="12"} 830
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="13"} 885
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="14"} 886
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="15"} 886
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="20"} 886
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="25"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="30"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="40"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="50"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="60"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="+Inf"} 897
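
To make the difference between the two outputs concrete, here is a minimal stdlib-only Go sketch of how cumulative histogram buckets are counted. The bucket bounds are copied from the `le` labels in the two outputs above; the `cumulativeCounts` helper and the sample intervals are hypothetical and are not taken from consensus/metrics.go.

```go
package main

import "fmt"

// cumulativeCounts mimics how a Prometheus histogram exposes its buckets:
// each bucket counts every observation less than or equal to its upper
// bound, and the final slot plays the role of the +Inf bucket.
// This is an illustrative helper, not code from this PR.
func cumulativeCounts(bounds []float64, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1) // last slot is the +Inf bucket
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++
			}
		}
		counts[len(bounds)]++
	}
	return counts
}

func main() {
	// Upper bounds copied from the "before" and "after" outputs above.
	oldBounds := []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
	newBounds := []float64{10, 11, 12, 13, 14, 15, 20, 25, 30, 40, 50, 60}

	// Hypothetical block intervals in seconds, for illustration only.
	intervals := []float64{9.8, 11.4, 11.7, 11.9, 12.3, 12.6, 24.1}

	fmt.Println("old buckets:", cumulativeCounts(oldBounds, intervals))
	fmt.Println("new buckets:", cumulativeCounts(newBounds, intervals))
}
```

With the old bounds, almost every sample lands only in the +Inf bucket, which matches the "before" output, where 1028 of 29770 observations are at or below 10 seconds. With the new bounds, the samples spread across the 10 to 25 second buckets.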

PR checklist

I have not completed the checklist tasks. I can do these tomorrow where relevant.

Tests written/updated
Changelog entry added in .changelog (we use unclog to manage our changelog)
Updated relevant documentation (docs/ or spec/) and code comments

cmwaters

I would almost feel like using a gauge is better than a histogram to see the block interval.

consensus/metrics.go

Adjust buckets as per @cmwaters suggestion. Co-authored-by: Callum Waters <cmwaters19@gmail.com>

jevonearth · 2024-04-17T15:20:52Z

I would almost feel like using a gauge is better than a histogram to see the block interval.

Hi @cmwaters,

I think a histogram is overkill for this metric, but if we were to change it, I believe a Summary metric type would make more sense than a Gauge. A Gauge is useful for point-in-time status, whereas a Summary gives us better visibility of the data over time, which makes it better suited to understanding trends and variations in block intervals.

I'm open to changing the metric type; it's not hard to do, and the previous state of this metric made the resulting data in Prometheus useless anyway.

But with the change in this PR, the histogram metric type is now also serviceable. :)

(I posted this reply in the wrong place yesterday, reposting it on the main thread instead.)

fix: Inaccurate Bucket Intervals for block_interval_seconds Metric

5201fc6

jevonearth requested a review from a team as a code owner April 16, 2024 00:56

jevonearth requested review from ramin and staheri14 and removed request for a team April 16, 2024 00:56

cmwaters reviewed Apr 16, 2024

rootulp assigned jevonearth Apr 16, 2024

rootulp previously approved these changes Apr 16, 2024

jevonearth dismissed rootulp’s stale review via fa79396 April 17, 2024 15:19

Update consensus/metrics.go

fa79396

rootulp approved these changes Apr 17, 2024
