Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Inaccurate Bucket Intervals for block_interval_seconds Metric #1307 #1308

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jevonearth
Copy link
Contributor

@jevonearth jevonearth commented Apr 16, 2024

Description

This fix sets appropriate histogram buckets for the celestia_consensus_block_interval_seconds_bucket Prometheus metric.

It addresses issue #1307

I also took a shot at improving the HELP description of the metric. I don't know if the metric records the time based on when the node sees the block or if it measures the time between blocks based on block timestamps. If it is the latter, I should adjust the HELP description to reflect that.

Here's a screenshot of what a Grafana graph looks like before/after updating this metric. Based on mochanet data.

image

Metrics before the change look like this;

# HELP celestia_consensus_block_interval_seconds Time between this and the last block.
# TYPE celestia_consensus_block_interval_seconds histogram
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.005"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.01"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.025"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.05"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.1"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.25"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="0.5"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="1"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="2.5"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="5"} 0
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="10"} 1028
celestia_consensus_block_interval_seconds_bucket{chain_id="celestia",version="1.7.0",le="+Inf"} 29770

Metrics after the fix look like this

# HELP celestia_consensus_block_interval_seconds Histogram of time intervals in seconds between consecutive blocks, capturing the distribution of block times as observed by this node.
# TYPE celestia_consensus_block_interval_seconds histogram
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="10"} 9
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="11"} 12
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="12"} 830
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="13"} 885
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="14"} 886
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="15"} 886
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="20"} 886
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="25"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="30"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="40"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="50"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="60"} 897
celestia_consensus_block_interval_seconds_bucket{chain_id="mocha-4",version="1.7.0",le="+Inf"} 897

PR checklist

I have not completed the checklist tasks. I can do these tomorrow where relevant

  • Tests written/updated
  • Changelog entry added in .changelog (we use
    unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments

@jevonearth jevonearth requested a review from a team as a code owner April 16, 2024 00:56
@jevonearth jevonearth requested review from ramin and staheri14 and removed request for a team April 16, 2024 00:56
Copy link
Contributor

@cmwaters cmwaters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would almost feel like using a gauge is better than a histogram to see the block interval.

consensus/metrics.go Outdated Show resolved Hide resolved
rootulp
rootulp previously approved these changes Apr 16, 2024
Adjust buckets as per @cmwaters suggestion.

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
@jevonearth
Copy link
Contributor Author

I would almost feel like using a gauge is better than a histogram to see the block interval.

Hi @cmwaters,

I think Histogram for this metric is overkill, but if we were to change it, I believe a Summary metric type would make more sense. A Gauge is useful for point-in-time status. A Summary metric gives us better visibility of data over time. This makes summaries more suited for understanding trends and variations in block intervals.

I'm open to changing the metric type; it's not hard to do, and the previous state of this metric made the resulting data in Prometheus useless anyway.

But with the change in this PR, the histogram metric type is now also serviceable. :)

(I posted this reply in the wrong place yesterday, reposting it on the main thread instead.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants