Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't expire Prometheus metrics that have been explicitly defined #123

Merged
merged 2 commits into from Apr 22, 2021

Conversation

banks
Copy link
Member

@banks banks commented Apr 14, 2021

#120 did excellent work to allow go-metrics consumers like Consul to define important metrics explicitly so they would not expire if they are not updated. The primary motivation was around ensuring we always provide metric summaries and descriptions for the most important metrics even if they've not changed for a while which was achieved.

The Expiration option in go-metrics though was still acting on Summaries and Gauges - effectively resetting them to NaN after the expiry time. This was intentional per the PR and comment but has a few downsides:

  1. Gauges are often used for things that don't change often but might be relevant to know, for example the size of a queue.
    • In some implementations this is polled regularly to emit the gauge which worked fine provided the poll frequency is faster than the Expiry frequency. Unfortunately the Expiry frequency is exposed to user configuration while poll frequency isn't which can result in gauges disappearing even though they are still valid if the user sets a low expiry value (which Consul's docs currently encourage).
    • In other implementation it might be more practical and efficient to just update a gauge in the one place that the value changes (for example when pushing or popping a queue) which if that happens less frequently than the expiry time means the gauge value keeps disappearing even though it's still relevant.
  2. Summary values were being reset after the expiration time by observing a single NaN sample. While this works to force all the statistics in the summary to NaN, it doesn't really reset it since the number of samples considered is still the total number of samples in the buffer - all the old samples plus the NaN sample. Also, as far as I can tell, the NaN sample remains in the buffer for the next time window so even later legitimate samples will not really be shown correctly until the NaN is aged out. We hard code the summaries to have a 10 second window currently so this was not a huge deal but it was also not an especially useful "reset" semantic.

Revisiting the purpose of the Expiration option, it's really to prevent "ephemeral" metrics that are produced a few times and then never produced again from being held in memory forever. By definition, metrics that are explicitly defined by the application are never ephemeral and will always consume memory in that definition map, so expiring them at all doesn't seem to be valuable. There is nothing gained from "forgetting" the value of any of these metrics and outputting NaN for some time until more values are recorded as far as I can see.

For context, I've just added a couple of gauges to Raft (hashicorp/raft#452, hashicorp/raft#454) that are critical to monitoring for a known and severe failure mode in raft. These are not periodically emitted in a loop as that would be less efficient, but it's hard to use them to monitor for this case effectively if you can't see constant output values since the events that change them are often rare (restoring from snapshot typically only happens on a restart which might have been weeks ago). I thought by explicitly defining them in Consul I'd get constant output but the NaN reset prevents that and on considering it further, I can't think of a case where the reset is actually desirable behaviour at least for explicitly defined metrics.

@mkcp would love your thoughts though - perhaps I missed something where it is important that these reset even for explicitly defined values?

I added a test to ensure that even after expiring the metrics are not reset.

Copy link
Contributor

@ncabatoff ncabatoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for providing the detailed context!

Co-authored-by: Nick Cabatoff <ncabatoff@hashicorp.com>
@dnephin
Copy link
Contributor

dnephin commented Sep 1, 2021

A user has reported that this change may be causing problems in Consul: hashicorp/consul#10730 (comment)

I'm not sure yet if we need to handle this is Consul by reporting an explicit NaN of followers, or if this change is not working as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants