
gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default #1033

Merged: 1 commit into main on Apr 13, 2022

Conversation

@bwplotka
Member

commented Apr 13, 2022

Fixes #967

I am not 100% sure we want this, but I would like to start this discussion.

On one hand, @mknyszek and I wanted the new runtime/metrics naming and extra metrics to be exposed by default, so people can leverage them.

On the other hand:

Let me know what you think. IMO we should revert the default to the old behaviour to be safe here and save people from the accidental cost impact it can cause.

Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com
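For illustration, after this revert the runtime/metrics-based set becomes opt-in through the Go collector options added in #1031. A minimal sketch of what opting back in might look like; the WithGoCollections option and the GoRuntime*Collection flags are the names from around this release and may be renamed or superseded in later versions:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()

	// After this revert, only the classic MemStats-style Go metrics are
	// exposed by default. To opt back in to the runtime/metrics-based set,
	// pass the collector options added in #1031 (names as of this release;
	// they may differ in other versions).
	reg.MustRegister(collectors.NewGoCollector(
		collectors.WithGoCollections(
			collectors.GoRuntimeMemStatsCollection | collectors.GoRuntimeMetricsCollection,
		),
	))

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```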

… metrics by default.

Fixes #967

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka changed the title gocollector: Reverted client_golang v1.12 addition of runtime/metrics… gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default Apr 13, 2022
@prymitive
Contributor

I drop most of these new metrics; all of the points listed above apply to me. There's a lot of them, and I either don't know how to use them or they don't tell me much (because they're so low level).

Adding drop rules to all scrapes is a pain; it makes me wish there were a global drop config that applied to all targets.

@kakkoyun merged commit 11ee9ad into main Apr 13, 2022
@kakkoyun deleted the revert-runtime-metrics-default branch April 13, 2022 18:43
@mknyszek
Contributor

If most of the additional cardinality is from histograms, can we just disable those? Once you eliminate those, the cardinality should be just about what it was before.

there’s a lot of them and I either don’t know how to use them or they don’t tell me much (because they’re so low level).

I'm curious to know which you find to be too low level. The majority of the non-histogram metrics are just the same metrics from MemStats with slightly different names (IMO clearer, because, for example, all the memory metrics are now orthogonal to one another). Lots of these are very general as well: reasonably granular memory accounting, plus things that come in handy when you least expect it (e.g. GC cycle rates and total memory allocated can help diagnose lots of issues after the fact, especially in long-term trends).

Coming down the pipeline in the Go world is golang/go#48409 which is going to expose new, useful metrics for identifying when the GC is being throttled to avoid death spirals. This is useful for diagnosing OOMs, linking them back to a bad configuration push or a sudden increase in memory use (or spikier memory use).

It does not scale for the Go team to go to each metrics exporter (Prometheus, etc.) and add these metrics. That's a big motivation behind slurping up all metrics programmatically. I worry that new metrics will only be added after something like this has already impacted someone, which is what I wanted to avoid in the first place.
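For context, this programmatic slurping is exactly what the standard library's runtime/metrics package (Go 1.16+) enables; a small self-contained sketch of enumerating and reading every published runtime metric:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// metrics.All lists every metric the current runtime publishes; metrics
	// added in future Go releases appear here without any exporter changes.
	descs := metrics.All()

	samples := make([]metrics.Sample, len(descs))
	for i, d := range descs {
		samples[i].Name = d.Name
	}
	metrics.Read(samples)

	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64:
			fmt.Printf("%s = %g\n", s.Name, s.Value.Float64())
		case metrics.KindFloat64Histogram:
			h := s.Value.Float64Histogram()
			fmt.Printf("%s = histogram with %d buckets\n", s.Name, len(h.Buckets))
		default:
			fmt.Printf("%s = <unsupported kind>\n", s.Name)
		}
	}
}
```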

From my perspective, a lot of the issues that cropped up over time are the result of my initial flawed integration of runtime/metrics into Prometheus. The flaws, I think, were mostly due to my inexperience with the Prometheus ecosystem generally (going back to the poor choice of histogram representation). As a result, turning off exporting new metrics by default completely in addition to switching to the old metrics feels like throwing the baby out with the bathwater.

That being said, I recognize that this integration of runtime/metrics has been causing you (the maintainers) problems, and this is a straightforward way to resolve that. I apologize for not being fully present to address these issues.

I also recognize that since this has already been merged my point here is somewhat moot. I do hope you'll reconsider the decision to not export new metrics by default (or at least the non-histogram metrics) in the future.

@prymitive
Contributor

I'm curious to know which you find to be too low level.

Things like:

  • go_gc_heap_allocs_by_size_bytes_total_bucket
  • go_gc_heap_frees_by_size_bytes_total_bucket
  • go_sched_latencies_seconds_bucket
  • some go_memory_classes_* or go_memstats_*

I'm just not sure what to do with these, really. In most cases I just need CPU/memory usage (including GC stats). If resource usage is too high, I usually look at pprof profiles to see what's going on.

@kakkoyun
Member

@mknyszek Foremost, we really appreciate your contribution and any help from the Go team. I believe we just need time to polish the implementation and make sure it satisfies most of our users' needs, and I hope you'll help us reach that goal.

We haven't reverted the feature; we just made it optional, in the same way that the Go collector itself is optional.

Besides the minor inconvenience of having to deal with the histogram problem, for which we will soon have a solution thanks to sparse histograms (@beorn7, as their author, has mentioned it a couple of times), cardinality is a huge issue to deal with, one that has direct cost consequences out in the wild.

On top of that, we need to make an extra effort to help our user base learn how to deal with these new metrics. We can provide meaningful alerts and example dashboards so that users get immediate value out of them.

So, all in all, this is just a tiny setback. We'll get more prepared, and then we can reconsider making the new collector the default. Until then, for users who want to enable it, the option is there. I believe we can improve the documentation and be more transparent about the pros and cons of enabling it, so users can choose whether to opt in.

@bwplotka
Member Author

@kakkoyun nailed the explanation. I think we essentially need a more controlled introduction of new metrics. There are literally books written about which metrics you should use for monitoring Go. We can't change that overnight, even if it's "only" the naming that changes. Naming is what tools depend on, not only humans, even if the new names are clearer. By implicitly adding new metrics on top of the old ones we are not helping adoption much, because in most cases people won't notice the new ones (there is no good way of discovering them) - other than having to pay more for them. We need to work around this with documentation, tutorials and tooling changes.

NOTE: It's also not only the histograms that are problematic. The overlap of extra metrics that exist only to rename old ones and let people migrate is significant. Again, without the extra work mentioned above, it will not be useful to others.

Coming down the pipeline in the Go world is golang/go#48409 which is going to expose new, useful metrics for identifying when the GC is being throttled to avoid death spirals. This is useful for diagnosing OOMs, linking them back to a bad configuration push or a sudden increase in memory use (or spikier memory use).

Yes - and it would be awesome to have those! For now, as opt-in. At some point we might consider more granular options, or even a slightly bigger default set, for metrics that are truly popular and useful for everyone.

It does not scale for the Go team to go to each metrics exporter (Prometheus, etc.) and add these metrics. That's a big motivation behind slurping up all metrics programmatically. I worry that new metrics will only be added after something like this has already impacted someone, which is what I wanted to avoid in the first place.

It does not scale for the Go team, but:

  • This situation proved how easy it is to create an unexpected, impactful situation for users; we should probably treat metric changes even more seriously than programmatic API changes in compatibility terms. The part of the Go community that works on those metrics might treat a change lightly and rename a metric or add an enormous histogram because it's important for their use cases. Not many will remember that this has serious consequences for exporters like client_golang (from 1.12 until now) that immediately propagate every runtime/metrics change to thousands of production clusters without any review by metric experts from the exporter domain.
  • You are not alone. If a metric is useful, it will be added as soon as the PR to Go is merged and a version with the new runtime/metrics change is released. It's nice to have these things consumed on a more data-driven, need-driven basis, and not just because a developer thinks they are useful.

From my perspective, a lot of the issues that cropped up over time are the result of my initial flawed integration of runtime/metrics into Prometheus. The flaws, I think, were mostly due to my inexperience with the Prometheus ecosystem generally (going back to the poor choice of histogram representation). As a result, turning off exporting new metrics by default completely in addition to switching to the old metrics feels like throwing the baby out with the bathwater.

Don't worry, it's our fault (mine and @kakkoyun's) that we completely missed those shortcomings. And I truly believe we could fix this within days (and we will, before releasing the new version); it's not a biggie. Still, even with fixes, we are adding too many new metrics without value (the majority of people/tooling won't know about them and won't know how to use them).

That being said, I recognize that this integration of runtime/metrics has been causing you (the maintainers) problems, and this is a straightforward way to resolve that. I apologize for not being fully present to address these issues.

Again, no need to be sorry, @mknyszek; your help was life-saving here. It's ultra-important to work with the Go team on what's coming to the Go runtime in terms of instrumentation 💪🏽 It was amazing to have your contributions; let's move these things forward, just a little bit more safely and opt-in, until we graduate some things to the default! (:

I do hope you'll reconsider the decision to not export new metrics by default (or at least the non-histogram metrics) in the future.

Definitely!

@prymitive
Contributor

NOTE: It's also not only the histograms that are problematic. The overlap of extra metrics that exist only to rename old ones and let people migrate is significant. Again, without the extra work mentioned above, it will not be useful to others.

Renaming metrics and just trying to stay on top of that is a huge pain point for my org. It always takes us ages to upgrade node_exporter, because we need to triple-check that all metrics that got renamed are accounted for in the process. We had alerts broken for months, because the path from "version of node_exporter on some server" to "we alert when bad things happen" is a long piece of string that looks like one of those crazy cartoon contraptions - you need to scrape it, then we often have recording rules to aggregate the metrics, then we have federation that pulls the aggregates, then we have more aggregation to build global aggregates, and then we have alerts. At every step we need to ensure that names are correct, that we filter on the correct labels, and that we're scraping the correct servers and services. Since you don't get any errors from Prometheus when metrics aren't there (because it's perfectly fine to ask for something that doesn't exist), it's all a very fragile ecosystem that requires care and attention. This makes renaming expensive. If you also make changes to labels (add or remove some), it's extra expensive.
This is such a big problem for us that we've built and run a sidecar on every Prometheus server that keeps testing all Prometheus rules and tells us if any metrics used in queries are missing from Prometheus - github.com/cloudflare/pint - just so that we can upgrade node_exporter (or any other exporter) and, if that breaks any alerts, have a chance of noticing. It still leaves us with lots of Grafana dashboards to update, but that's less critical.

@mknyszek
Contributor

Thanks for the replies and the additional context. I ultimately don't disagree with your decision and I certainly don't hold it against you. You do what you need to do for yourselves and the project. I'm also glad it sounds like we're on the same page about where we might want this to be in the future. :) It's clear to me (and this sounds like what you're ultimately getting at) that rolling this out needs a much more fully fleshed out strategy. I'll be thinking about that going forward, for Prometheus and beyond. Your feedback is invaluable.

A few replies:

  • @prymitive I recognize the name changes are intrusive, but it was really supposed to be a one-time thing. Things are really bad if we're making even one name change per release, without notifying ahead of time. It's supposed to be purely additive, at a slow rate. But clearly even the one-time thing is too intrusive for now.
  • @bwplotka @kakkoyun I didn't fully appreciate earlier how baked in the existing metric names are into everything. Looks like there's a lot of work to do in fully making the switch to runtime/metrics from MemStats and GCStats in general. To be clear, I didn't expect everyone to suddenly start using runtime/metrics, but I underestimated how far these concepts push up beyond systems like Prometheus, and how widespread they are.
  • @bwplotka RE: the metric overlap, earlier on in the implementation I was looking for something like an alias mechanism that could be used for migrations, akin to Go's type alias feature. I'm not sure if anything like that is on the roadmap for Prometheus at all, but it seems like it could save a lot of headache, even for changes to application-specific metrics.
  • @prymitive RE: metrics being too low-level, the ones you listed surprised me. It sounds like the Go project needs to do a better job documenting and publicizing things. Part of my plan for golang/go#48409 (runtime/debug: soft memory limit) is to write some better GC documentation, geared toward users trying to understand a Go application, so that should help a little.
  • A few of you have also mentioned that there isn't enough benefit to the renaming and overlapping. I can accept that given the current state of things. For some additional context: to me, there was always a clear long-term benefit for users. When something goes wrong, or you need to investigate something, you'll never have to find yourself wishing you weren't tracking some metric. This is the same principle behind why memory profiling is always on in every Go binary and needs to be turned off explicitly. In terms of switching to runtime/metrics, there's a short-term cost for the switch to this new world (to some degree, this is unavoidable), and a long-term cost in carrying metrics that you might not need all the time. But that cost is outweighed by the engineering cost of having to redeploy and reproduce issues otherwise: the data is always right there.

kakkoyun pushed a commit that referenced this pull request May 13, 2022
… metrics by default. (#1033)

Fixes #967

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
kakkoyun added a commit that referenced this pull request Jul 6, 2022
* Cut v1.12.0

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Bump the day

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Make the Go 1.17 collector thread-safe (#969)

* Use simpler locking in the Go 1.17 collector (#975)

A previous PR made it so that the Go 1.17 collector locked only around
uses of rmSampleBuf, but really that means that Metric values may be
sent over the channel containing some values from future metrics.Read
calls. While generally speaking this isn't a problem, we lose any
consistency guarantees provided by the runtime/metrics package.

Also, that optimization to not just lock around all of Collect was
premature. Truthfully, Collect is called relatively infrequently, and
its critical path is fairly fast (10s of µs). To prove it, this change
also adds a benchmark.

name            old time/op  new time/op  delta
GoCollector-16  43.7µs ± 2%  43.2µs ± 2%   ~     (p=0.190 n=9+9)

Note that because the benchmark is single-threaded it actually looks
like it might be getting *slightly* faster, because all those Collect
calls for the Metrics are direct calls instead of interface calls.

Signed-off-by: Michael Anthony Knyszek <mknyszek@google.com>
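As an illustration of the locking pattern this commit describes (one mutex held for all of Collect, so every value sent on the channel comes from a single metrics.Read snapshot), here is a hedged sketch; the type and field names are hypothetical, not the actual collector internals:

```go
import (
	"runtime/metrics"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// lockedCollector is an illustrative collector. Holding mu for the whole
// Collect call means all exported values come from one metrics.Read
// snapshot; Collect runs rarely and its critical path is tens of
// microseconds, so coarse locking is cheap.
type lockedCollector struct {
	mu      sync.Mutex
	samples []metrics.Sample // shared read buffer, analogous to rmSampleBuf
	descs   []*prometheus.Desc
}

func (c *lockedCollector) Describe(ch chan<- *prometheus.Desc) {
	for _, d := range c.descs {
		ch <- d
	}
}

func (c *lockedCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()

	metrics.Read(c.samples)
	for i, s := range c.samples {
		if s.Value.Kind() == metrics.KindUint64 {
			ch <- prometheus.MustNewConstMetric(
				c.descs[i], prometheus.GaugeValue, float64(s.Value.Uint64()))
		}
	}
}
```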

* API client: make http reads more efficient (#976)

Replace `io.ReadAll` with `bytes.Buffer.ReadFrom`.
Both need to resize a buffer until they have finished reading;
the former increases by 1.25x each time while the latter uses 2x.

Also added a benchmark to demonstrate the benefit:
name             old time/op    new time/op    delta
Client/4KB-8       35.9µs ± 4%    35.3µs ± 3%     ~     (p=0.310 n=5+5)
Client/50KB-8      83.1µs ± 8%    69.5µs ± 1%  -16.37%  (p=0.008 n=5+5)
Client/1000KB-8     891µs ± 6%     750µs ± 0%  -15.83%  (p=0.016 n=5+4)
Client/2000KB-8    1.74ms ± 2%    1.35ms ± 1%  -22.72%  (p=0.008 n=5+5)

name             old alloc/op   new alloc/op   delta
Client/4KB-8       20.2kB ± 0%    20.4kB ± 0%   +1.26%  (p=0.008 n=5+5)
Client/50KB-8       218kB ± 0%     136kB ± 0%  -37.65%  (p=0.008 n=5+5)
Client/1000KB-8    5.88MB ± 0%    2.11MB ± 0%  -64.10%  (p=0.008 n=5+5)
Client/2000KB-8    11.7MB ± 0%     4.2MB ± 0%  -63.93%  (p=0.008 n=5+5)

name             old allocs/op  new allocs/op  delta
Client/4KB-8         75.0 ± 0%      72.0 ± 0%   -4.00%  (p=0.008 n=5+5)
Client/50KB-8         109 ± 0%        98 ± 0%  -10.09%  (p=0.079 n=4+5)
Client/1000KB-8       617 ± 0%       593 ± 0%   -3.89%  (p=0.008 n=5+5)
Client/2000KB-8     1.13k ± 0%     1.09k ± 0%   -3.27%  (p=0.008 n=5+5)

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
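For context, the change described above boils down to the following difference when reading a response body; this is an illustrative sketch, not the actual API client code:

```go
import (
	"bytes"
	"io"
	"net/http"
)

// Old approach: io.ReadAll grows its buffer by roughly 1.25x per resize,
// which means many re-allocations and copies for large bodies.
func readBodyOld(resp *http.Response) ([]byte, error) {
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// New approach: bytes.Buffer doubles its capacity on each resize, so large
// bodies need far fewer allocations, as the benchmark above shows.
func readBodyNew(resp *http.Response) ([]byte, error) {
	defer resp.Body.Close()
	var buf bytes.Buffer
	if _, err := buf.ReadFrom(resp.Body); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```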

* Reduce granularity of histogram buckets for Go 1.17 collector (#974)

The Go runtime/metrics package currently exports extremely granular
histograms. Exponentially bucket any histogram with unit "seconds"
or "bytes" instead to dramatically reduce the number of buckets, and
thus the number of metrics.

This change also adds a test to check for expected cardinality to
prevent cardinality surprises in the future.

Signed-off-by: Michael Anthony Knyszek <mknyszek@google.com>
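A hedged sketch of the re-bucketing idea: keep only bucket boundaries that grow by at least a constant factor, which collapses the runtime's very fine-grained buckets into a small exponential set. The actual helper in client_golang may differ in name and detail:

```go
// reBucketExp thins out a sorted slice of histogram bucket upper bounds so
// that each retained bound is at least `base` times the previously retained
// one. Illustrative only; not the exact client_golang implementation.
func reBucketExp(bounds []float64, base float64) []float64 {
	if len(bounds) == 0 {
		return bounds
	}
	out := []float64{bounds[0]}
	last := bounds[0]
	for _, b := range bounds[1:] {
		// Keep non-positive leading bounds as-is (e.g. 0); only thin out
		// the positive, fine-grained part of the range.
		if last <= 0 || b >= last*base {
			out = append(out, b)
			last = b
		}
	}
	return out
}
```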

* Cut v1.12.1 (#978)

* Cut v1.12.1

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Apply review suggestions

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Fix deprecated `NewBuildInfoCollector` API

Update `examples/random/main.go`:
  `prometheus.NewBuildInfoCollector` is deprecated. Use `collectors.NewBuildInfoCollector` instead.

Signed-off-by: alissa-tung <alissa-tung@outlook.com>
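The migration mentioned above is a one-line change; a minimal sketch, assuming a custom registry named reg:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

func registerBuildInfo(reg *prometheus.Registry) {
	// prometheus.NewBuildInfoCollector is deprecated since v1.12;
	// the collectors package provides the replacement.
	reg.MustRegister(collectors.NewBuildInfoCollector())
}
```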

* gocollector: Added options to Go Collector for changing the (#1031)

* Renamed files.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* gocollector: Added options to Go Collector for different collections.

Fixes #983

Also:

* fixed TestMemStatsEquivalence; it was a no-op before (:
* Removed the gc_cpu_fraction metric completely, since it was no longer working correctly for Go 1.17+

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* gocollector: Reverted client_golang v1.12 addition of runtime/metrics metrics by default. (#1033)

Fixes #967

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* prometheus: Fix convention violating names for generated collector metrics (#1048)

* Fix convention violating names for generated collector metrics

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Add new Go collector example

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Remove -Inf buckets from go collector histograms (#1049)

* Remove -Inf buckets from go collector histograms

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Update prometheus/collectors/go_collector_latest_test.go

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Simplify

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Cut v1.12.2 (#1052)

* Cut v1.12.2

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Apply suggestions

Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>

Co-authored-by: Kemal Akkoyun <kakkoyun@gmail.com>
Co-authored-by: Michael Knyszek <mknyszek@google.com>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Kemal Akkoyun <kakkoyun@users.noreply.github.com>
Co-authored-by: alissa-tung <alissa-tung@outlook.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Successfully merging this pull request may close these issues.

v1.12.0 exposes some high cardinality metrics