Enhancement for metricsConfig redesign #2182

Merged: 1 commit into openshift:master on May 10, 2024

Conversation

suhanime (Contributor) opened this pull request.

@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jan 10, 2024.
openshift-ci bot commented on Jan 10, 2024:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jan 10, 2024.
@suhanime marked this pull request as ready for review on January 12, 2024.
@openshift-ci bot removed the do-not-merge/work-in-progress label on Jan 12, 2024.
@openshift-ci bot requested review from dlom and lleshchi on January 12, 2024.
2uasimojo (Member) left a comment:

Good stuff @suhanime, thank you.


It can get confusing for consumers to know the fixed labels each metric has. Update the [hive_metrics](https://github.com/openshift/hive/blob/master/docs/hive_metrics.md) doc to list all labels and keep it up to date.

Also, the more customizations there are, the longer hiveConfig will be. Currently, we view this as an acceptable change.
2uasimojo (Member):

Mm, this is a good point. There's a limit on the size of a k8s object. We should probably do a little bit of back-of-napkin math to see roughly what order of magnitude this config could be for scenarios where e.g. every metric gets configured for platform and version labels.

Today on hivep02ue1 (all prod shards should be identical, or near enough as makes no difference):

$ oc get hiveconfig hive -o yaml | wc -c
77278

suhanime (Contributor, Author):

So, a k8s object can be about 3MB in size, and given the size:lines ratio of prod and staging hiveconfigs, if we assume 100 bytes per line, that allows roughly 30,000 lines for hiveconfig. Right now, one of those is already at 800+ lines.
My rough calculation shows about 29 metrics that would be classified as clusterDeploymentMetrics, and if we assume filters like the example (say 15 lines per metric), we're looking at 400+ extra lines if all affected metrics are defined. While we can assume we won't exceed the size constraints of a k8s object, the hiveconfig would certainly not be easy to read, and we'd have to be mindful of this if we ever plan to add more customization for other metrics that are not related to clusterDeployment (like clusterPool, syncSets, selectorSyncSets, hive operator and controller related metrics).
I have proposed grouping metrics per customization, so instead of each entry of clusterDeploymentRelatedMetrics being an individual metric, we can instead provide a list of metrics. This would make the implementation trickier, but it would be a more robust design, and of course, the admin is free to provide just 1 metric instead of an entire list.
This would also make it easy for admins to group filters (like "do X for these aro clusters for these Y metrics we care about"). We'd need strict validation to ensure there are no ambiguous entries.
What do you think?
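[Editor's note] A minimal sketch of how that grouping might look as Go API types, assuming the entry-level placement of customizations discussed later in this thread; all type and field names here are invented for illustration, not the final schema:

```go
package v1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MetricsConfig sketches the grouped shape proposed above: each entry of
// MetricsToReport applies one set of customizations to a whole list of
// metrics, instead of one stanza per metric.
type MetricsConfig struct {
	MetricsToReport []MetricsToReportEntry `json:"metricsToReport,omitempty"`
}

// MetricsToReportEntry enables the metrics in MetricNames and applies the
// optional customizations below to all of them.
type MetricsToReportEntry struct {
	// MetricNames is a non-empty list of full hive metric names. Strict
	// validation would reject a metric appearing in more than one entry.
	MetricNames []string `json:"metricNames"`
	// MinimumDuration, if set, suppresses observations below this threshold
	// (only meaningful for duration-based metrics).
	MinimumDuration *metav1.Duration `json:"minimumDuration,omitempty"`
	// AdditionalClusterDeploymentLabels maps prometheus label names to
	// ClusterDeployment label keys whose values are copied onto the metric.
	AdditionalClusterDeploymentLabels map[string]string `json:"additionalClusterDeploymentLabels,omitempty"`
	// ClusterDeploymentLabelSelector, if set, restricts reporting to
	// ClusterDeployments matching the selector.
	ClusterDeploymentLabelSelector *metav1.LabelSelector `json:"clusterDeploymentLabelSelector,omitempty"`
}
```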

2uasimojo (Member):

Thanks for doing the numbers. Given what you've computed, I'm really not worried about exceeding the k8s object size limit, even if we extend this to every metric we produce, and even if we add more later.

I'm also not worried about hiveconfig being awkward for humans to read. That ship has sailed, what with all the privatelink jazz in there. I'm fine with the user doing jq/yq wizardry if they really need something they can't just do a quick text search for.

That said, I do like your idea of being able to group filters by supplying a list of metric names. Let me take a closer look at the proposal and get back to you.

2uasimojo (Member):

[Later] Having seen it, I still like it, and don't see any obvious gotchas. Let's see what other reviewers think, but I'm inclined to go with it.

suhanime (Contributor, Author) left a comment:

@2uasimojo ready for another pass

@suhanime force-pushed the HIVE-2344 branch 2 times, most recently from 6c7e0b0 to 568b41b on January 26, 2024.
2uasimojo (Member) left a comment:

This is looking great. I think the only crucial issues at this stage are

  • Deciding whether to make all affected metrics off by default
  • Deciding on default behavior (if neither metrics config is provided) during the deprecation period.

bmeng (Contributor) commented on Feb 2, 2024:

Having read through the design doc, I do not think there will be any issue from the SREP side.

tzvatot commented:

Reading through this, I have several questions:

  1. Is there/should there be an ADR/DDR for this? I was working on https://docs.google.com/document/d/1lYK_S3LCt-gCwV9d4Y5Oe2FeBfQ_6C_EyudCRtOrw3k/edit#heading=h.bupciudrwmna which kind of overlaps.
  2. What will be the frequency of updating the metrics? CS is reconciling on the entire fleet, so we should be mindful of the performance penalty of getting a load of metrics updates simultaneously.
  3. While this may solve the hive domain, we still have the same challenge with ACM/Hypershift. It would be better to have a single way of reporting metrics (a new CRD?) that we can potentially re-use in the HCP domain. Having divergent solutions (hive, HCP, and whatever comes next) slows down the development process.
  4. I added a comment on the jira ticket referring to the examples provided there: https://issues.redhat.com/browse/HIVE-2344?focusedId=24065859&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24065859

2uasimojo (Member) replied on Feb 5, 2024:

Hi @tzvatot. Thank you for reviewing! TL;DR we expect this enhancement to be of little interest to CS, but wanted to get your eyes on it just in case :)

  1. I don't believe there's overlap with that ADR, which IIUC is looking to get away from using metrics to gather per-cluster data. That's appropriate IMO: metrics should aggregate trends, not report on individual clusters. That way lies cardinality hell.
  2. We update the metrics continuously. Some are updated in real time as we reconcile the objects that affect them. Others are updated periodically (every 2 minutes) as we poll those objects. In all cases there is a certain amount of delay built in, based on the scrape interval of the prometheus service. All standard prometheus patterns, I believe.
  3. I can't see a world where it makes sense to report metrics in a CRD, nor where it makes sense for two different services (hive and hypershift) to attempt to converge on design of specific metrics. I suspect we're conflating use cases here. It makes sense to converge on reporting metadata as described by the ADR you linked -- but that's a different thing entirely.
  4. Ack.

suhanime (Contributor, Author) left a comment:

@2uasimojo ready for another pass.
I think we still differ on whether all metrics should be made optional or not, and I personally would like to handle that separately. Let's discuss some more on that.

@suhanime force-pushed the HIVE-2344 branch 3 times, most recently from 6be1199 to 9a437db on February 21, 2024.
suhanime (Contributor, Author) left a comment:

@2uasimojo Hopefully this is it?


#### Failure modes

These situations will result in metrics not getting registered and hiveConfig.Status.Conditions[ReadyCondition] will be set to false:.
suhanime (Contributor, Author):

Okay, so I looked at your suggestion for registering the metric in the hiveOperator reconcile, but it wouldn't work correctly because we would need to share the registry with which we're registering the metric. I'm still unsure if we can make it work, and there's separate registration for HiveOperator and Hive controller metrics - so I didn't specify the implementation details here, just the result.

2uasimojo (Member):

Well, hold on, what harm would there actually be in registering them if we never report them? Doesn't the /metrics endpoint only show time series with values, or am I remembering that wrong?

2uasimojo (Member):

[Later] We could also create a separate registry just for this purpose, and throw it away.

2uasimojo (Member):

Re "metrics not getting registered", I think we'll actually end up just... not deploying the controllers at all, right? Since hive-operator will be erroring out of its reconcile loop? So indeed metrics wouldn't get registered... because the thing that would be registering them isn't running :)

suhanime (Contributor, Author):

I meant, I have never actually implemented registering the metric in one controller and reporting it via another - given that a lot of these implementation details are handled by the metrics library, we would have to try it to find out. Regardless of how we choose to implement it (yeah, I'm making this a future-me problem), the result will be the hiveConfig ReadyCondition getting set to false, so that's the only thing I'm choosing to mention in the enhancement.

2uasimojo (Member):

Ah, okay, this may help:

We would be registering the metrics in hive-operator only as a means to validate that they work properly.

We will also need to register them in the controllers. Those are separate processes running in separate pods, so they can't share one registration.

(But they can share source code, and should, since the registrations themselves should be identical in both places.)

Suggested change:
- These situations will result in metrics not getting registered and hiveConfig.Status.Conditions[ReadyCondition] will be set to false:.
+ These situations will result in hive-operator failing to deploy the controllers. It will set hiveConfig.Status.Conditions[ReadyCondition] to `"False"` with an appropriate error message.
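[Editor's note] A minimal sketch of the throwaway-registry idea floated above, assuming a registerAll helper (hypothetical name) whose source is shared between hive-operator and the controllers:

```go
package metricsvalidation

import "github.com/prometheus/client_golang/prometheus"

// validateRegistrations lets hive-operator check that every configured
// metric registers cleanly without serving any metrics itself: register
// everything against a scratch registry, report any error, and let the
// registry be garbage-collected. The controllers would call the same
// registerAll against their real registry.
func validateRegistrations(registerAll func(prometheus.Registerer) error) error {
	scratch := prometheus.NewRegistry() // separate registry, thrown away after use
	return registerAll(scratch)
}
```

On failure, hive-operator would surface the returned error on hiveConfig.Status.Conditions[ReadyCondition], per the suggestion above.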



### Risks and Mitigations
The biggest risk is the sheer number of metrics that are going to be affected. This would require thorough testing to ensure no metric changes its behaviour unexpectedly.
suhanime (Contributor, Author):

I removed the calculation for the total metrics affected. All duration-based metrics should support minimumDuration, regardless of whether they're related to cluster deployment or not, but the other 2 customizations can only be supported if we have a cluster deployment available. I decided to let those implementation details be clarified in the hive docs for metrics once we're done with these changes.

suhanime (Contributor, Author):

If it matters: 17 metrics will now support minimumDuration; 6 of them already have it, so that's 11 more than we already do. The rest of the calculations stay the same - 19 more will have additional label support and 29 will support label selection.
We can report 52 total metrics as of now, so 46 of them will now become optional.
We originally wanted the math because we were worried about exceeding the size of hiveconfig; since that's no longer an issue, there's no point calling it out.

2uasimojo (Member) left a comment:

Yup, this is getting real close. Only really one substantive change (schema compat discrepancy I missed when I suggested the current hiveconfig shape).

suhanime (Contributor, Author) left a comment:

@2uasimojo Final Pass?

2uasimojo (Member) left a comment:

If you agree with my suggestions, we can open this up for broader review now.

Comment on lines 114 to 137
There can only be 1 entry for `metricsToReport`, however it can have multiple entries for `metricNames`.

#### metricNames
`metricNames` would be a non-empty list of hive metrics that need to be reported, and can have optional customizations.
All customizations listed for an entry of `metricNames` will apply to all the metrics provided in the corresponding list. This allows for grouping filters for a list of relevant metrics.
While multiple entries of `metricNames` are allowed within `metricsToReport`, no duplicate entries of a hive metric will be allowed in `metricsToReport` inorder to avoid ambiguity.
Implementation will be adapting the existing setup for optional metrics. However, instead of a shorthand camelCase key that is used for current `hiveconfig.spec.metricsConfig.metricsWithDuration`, we would now need the full metric name listed under `metricNames`.
In the example above, hive would only log `hive_foo_counter`, `hive_foo_gauge` and `hive_foo_histogram` metrics.

#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport.metricNames[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.

#### additionalClusterDeploymentLabels
Deprecate current implementation of hiveConfig.spec.metricsConfig.additionalClusterDeploymentLabels and change it to be reported as metricsConfig.metricsToReport.metricNames[].additionalClusterDeploymentLabels. Its implementation will not change.
In the example above, `hive_foo_counter` and `hive_foo_gauge` would report an additional label `prom_label_name`, its value corresponding to the value of `hive.openshift.io/cd-label-key` label on the corresponding clusterDeployment.

#### clusterDeploymentLabelSelector
This would be a new feature, of type [LabelSelector](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#LabelSelector), and we'd use the LabelSelector.MatchLabels and/or LabelSelector.MatchExpressions to match the conditions in order to decide if a metric should be reported.
This encapsulates slightly more advanced filter logic over the existing clusterDeployment labels. In the example above, `hive_foo_counter` and `hive_foo_gauge` metrics will only be reported for the clusterDeployments that are labelled with aro-snowflake and not in limited support.

### Implementation Details / Notes

- All the options configured for a metric within metricNames, will work in tandem with each other. For ex, if all possible options are specified for a metric, then that metric will be reported with the additional labels as per additionalClusterDeploymentLabels, and will be reported only if it matches the labels and/or expressions as per clusterDeploymentLabelSelector and if the duration to be reported exceeds the minimumDuration
2uasimojo (Member):

Again, the reader will probably "get it", but this still isn't quiiite right. Let me see if I can reword...

Suggested change (replacing the quoted block above with):
`metricsToReport` is a list, each element of which requests reporting and configures filtering and customizations for the metrics provided in its `metricNames` list.
#### metricNames
`metricNames` would be a non-empty list of hive metrics that need to be reported, and can have optional customizations.
All customizations listed for an entry of `metricsToReport` will apply to all the metrics provided in that entry's `metricNames` list. This allows for grouping filters for a list of relevant metrics.
A metric name must appear at most once across all `metricsToReport[].metricNames[]` in order to avoid ambiguity.
Implementation will be adapting the existing setup for optional metrics. However, instead of a shorthand camelCase key that is used for current `hiveconfig.spec.metricsConfig.metricsWithDuration`, we would now need the full metric name listed under `metricNames`.
In the example above, hive would only log `hive_foo_counter`, `hive_foo_gauge` and `hive_foo_histogram` metrics.
#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.
#### additionalClusterDeploymentLabels
Deprecate current implementation of hiveConfig.spec.metricsConfig.additionalClusterDeploymentLabels and change it to be reported as metricsConfig.metricsToReport[].additionalClusterDeploymentLabels. Its implementation will not change.
In the example above, `hive_foo_counter` and `hive_foo_gauge` would report an additional label `prom_label_name`, its value corresponding to the value of `hive.openshift.io/cd-label-key` label on the corresponding clusterDeployment.
#### clusterDeploymentLabelSelector
This would be a new feature, of type [LabelSelector](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#LabelSelector), and we'd use the LabelSelector.MatchLabels and/or LabelSelector.MatchExpressions to match the conditions in order to decide if a metric should be reported.
This encapsulates slightly more advanced filter logic over the existing clusterDeployment labels. In the example above, `hive_foo_counter` and `hive_foo_gauge` metrics will only be reported for the clusterDeployments that are labelled with aro-snowflake and not in limited support.
### Implementation Details / Notes
- All the options configured for a `metricsToReport` entry will work in tandem with each other for each metric listed in its `metricNames`. For ex, if all possible options are specified for a metric, then that metric will be reported with the additional labels as per additionalClusterDeploymentLabels, and will be reported only if it matches the labels and/or expressions as per clusterDeploymentLabelSelector and if the duration to be reported exceeds the minimumDuration
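[Editor's note] The enhancement doc's "example above" isn't visible in this thread. Purely as illustration, a configuration consistent with these descriptions might look like the following, reusing the sketch types from earlier: the hive_foo_* names are the doc's placeholders, the selector label keys are invented, and minimumDuration sits on the gauge/histogram group per the later comment in this thread that counters have no duration.

```go
package v1

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleMetricsConfig: one grouped entry whose customizations apply to both
// listed metrics, plus a second entry that merely switches a counter on.
var exampleMetricsConfig = MetricsConfig{
	MetricsToReport: []MetricsToReportEntry{
		{
			MetricNames:     []string{"hive_foo_gauge", "hive_foo_histogram"},
			MinimumDuration: &metav1.Duration{Duration: 10 * time.Minute},
			AdditionalClusterDeploymentLabels: map[string]string{
				// prometheus label name -> ClusterDeployment label key
				"prom_label_name": "hive.openshift.io/cd-label-key",
			},
			ClusterDeploymentLabelSelector: &metav1.LabelSelector{
				// "labelled with aro-snowflake and not in limited support";
				// both label keys are invented for this sketch
				MatchLabels: map[string]string{"aro-snowflake": "true"},
				MatchExpressions: []metav1.LabelSelectorRequirement{{
					Key:      "limited-support",
					Operator: metav1.LabelSelectorOpDoesNotExist,
				}},
			},
		},
		{MetricNames: []string{"hive_foo_counter"}}, // no duration-based filtering
	},
}
```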


Comment on lines 152 to 154
- metricsToReport.metricNames[$name] doesn't exist as a metric.
- metricsToReport.metricNames[$name] is a metric without duration but minimumDuration was specified.
- there are multiple entries of metricsToReport.metricNames[$name].
2uasimojo (Member):

Suggested change (replacing the quoted block above with):
- `metricsToReport[].metricNames[$name]` doesn't exist as a metric.
- `metricNames` lists a metric without duration but `minimumDuration` is specified in the same `metricsToReport` entry.
- the same metric is mentioned more than once across all `metricsToReport[].metricNames[]`.
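[Editor's note] As a sketch of how that "at most once" rule could be enforced (function name invented, types from the earlier sketch):

```go
package v1

import "fmt"

// validateNoDuplicateMetricNames rejects configs where the same metric is
// mentioned more than once across all metricsToReport[].metricNames[].
func validateNoDuplicateMetricNames(cfg MetricsConfig) error {
	seen := map[string]bool{}
	for _, entry := range cfg.MetricsToReport {
		for _, name := range entry.MetricNames {
			if seen[name] {
				return fmt.Errorf("metric %q listed more than once across metricsToReport", name)
			}
			seen[name] = true
		}
	}
	return nil
}
```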

suhanime (Contributor, Author) commented on Mar 1, 2024:

@2uasimojo Implemented all your suggestions and rebased, this is ready for a broader review.

@bmeng, @tzvatot and @hongkailiu Can you please review this again? We made some changes; the biggest one is that we have decided to make all metrics optional (i.e., hive will not report any metrics by default after the deprecation period).

@berenss Is it okay to assume ACM doesn't use Hive metrics, hence it won't be affected by these changes?

@maorfr Would you be able to help find an appropriate reviewer from AppSRE?

2uasimojo (Member):
Thanks @suhanime! This lgtm to open for broader review.

suhanime (Contributor, Author):
@dustman9000 @patjlm @janboll Added you to the Reviewers. Please review at your earliest convenience.

@berenss and @tzvatot confirmed no impact to ACM and OCM

Waiting for the go-ahead from SREP, App-SRE and @hongkailiu

- [Risks and Mitigations](#risks-and-mitigations)

## Summary
As a Hive user interested in customizing metrics, I would like to specify customizations for individual metrics.
A contributor commented:

The problem statement is not entirely clear to me. It seems like there are multiple issues simmering regarding "how to effectively operate hive", and they are all trying to be solved through how hive produces metrics.

I think for an effective metrics solution, we need a concrete operational narrative that defines the role of metrics within it. Currently metrics are being proposed as solutions to:

  • observability
  • alerting
  • troubleshooting (inferring this)

IMO, stakeholder expectations need to be defined and agreed upon, including those that are not RH SRE. Then we can assess the solution against the problem statement.

suhanime (Contributor, Author) replied:

I changed the summary to be more precise - does it help?
I can see your point about having a concrete operational narrative around metrics, but to be clear, hive's job is to publish the metrics, and it is up to the consumer how they want to use them. You can very well use the same metric for both observability and alerting. I don't think prometheus and metrics are the solution for troubleshooting, though you can definitely use them to notice patterns.
I would also want to point out that, as part of this enhancement, we're just changing some internal code to make things easier - we're not introducing any new metrics. We've also had it reviewed by all major consumers of hive.


#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport.metricNames[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.
A reviewer commented:

Isn't a counter just being incremented? Which value do you compare here?

suhanime (Contributor, Author):

My bad - a counter can't have a minimumDuration, so I switched it out in the example for a histogram.

- if deprecated methods are used along with the new metricsToReport. You can either choose the _old_ way or the _new_ way.


### Risks and Mitigations
A reviewer commented:

Do you have processes or tools in mind to prevent accidentally enabling a lot of metrics?
Or, asked another way: can you predict the impact/outcome of a config change?

suhanime (Contributor, Author):

I understand the intent of asking this, but the entire intent of this proposal is to put the power in the admin's hands.
I see some 50+ metrics in our docs; only 7 of them are optional right now and not logged by default, so there isn't a risk of logging too many metrics.
Our biggest concern is usually cardinality - we can document our concerns and warnings well, and we can also include warnings in the docstrings of HiveConfig or maybe the logs (though I'm not sure anyone checks those warnings in logs).
As far as predicting the impact of a config change goes, no, we can't - not to the extent you're asking. Real concerns come into play only when a hive instance maintains too many clusters and/or you really want to enforce a per-cluster label on metrics - which simply isn't a good design and not what prometheus or metrics are meant for.
Since we're offering avenues where there could potentially be concerns, we're also offering the countermeasures of applying a minimum threshold or label selectors - or the option of not publishing the metrics you do not need - so we can lessen the chances of the metrics blowing up.

- Make all hive metrics optional
- Instead of throwing a panic, update hiveconfig ready condition to
  false.
- Add LabelSelector support for clusterDeployment related metrics
- Expand support of minimumDuration and
  additionalClusterDeploymentLabels for all eligible metrics
As a Hive user interested in consuming the metrics it publishes, I would like to be able to configure the metrics as per my needs.
This includes the ability to choose which metrics are published, reduce the amount of observations for certain metrics unless it meets configured conditions, and the ability to add labels to the reported metrics for grouping and filtering purposes.

## Motivation
2uasimojo (Member):

Excellent update here, thanks!

2uasimojo (Member):
/lgtm

We may find little things to tweak/add as we implement, but I think this is ready to land. Nice work!

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on May 10, 2024.
openshift-ci bot commented on May 10, 2024:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, suhanime

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot bot merged commit 83aedb9 into openshift:master on May 10, 2024.
3 checks passed