Enhancement for metricsConfig redesign #2182

Merged: 1 commit into openshift:master on May 10, 2024

Conversation

suhanime (Contributor) opened this pull request.

@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jan 10, 2024.
openshift-ci bot commented on Jan 10, 2024:

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jan 10, 2024.
@suhanime marked this pull request as ready for review on January 12, 2024.
@openshift-ci bot removed the do-not-merge/work-in-progress label on Jan 12, 2024.
@openshift-ci bot requested review from dlom and lleshchi on January 12, 2024.
2uasimojo (Member) left a comment:

Good stuff @suhanime, thank you.


It can get confusing for consumers to know the fixed labels each metric has. Update the [hive_metrics](https://github.com/openshift/hive/blob/master/docs/hive_metrics.md) doc to list all labels and keep it up to date.

Also, the more customizations there are, the longer hiveConfig will be. Currently, we view this as an acceptable change.
2uasimojo (Member):

Mm, this is a good point. There's a limit on the size of a k8s object. We should probably do a little bit of back-of-napkin math to see roughly what order of magnitude this config could be for scenarios where e.g. every metric gets configured for platform and version labels.

Today on hivep02ue1 (all prod shards should be identical, or near enough as makes no difference):

$ oc get hiveconfig hive -o yaml | wc -c
77278

suhanime (Contributor, Author):

So, a k8s object can be about 3MB in size, and given the size:lines ratio of prod and staging hiveconfigs, if we assume 100 bytes per line, that allows roughly 30,000 lines for hiveconfig. Right now, one of those is already at 800+ lines.
My rough calculation shows about 29 metrics that would be classified as clusterDeploymentMetrics, and if we assume filters like the example (say 15 lines per metric), we're looking at 400+ extra lines if all affected metrics are defined. While we can assume we won't exceed the size constraints of a k8s object, the hiveconfig would certainly not be easy to read, and we'd have to be mindful of this if we ever plan to add more customization for other metrics that are not related to clusterDeployment (like clusterPool, syncSets, selectorSyncSets, hive operator and controller related metrics).
I have proposed grouping metrics per customization, so instead of each entry of clusterDeploymentRelatedMetrics being an individual metric, we can instead provide a list of metrics. This would make the implementation trickier, but it would be a more robust design, and of course, the admin is free to provide just 1 metric instead of an entire list.
This would also make it easy for admins to group filters (like "do X for these aro clusters for these Y metrics we care about"). We'd need strict validation to ensure there are no ambiguous entries.
What do you think?
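[Editor's note] A minimal sketch of how that grouping might look as Go API types, assuming the entry-level placement of customizations discussed later in this thread; all type and field names here are invented for illustration, not the final schema:

```go
package v1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MetricsConfig sketches the grouped shape proposed above: each entry of
// MetricsToReport applies one set of customizations to a whole list of
// metrics, instead of one stanza per metric.
type MetricsConfig struct {
	MetricsToReport []MetricsToReportEntry `json:"metricsToReport,omitempty"`
}

// MetricsToReportEntry enables the metrics in MetricNames and applies the
// optional customizations below to all of them.
type MetricsToReportEntry struct {
	// MetricNames is a non-empty list of full hive metric names. Strict
	// validation would reject a metric appearing in more than one entry.
	MetricNames []string `json:"metricNames"`
	// MinimumDuration, if set, suppresses observations below this threshold
	// (only meaningful for duration-based metrics).
	MinimumDuration *metav1.Duration `json:"minimumDuration,omitempty"`
	// AdditionalClusterDeploymentLabels maps prometheus label names to
	// ClusterDeployment label keys whose values are copied onto the metric.
	AdditionalClusterDeploymentLabels map[string]string `json:"additionalClusterDeploymentLabels,omitempty"`
	// ClusterDeploymentLabelSelector, if set, restricts reporting to
	// ClusterDeployments matching the selector.
	ClusterDeploymentLabelSelector *metav1.LabelSelector `json:"clusterDeploymentLabelSelector,omitempty"`
}
```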

2uasimojo (Member):

Thanks for doing the numbers. Given what you've computed, I'm really not worried about exceeding the k8s object size limit, even if we extend this to every metric we produce, and even if we add more later.

I'm also not worried about hiveconfig being awkward for humans to read. That ship has sailed, what with all the privatelink jazz in there. I'm fine with the user doing jq/yq wizardry if they really need something they can't just do a quick text search for.

That said, I do like your idea of being able to group filters by supplying a list of metric names. Let me take a closer look at the proposal and get back to you.

2uasimojo (Member):

[Later] Having seen it, I still like it, and don't see any obvious gotchas. Let's see what other reviewers think, but I'm inclined to go with it.

suhanime (Contributor, Author) left a comment:

@2uasimojo ready for another pass

@suhanime force-pushed the HIVE-2344 branch 2 times, most recently from 6c7e0b0 to 568b41b on January 26, 2024.
2uasimojo (Member) left a comment:

This is looking great. I think the only crucial issues at this stage are

  • Deciding whether to make all affected metrics off by default
  • Deciding on default behavior (if neither metrics config is provided) during the deprecation period.

bmeng (Contributor) commented on Feb 2, 2024:

Having read through the design doc, I do not think there will be any issue from the SREP side.

tzvatot commented:

Reading through this, I have several questions:

  1. Is there/should there be an ADR/DDR for this? I was working on https://docs.google.com/document/d/1lYK_S3LCt-gCwV9d4Y5Oe2FeBfQ_6C_EyudCRtOrw3k/edit#heading=h.bupciudrwmna which kind of overlaps.
  2. What will be the frequency of updating the metrics? CS is reconciling on the entire fleet, so we should be mindful of the performance penalty of getting a load of metrics updates simultaneously.
  3. While this may solve the hive domain, we still have the same challenge with ACM/Hypershift. It would be better to have a single way of reporting metrics (a new CRD?) that we can potentially re-use in the HCP domain. Having divergent solutions (hive, HCP, and whatever comes next) slows down the development process.
  4. I added a comment on the jira ticket referring to the examples provided there: https://issues.redhat.com/browse/HIVE-2344?focusedId=24065859&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24065859

2uasimojo (Member) replied on Feb 5, 2024:

Hi @tzvatot. Thank you for reviewing! TL;DR we expect this enhancement to be of little interest to CS, but wanted to get your eyes on it just in case :)

  1. I don't believe there's overlap with that ADR, which IIUC is looking to get away from using metrics to gather per-cluster data. That's appropriate IMO: metrics should aggregate trends, not report on individual clusters. That way lies cardinality hell.
  2. We update the metrics continuously. Some are updated in real time as we reconcile the objects that affect them. Others are updated periodically (every 2 minutes) as we poll those objects. In all cases there is a certain amount of delay built in, based on the scrape interval of the prometheus service. All standard prometheus patterns, I believe.
  3. I can't see a world where it makes sense to report metrics in a CRD, nor where it makes sense for two different services (hive and hypershift) to attempt to converge on design of specific metrics. I suspect we're conflating use cases here. It makes sense to converge on reporting metadata as described by the ADR you linked -- but that's a different thing entirely.
  4. Ack.

suhanime (Contributor, Author) left a comment:

@2uasimojo ready for another pass.
I think we still differ on whether all metrics should be made optional or not, and I personally would like to handle that separately. Let's discuss some more on that.

@suhanime force-pushed the HIVE-2344 branch 3 times, most recently from 6be1199 to 9a437db on February 21, 2024.
suhanime (Contributor, Author) left a comment:

@2uasimojo Hopefully this is it?


#### Failure modes

These situations will result in metrics not getting registered and hiveConfig.Status.Conditions[ReadyCondition] will be set to false:.
suhanime (Contributor, Author):

Okay, so I looked at your suggestion for registering the metric in the hiveOperator reconcile, but it wouldn't work correctly because we would need to share the registry with which we're registering the metric. I'm still unsure if we can make it work, and there's separate registration for HiveOperator and Hive controller metrics - so I didn't specify the implementation details here, just the result.

2uasimojo (Member):

Well, hold on, what harm would there actually be in registering them if we never report them? Doesn't the /metrics endpoint only show time series with values, or am I remembering that wrong?

2uasimojo (Member):

[Later] We could also create a separate registry just for this purpose, and throw it away.

2uasimojo (Member):

Re "metrics not getting registered", I think we'll actually end up just... not deploying the controllers at all, right? Since hive-operator will be erroring out of its reconcile loop? So indeed metrics wouldn't get registered... because the thing that would be registering them isn't running :)

suhanime (Contributor, Author):

I meant, I have never actually implemented registering the metric in one controller and reporting it via another - given that a lot of these implementation details are handled by the metrics library, we would have to try it to find out. Regardless of how we choose to implement it (yeah, I'm making this a future-me problem), the result will be the hiveConfig ReadyCondition getting set to false, so that's the only thing I'm choosing to mention in the enhancement.

2uasimojo (Member):

Ah, okay, this may help:

We would be registering the metrics in hive-operator only as a means to validate that they work properly.

We will also need to register them in the controllers. Those are separate processes running in separate pods, so they can't share one registration.

(But they can share source code, and should, since the registrations themselves should be identical in both places.)

Suggested change:
- These situations will result in metrics not getting registered and hiveConfig.Status.Conditions[ReadyCondition] will be set to false:.
+ These situations will result in hive-operator failing to deploy the controllers. It will set hiveConfig.Status.Conditions[ReadyCondition] to `"False"` with an appropriate error message.
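[Editor's note] A minimal sketch of the throwaway-registry idea floated above, assuming a registerAll helper (hypothetical name) whose source is shared between hive-operator and the controllers:

```go
package metricsvalidation

import "github.com/prometheus/client_golang/prometheus"

// validateRegistrations lets hive-operator check that every configured
// metric registers cleanly without serving any metrics itself: register
// everything against a scratch registry, report any error, and let the
// registry be garbage-collected. The controllers would call the same
// registerAll against their real registry.
func validateRegistrations(registerAll func(prometheus.Registerer) error) error {
	scratch := prometheus.NewRegistry() // separate registry, thrown away after use
	return registerAll(scratch)
}
```

On failure, hive-operator would surface the returned error on hiveConfig.Status.Conditions[ReadyCondition], per the suggestion above.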



### Risks and Mitigations
The biggest risk is the sheer number of metrics that are going to be affected. This would require thorough testing to ensure no metric changes its behaviour unexpectedly.
suhanime (Contributor, Author):

I removed the calculation for the total metrics affected. All duration-based metrics should support minimumDuration, regardless of whether they're related to cluster deployment or not, but the other 2 customizations can only be supported if we have a cluster deployment available. I decided to let those implementation details be clarified in the hive docs for metrics once we're done with these changes.

suhanime (Contributor, Author):

If it matters: 17 metrics will now support minimumDuration; 6 of them already have it, so that's 11 more than we already do. The rest of the calculations stay the same - 19 more will have additional label support and 29 will support label selection.
We can report 52 total metrics as of now, so 46 of them will now become optional.
We originally wanted the math because we were worried about exceeding the size of hiveconfig; since that's no longer an issue, there's no point calling it out.

2uasimojo (Member) left a comment:

Yup, this is getting real close. Only really one substantive change (schema compat discrepancy I missed when I suggested the current hiveconfig shape).

suhanime (Contributor, Author) left a comment:

@2uasimojo Final Pass?

2uasimojo (Member) left a comment:

If you agree with my suggestions, we can open this up for broader review now.

Comment on lines 114 to 137
There can only be 1 entry for `metricsToReport`, however it can have multiple entries for `metricNames`.

#### metricNames
`metricNames` would be a non-empty list of hive metrics that need to be reported, and can have optional customizations.
All customizations listed for an entry of `metricNames` will apply to all the metrics provided in the corresponding list. This allows for grouping filters for a list of relevant metrics.
While multiple entries of `metricNames` are allowed within `metricsToReport`, no duplicate entries of a hive metric will be allowed in `metricsToReport` inorder to avoid ambiguity.
Implementation will be adapting the existing setup for optional metrics. However, instead of a shorthand camelCase key that is used for current `hiveconfig.spec.metricsConfig.metricsWithDuration`, we would now need the full metric name listed under `metricNames`.
In the example above, hive would only log `hive_foo_counter`, `hive_foo_gauge` and `hive_foo_histogram` metrics.

#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport.metricNames[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.

#### additionalClusterDeploymentLabels
Deprecate current implementation of hiveConfig.spec.metricsConfig.additionalClusterDeploymentLabels and change it to be reported as metricsConfig.metricsToReport.metricNames[].additionalClusterDeploymentLabels. Its implementation will not change.
In the example above, `hive_foo_counter` and `hive_foo_gauge` would report an additional label `prom_label_name`, its value corresponding to the value of `hive.openshift.io/cd-label-key` label on the corresponding clusterDeployment.

#### clusterDeploymentLabelSelector
This would be a new feature, of type [LabelSelector](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#LabelSelector), and we'd use the LabelSelector.MatchLabels and/or LabelSelector.MatchExpressions to match the conditions in order to decide if a metric should be reported.
This encapsulates slightly more advanced filter logic over the existing clusterDeployment labels. In the example above, `hive_foo_counter` and `hive_foo_gauge` metrics will only be reported for the clusterDeployments that are labelled with aro-snowflake and not in limited support.

### Implementation Details / Notes

- All the options configured for a metric within metricNames, will work in tandem with each other. For ex, if all possible options are specified for a metric, then that metric will be reported with the additional labels as per additionalClusterDeploymentLabels, and will be reported only if it matches the labels and/or expressions as per clusterDeploymentLabelSelector and if the duration to be reported exceeds the minimumDuration
2uasimojo (Member):

Again, the reader will probably "get it", but this still isn't quiiite right. Let me see if I can reword...

Suggested change (replacing the quoted block above with):
`metricsToReport` is a list, each element of which requests reporting and configures filtering and customizations for the metrics provided in its `metricNames` list.
#### metricNames
`metricNames` would be a non-empty list of hive metrics that need to be reported, and can have optional customizations.
All customizations listed for an entry of `metricsToReport` will apply to all the metrics provided in that entry's `metricNames` list. This allows for grouping filters for a list of relevant metrics.
A metric name must appear at most once across all `metricsToReport[].metricNames[]` in order to avoid ambiguity.
Implementation will be adapting the existing setup for optional metrics. However, instead of a shorthand camelCase key that is used for current `hiveconfig.spec.metricsConfig.metricsWithDuration`, we would now need the full metric name listed under `metricNames`.
In the example above, hive would only log `hive_foo_counter`, `hive_foo_gauge` and `hive_foo_histogram` metrics.
#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.
#### additionalClusterDeploymentLabels
Deprecate current implementation of hiveConfig.spec.metricsConfig.additionalClusterDeploymentLabels and change it to be reported as metricsConfig.metricsToReport[].additionalClusterDeploymentLabels. Its implementation will not change.
In the example above, `hive_foo_counter` and `hive_foo_gauge` would report an additional label `prom_label_name`, its value corresponding to the value of `hive.openshift.io/cd-label-key` label on the corresponding clusterDeployment.
#### clusterDeploymentLabelSelector
This would be a new feature, of type [LabelSelector](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#LabelSelector), and we'd use the LabelSelector.MatchLabels and/or LabelSelector.MatchExpressions to match the conditions in order to decide if a metric should be reported.
This encapsulates slightly more advanced filter logic over the existing clusterDeployment labels. In the example above, `hive_foo_counter` and `hive_foo_gauge` metrics will only be reported for the clusterDeployments that are labelled with aro-snowflake and not in limited support.
### Implementation Details / Notes
- All the options configured for a `metricsToReport` entry will work in tandem with each other for each metric listed in its `metricNames`. For ex, if all possible options are specified for a metric, then that metric will be reported with the additional labels as per additionalClusterDeploymentLabels, and will be reported only if it matches the labels and/or expressions as per clusterDeploymentLabelSelector and if the duration to be reported exceeds the minimumDuration
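[Editor's note] The enhancement doc's "example above" isn't visible in this thread. Purely as illustration, a configuration consistent with these descriptions might look like the following, reusing the sketch types from earlier: the hive_foo_* names are the doc's placeholders, the selector label keys are invented, and minimumDuration sits on the gauge/histogram group per the later comment in this thread that counters have no duration.

```go
package v1

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleMetricsConfig: one grouped entry whose customizations apply to both
// listed metrics, plus a second entry that merely switches a counter on.
var exampleMetricsConfig = MetricsConfig{
	MetricsToReport: []MetricsToReportEntry{
		{
			MetricNames:     []string{"hive_foo_gauge", "hive_foo_histogram"},
			MinimumDuration: &metav1.Duration{Duration: 10 * time.Minute},
			AdditionalClusterDeploymentLabels: map[string]string{
				// prometheus label name -> ClusterDeployment label key
				"prom_label_name": "hive.openshift.io/cd-label-key",
			},
			ClusterDeploymentLabelSelector: &metav1.LabelSelector{
				// "labelled with aro-snowflake and not in limited support";
				// both label keys are invented for this sketch
				MatchLabels: map[string]string{"aro-snowflake": "true"},
				MatchExpressions: []metav1.LabelSelectorRequirement{{
					Key:      "limited-support",
					Operator: metav1.LabelSelectorOpDoesNotExist,
				}},
			},
		},
		{MetricNames: []string{"hive_foo_counter"}}, // no duration-based filtering
	},
}
```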


Comment on lines 152 to 154
- metricsToReport.metricNames[$name] doesn't exist as a metric.
- metricsToReport.metricNames[$name] is a metric without duration but minimumDuration was specified.
- there are multiple entries of metricsToReport.metricNames[$name].
2uasimojo (Member):

Suggested change (replacing the quoted block above with):
- `metricsToReport[].metricNames[$name]` doesn't exist as a metric.
- `metricNames` lists a metric without duration but `minimumDuration` is specified in the same `metricsToReport` entry.
- the same metric is mentioned more than once across all `metricsToReport[].metricNames[]`.
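[Editor's note] As a sketch of how that "at most once" rule could be enforced (function name invented, types from the earlier sketch):

```go
package v1

import "fmt"

// validateNoDuplicateMetricNames rejects configs where the same metric is
// mentioned more than once across all metricsToReport[].metricNames[].
func validateNoDuplicateMetricNames(cfg MetricsConfig) error {
	seen := map[string]bool{}
	for _, entry := range cfg.MetricsToReport {
		for _, name := range entry.MetricNames {
			if seen[name] {
				return fmt.Errorf("metric %q listed more than once across metricsToReport", name)
			}
			seen[name] = true
		}
	}
	return nil
}
```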

suhanime (Contributor, Author) commented on Mar 1, 2024:

@2uasimojo Implemented all your suggestions and rebased, this is ready for a broader review.

@bmeng, @tzvatot and @hongkailiu Can you please review this again? We made some changes; the biggest one is that we have decided to make all metrics optional (i.e., hive will not report any metrics by default after the deprecation period).

@berenss Is it okay to assume ACM doesn't use Hive metrics, hence it won't be affected by these changes?

@maorfr Would you be able to help find an appropriate reviewer from AppSRE?

2uasimojo (Member):
Thanks @suhanime! This lgtm to open for broader review.

suhanime (Contributor, Author):
@dustman9000 @patjlm @janboll Added you to the Reviewers. Please review at your earliest convenience.

@berenss and @tzvatot confirmed no impact to ACM and OCM

Waiting for the go-ahead from SREP, App-SRE and @hongkailiu

- [Risks and Mitigations](#risks-and-mitigations)

## Summary
As a Hive user interested in customizing metrics, I would like to specify customizations for individual metrics.
A contributor commented:

The problem statement is not entirely clear to me. It seems like there are multiple issues simmering regarding "how to effectively operate hive", and they are all trying to be solved through how hive produces metrics.

I think for an effective metrics solution, we need a concrete operational narrative that defines the role of metrics within it. Currently metrics are being proposed as solutions to:

  • observability
  • alerting
  • troubleshooting (inferring this)

IMO, stakeholder expectations need to be defined and agreed upon, including those that are not RH SRE. Then we can assess the solution against the problem statement.

suhanime (Contributor, Author) replied:

I changed the summary to be more precise - does it help?
I can see your point about having a concrete operational narrative around metrics, but to be clear, hive's job is to publish the metrics, and it is up to the consumer how they want to use them. You can very well use the same metric for both observability and alerting. I don't think prometheus and metrics are the solution for troubleshooting, though you can definitely use them to notice patterns.
I would also want to point out that, as part of this enhancement, we're just changing some internal code to make things easier - we're not introducing any new metrics. We've also had it reviewed by all major consumers of hive.


#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport.metricNames[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.
A reviewer commented:

Isn't a counter just being incremented? Which value do you compare here?

suhanime (Contributor, Author):

My bad - a counter can't have a minimumDuration, so I switched it out in the example for a histogram.

- if deprecated methods are used along with the new metricsToReport. You can either choose the _old_ way or the _new_ way.


### Risks and Mitigations
A reviewer commented:

Do you have processes or tools in mind to prevent accidentally enabling a lot of metrics?
Or, asked another way: can you predict the impact/outcome of a config change?

suhanime (Contributor, Author):

I understand the intent of asking this, but the entire intent of this proposal is to put the power in the admin's hands.
I see some 50+ metrics in our docs; only 7 of them are optional right now and not logged by default, so there isn't a risk of logging too many metrics.
Our biggest concern is usually cardinality - we can document our concerns and warnings well, and we can also include warnings in the docstrings of HiveConfig or maybe the logs (though I'm not sure anyone checks those warnings in logs).
As far as predicting the impact of a config change goes, no, we can't - not to the extent you're asking. Real concerns come into play only when a hive instance maintains too many clusters and/or you really want to enforce a per-cluster label on metrics - which simply isn't a good design and not what prometheus or metrics are meant for.
Since we're offering avenues where there could potentially be concerns, we're also offering the countermeasures of applying a minimum threshold or label selectors - or the option of not publishing the metrics you do not need - so we can lessen the chances of the metrics blowing up.

- Make all hive metrics optional
- Instead of throwing a panic, update hiveconfig ready condition to
  false.
- Add LabelSelector support for clusterDeployment related metrics
- Expand support of minimumDuration and
  additionalClusterDeploymentLabels for all eligible metrics
As a Hive user interested in consuming the metrics it publishes, I would like to be able to configure the metrics as per my needs.
This includes the ability to choose which metrics are published, reduce the amount of observations for certain metrics unless it meets configured conditions, and the ability to add labels to the reported metrics for grouping and filtering purposes.

## Motivation
2uasimojo (Member):

Excellent update here, thanks!

2uasimojo (Member):
/lgtm

We may find little things to tweak/add as we implement, but I think this is ready to land. Nice work!

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on May 10, 2024.
openshift-ci bot commented on May 10, 2024:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, suhanime

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot bot merged commit 83aedb9 into openshift:master on May 10, 2024.
3 checks passed