Enhancement for metricsConfig redesign #2182
Conversation
Skipping CI for Draft Pull Request.
Good stuff @suhanime, thank you.
It can get confusing for consumers to know the fixed labels each metric has. Update the [hive_metrics](https://github.com/openshift/hive/blob/master/docs/hive_metrics.md) doc to list out all labels and keep it up-to-date.

Also, the more customizations there are, the longer hiveConfig will be. Currently, we view this as an acceptable change.
Mm, this is a good point. There's a limit on the size of a k8s object. We should probably do a little bit of back-of-napkin math to see roughly what order of magnitude this config could be for scenarios where e.g. every metric gets configured for platform and version labels.
Today on hivep02ue1 (all prod shards should be identical, or near enough as makes no difference):
$ oc get hiveconfig hive -o yaml | wc -c
77278
So, a k8s object can be about 3MB in size, and given the size:lines ratio of the prod and staging hiveconfigs, if we assume ~100 bytes per line, that allows up to roughly 30,000 lines for hiveconfig. Right now, one of those is already at 800+ lines.
My rough calculation shows about 29 metrics that would be classified as clusterDeploymentMetrics, and if we assume filters as in the example (say 15 lines per metric), we're clocking in at 400+ extra lines if all affected metrics are defined. While we can assume we won't exceed the size restraints of a k8s object, the hiveconfig would certainly not be easy to read, and we'd have to be mindful of this if we ever plan to add more customization for other metrics not related to clusterDeployment (like clusterPool, syncSets, selectorSyncSets, hive operator and controller related metrics).
I have proposed grouping metrics per customization, so instead of each entry of clusterDeploymentRelatedMetrics being an individual metric, we can instead provide a list of metrics. This would make the implementation trickier, but it would be a more robust design, and of course, the admin is free to provide just 1 metric instead of an entire list.
This would also make it easy for admins to group filters (like "do X for these aro clusters for these Y metrics we care about"). We'd need strict validation to ensure there are no ambiguous entries.
What do you think?
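A sketch of what that grouped shape could look like (field names follow the `metricsToReport`/`metricNames` scheme discussed later in this proposal; the metric names and values are purely illustrative):

```yaml
spec:
  metricsConfig:
    metricsToReport:
    # One entry's customizations apply to every metric in its metricNames list.
    - metricNames:
      - hive_foo_gauge
      - hive_foo_histogram
      minimumDuration: 10m
      additionalClusterDeploymentLabels:
        prom_label_name: hive.openshift.io/cd-label-key
      clusterDeploymentLabelSelector:
        matchLabels:
          aro-snowflake: "true"
    # An entry may also list a single metric with no customizations.
    - metricNames:
      - hive_foo_counter
```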
Thanks for doing the numbers. Given what you've computed, I'm really not worried about exceeding the k8s object size limit, even if we extend this to every metric we produce, and even if we add more later.
I'm also not worried about hiveconfig being awkward for humans to read. That ship has sailed, what with all the privatelink jazz in there. I'm fine with the user doing jq/yq wizardry if they really need something they can't just do a quick text search for.
That said, I do like your idea of being able to group filters by supplying a list of metric names. Let me take a closer look at the proposal and get back to you.
[Later] Having seen it, I still like it, and don't see any obvious gotchas. Let's see what other reviewers think, but I'm inclined to go with it.
@2uasimojo ready for another pass
Force-pushed 6c7e0b0 to 568b41b.
This is looking great. I think the only crucial issues at this stage are
- Deciding whether to make all affected metrics off by default
- Deciding on default behavior (if neither metrics config is provided) during the deprecation period.
Having read through the design doc, I do not think there would be any issue from the SREP side.
Reading through this, I have several questions:
- Is there/should there be an ADR/DDR for this? I was working on https://docs.google.com/document/d/1lYK_S3LCt-gCwV9d4Y5Oe2FeBfQ_6C_EyudCRtOrw3k/edit#heading=h.bupciudrwmna which kind of overlaps.
- What will be the frequency of updating the metrics? CS is reconciling on the entire fleet, so we should be mindful of the performance penalty of getting a load of metric updates simultaneously.
- While this may solve the hive domain, we still have the same challenge with ACM/Hypershift. It would be better to have a single way of reporting metrics (a new CRD?) that we can potentially re-use in the HCP domain. Having divergent solutions (hive, HCP, and whatever comes next) slows down the development process.
- Added a comment on the jira ticket referring to the examples provided there: https://issues.redhat.com/browse/HIVE-2344?focusedId=24065859&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24065859
Hi @tzvatot. Thank you for reviewing! TL;DR we expect this enhancement to be of little interest to CS, but wanted to get your eyes on it just in case :)
- I don't believe there's overlap with that ADR, which IIUC is looking to get away from using metrics to gather per-cluster data. That's appropriate IMO: metrics should aggregate trends, not report on individual clusters. That way lies cardinality hell.
- We update the metrics continuously. Some are updated in real time as we reconcile objects that affect them. Others are updated periodically (every 2m) as we poll those objects. In all cases there is a certain amount of delay built in, based on the scrape interval of the prometheus service. All standard prometheus patterns, I believe.
- I can't see a world where it makes sense to report metrics in a CRD, nor where it makes sense for two different services (hive and hypershift) to attempt to converge on design of specific metrics. I suspect we're conflating use cases here. It makes sense to converge on reporting metadata as described by the ADR you linked -- but that's a different thing entirely.
- Ack.
@2uasimojo ready for another pass.
I think we still differ on whether all metrics should be made optional or not, and I personally would like to handle that separately. Let's discuss some more on that.
Force-pushed 6be1199 to 9a437db.
@2uasimojo Hopefully this is it?
#### Failure modes
These situations will result in metrics not getting registered and hiveConfig.Status.Conditions[ReadyCondition] will be set to false:
Okay so I looked at your suggestion for registering the metric in hiveOperator reconcile, but it wouldn't work correctly because we would need to share the registry with which we're registering the metric. I'm still unsure if we can make it work, and then there's separate registration for HiveOperator and Hive controller metrics - so I didn't specify the implementation details here, just the result.
Well, hold on, what harm would there actually be in registering them if we never report them? Doesn't the /metrics endpoint only show time series with values, or am I remembering that wrong?
[Later] We could also create a separate registry just for this purpose, and throw it away.
Re "metrics not getting registered", I think we'll actually end up just... not deploying the controllers at all, right? Since hive-operator will be erroring out of its reconcile loop? So indeed metrics wouldn't get registered... because the thing that would be registering them isn't running :)
I meant, I have never actually implemented registering the metric in one controller and reporting it via another - given that a lot of these implementation details are handled by the metrics library, we would have to try it to know it. Regardless of how we choose to implement it (yeah, I'm making this a future-me problem), the result will be the hiveConfig ReadyCondition getting set to false, so that's the only thing I'm choosing to mention in the enhancement.
Ah, okay, this may help:
We would be registering the metrics in hive-operator only as a means to validate that they work properly.
We will also need to register them in the controllers. Those are separate processes running in separate pods, so they can't share one registration.
(But they can share source code, and should, since the registrations themselves should be identical in both places.)
Suggested change:
These situations will result in hive-operator failing to deploy the controllers. It will set hiveConfig.Status.Conditions[ReadyCondition] to `"False"` with an appropriate error message.
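The shared-source-code registration pattern described above could be sketched roughly like this. This is a stdlib-only sketch: in hive this would use `prometheus.Registerer` and client_golang collector types, and all the names here (`Registerer`, `RegisterConfiguredMetrics`, the metric names) are hypothetical stand-ins.

```go
package main

import "fmt"

// Registerer mirrors the shape of the registration interface both
// processes need. hive-operator and the controllers each own their
// own registry; hive-operator could also build a throwaway one used
// purely to validate the configured metrics, then discard it.
type Registerer interface {
	Register(name string) error
}

// mapRegistry is a toy registry that rejects duplicate registrations,
// the same way a real prometheus registry would.
type mapRegistry struct{ seen map[string]bool }

func (r *mapRegistry) Register(name string) error {
	if r.seen[name] {
		return fmt.Errorf("duplicate metric %q", name)
	}
	r.seen[name] = true
	return nil
}

// RegisterConfiguredMetrics is the shared code path: identical metric
// definitions, registered into whichever registry the caller owns.
func RegisterConfiguredMetrics(reg Registerer, names []string) error {
	for _, n := range names {
		if err := reg.Register(n); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// hive-operator side: validate against a throwaway registry.
	throwaway := &mapRegistry{seen: map[string]bool{}}
	err := RegisterConfiguredMetrics(throwaway, []string{"hive_foo_gauge", "hive_foo_gauge"})
	fmt.Println(err != nil) // the duplicate registration is caught
}
```

The point is only that the registration list lives in one shared package, while each process (operator pod, controllers pod) performs its own registration against its own registry.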
### Risks and Mitigations
The biggest risk is the sheer number of metrics that are going to be affected. This would require thorough testing to ensure no metric changes its behaviour unexpectedly.
I removed the calculation for the total metrics affected. All duration-based metrics should support minimumDuration, regardless of whether they're related to cluster deployment or not, but the other 2 customizations can only be supported if we have a cluster deployment available. I decided to let those implementation details be clarified in the hive docs for metrics once we're done with these changes.
If it matters, now 17 metrics will support minimumDuration, 6 of them already have it so 11 more than we already do. Rest of the calcs stay the same - 19 more will have additional label support and 29 will support label selection.
52 total metrics that we can report as of now, so 46 of them will now become optional.
We originally wanted the math because we were worried about exceeding the size of hiveconfig; since that's no longer an issue, there's no point calling it out.
Yup, this is getting real close. Only really one substantive change (schema compat discrepancy I missed when I suggested the current hiveconfig shape).
@2uasimojo Final Pass?
If you agree with my suggestions, we can open this up for broader review now.
There can only be 1 entry for `metricsToReport`, however it can have multiple entries for `metricNames`.

#### metricNames
`metricNames` would be a non-empty list of hive metrics that need to be reported, and can have optional customizations.
All customizations listed for an entry of `metricNames` will apply to all the metrics provided in the corresponding list. This allows for grouping filters for a list of relevant metrics.
While multiple entries of `metricNames` are allowed within `metricsToReport`, no duplicate entries of a hive metric will be allowed in `metricsToReport` in order to avoid ambiguity.
Implementation will be adapting the existing setup for optional metrics. However, instead of the shorthand camelCase key used for the current `hiveconfig.spec.metricsConfig.metricsWithDuration`, we would now need the full metric name listed under `metricNames`.
In the example above, hive would only log the `hive_foo_counter`, `hive_foo_gauge` and `hive_foo_histogram` metrics.

#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport.metricNames[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.

#### additionalClusterDeploymentLabels
Deprecate the current implementation of hiveConfig.spec.metricsConfig.additionalClusterDeploymentLabels and change it to be reported as metricsConfig.metricsToReport.metricNames[].additionalClusterDeploymentLabels. Its implementation will not change.
In the example above, `hive_foo_counter` and `hive_foo_gauge` would report an additional label `prom_label_name`, its value corresponding to the value of the `hive.openshift.io/cd-label-key` label on the corresponding clusterDeployment.

#### clusterDeploymentLabelSelector
This would be a new feature, of type [LabelSelector](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#LabelSelector), and we'd use the LabelSelector.MatchLabels and/or LabelSelector.MatchExpressions to match the conditions in order to decide if a metric should be reported.
This encapsulates slightly more advanced filter logic over the existing clusterDeployment labels. In the example above, the `hive_foo_counter` and `hive_foo_gauge` metrics will only be reported for the clusterDeployments that are labelled with aro-snowflake and not in limited support.

### Implementation Details / Notes

- All the options configured for a metric within metricNames will work in tandem with each other. For example, if all possible options are specified for a metric, then that metric will be reported with the additional labels as per additionalClusterDeploymentLabels, and will only be reported if it matches the labels and/or expressions as per clusterDeploymentLabelSelector and if the duration to be reported exceeds the minimumDuration.
Again, the reader will probably "get it", but this still isn't quiiite right. Let me see if I can reword...
Suggested change:
`metricsToReport` is a list, each element of which requests reporting and configures filtering and customizations for the metrics provided in its `metricNames` list.

#### metricNames
`metricNames` would be a non-empty list of hive metrics that need to be reported, and can have optional customizations.
All customizations listed for an entry of `metricsToReport` will apply to all the metrics provided in that entry's `metricNames` list. This allows for grouping filters for a list of relevant metrics.
A metric name must appear at most once across all `metricsToReport[].metricNames[]` in order to avoid ambiguity.
Implementation will be adapting the existing setup for optional metrics. However, instead of the shorthand camelCase key used for the current `hiveconfig.spec.metricsConfig.metricsWithDuration`, we would now need the full metric name listed under `metricNames`.
In the example above, hive would only log the `hive_foo_counter`, `hive_foo_gauge` and `hive_foo_histogram` metrics.

#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.

#### additionalClusterDeploymentLabels
Deprecate the current implementation of hiveConfig.spec.metricsConfig.additionalClusterDeploymentLabels and change it to be reported as metricsConfig.metricsToReport[].additionalClusterDeploymentLabels. Its implementation will not change.
In the example above, `hive_foo_counter` and `hive_foo_gauge` would report an additional label `prom_label_name`, its value corresponding to the value of the `hive.openshift.io/cd-label-key` label on the corresponding clusterDeployment.

#### clusterDeploymentLabelSelector
This would be a new feature, of type [LabelSelector](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#LabelSelector), and we'd use the LabelSelector.MatchLabels and/or LabelSelector.MatchExpressions to match the conditions in order to decide if a metric should be reported.
This encapsulates slightly more advanced filter logic over the existing clusterDeployment labels. In the example above, the `hive_foo_counter` and `hive_foo_gauge` metrics will only be reported for the clusterDeployments that are labelled with aro-snowflake and not in limited support.

### Implementation Details / Notes

- All the options configured for a `metricsToReport` entry will work in tandem with each other for each metric listed in its `metricNames`. For example, if all possible options are specified for a metric, then that metric will be reported with the additional labels as per additionalClusterDeploymentLabels, and will only be reported if it matches the labels and/or expressions as per clusterDeploymentLabelSelector and if the duration to be reported exceeds the minimumDuration.
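The clusterDeploymentLabelSelector matching described above can be illustrated with a stdlib-only sketch. The real implementation would convert the `metav1.LabelSelector` with apimachinery's `LabelSelectorAsSelector` and call `Matches` on the clusterDeployment's labels; the types and names below are simplified stand-ins for illustration only.

```go
package main

import "fmt"

// Requirement is a simplified stand-in for one matchExpressions entry.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists", "DoesNotExist"
	Values   []string
}

func contains(vals []string, v string) bool {
	for _, x := range vals {
		if x == v {
			return true
		}
	}
	return false
}

// Matches mimics LabelSelector semantics: all matchLabels pairs and
// all matchExpressions must hold (they are ANDed together).
func Matches(matchLabels map[string]string, exprs []Requirement, cdLabels map[string]string) bool {
	for k, v := range matchLabels {
		if cdLabels[k] != v {
			return false
		}
	}
	for _, r := range exprs {
		val, ok := cdLabels[r.Key]
		switch r.Operator {
		case "In":
			if !ok || !contains(r.Values, val) {
				return false
			}
		case "NotIn":
			if ok && contains(r.Values, val) {
				return false
			}
		case "Exists":
			if !ok {
				return false
			}
		case "DoesNotExist":
			if ok {
				return false
			}
		}
	}
	return true
}

func main() {
	cd := map[string]string{"aro-snowflake": "true"}
	// Report only for aro-snowflake clusters that are not in limited support.
	ok := Matches(
		map[string]string{"aro-snowflake": "true"},
		[]Requirement{{Key: "limited-support", Operator: "DoesNotExist"}},
		cd,
	)
	fmt.Println(ok) // true: labelled aro-snowflake, no limited-support label
}
```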
- metricsToReport.metricNames[$name] doesn't exist as a metric.
- metricsToReport.metricNames[$name] is a metric without duration but minimumDuration was specified.
- there are multiple entries of metricsToReport.metricNames[$name].
Suggested change:
- `metricsToReport[].metricNames[$name]` doesn't exist as a metric.
- `metricNames` lists a metric without duration but `minimumDuration` is specified in the same `metricsToReport` entry.
- the same metric is mentioned more than once across all `metricsToReport[].metricNames[]`.
@2uasimojo Implemented all your suggestions and rebased, this is ready for a broader review. @bmeng, @tzvatot and @hongkailiu Can you please review this again? We made some changes, the biggest one being that we have decided to make all metrics optional (aka hive will not report any metric by default after the deprecation period). @berenss Is it okay to assume ACM doesn't use Hive metrics, and hence won't be affected by these changes? @maorfr Would you be able to help find an appropriate reviewer from AppSRE?

Thanks @suhanime! This lgtm to open for broader review.

@dustman9000 @patjlm @janboll Added you to the Reviewers. Please review at your earliest convenience. @berenss and @tzvatot confirmed no impact to ACM and OCM. Waiting for the go-ahead from SREP, App-SRE and @hongkailiu.
- [Risks and Mitigations](#risks-and-mitigations)

## Summary
As a Hive user interested in customizing metrics, I would like to specify customizations for individual metrics.
Problem statement is not entirely clear to me. It seems like there are multiple issues simmering regarding "how to effectively operate hive", and they are all trying to be solved through how hive produces metrics.
I think for an effective metrics solution, we need a concrete operational narrative that defines the role of metrics within it. Currently metrics are being proposed as solutions to:
- observability
- alerting
- troubleshooting (inferring this)
IMO, stakeholder expectations need to be defined and agreed upon, including those that are not RH SRE. Then we can assess the solution against the problem statement.
I changed the summary to be more precise - does it help?
I can see your point about having a concrete operational narrative around metrics, but to be clear, hive's job is to publish the metrics, and it is up to the consumer how they want to use them. You can very well use the same metric for both observability and alerting. I don't think prometheus and metrics are the solution for troubleshooting, though you can definitely use them to notice patterns.
I would also want to point out - as a part of this enhancement, we're just changing some internal code to make things easier - we're not introducing any new metrics. We've also gotten it reviewed from all major consumers of hive.
#### minimumDuration
Deprecate the current implementation of hiveConfig.spec.metricsConfig.metricsWithDuration, and change it to be reported as metricsConfig.metricsToReport.metricNames[].minimumDuration. The implementation of using the duration as a threshold before we report the metric stays the same.
In the example above, `hive_foo_counter` and `hive_foo_gauge` will only be logged if the value reported for them exceeds 10 minutes.
Isn't a counter just being incremented? Which value do you compare here?
My bad - a counter can't have a minimumDuration - I switched it out in the example with a histogram.
- if deprecated methods are used along with the new metricsToReport. You can either choose the _old_ way or the _new_ way.

### Risks and Mitigations
Do you have processes or tools in mind to prevent accidentally enabling a lot of metrics?
Or, asked another way, can you predict the impact/outcome of a config change?
I understand the intent of asking this, but the entire point of this proposal is to put the power in the admin's hands.
I see some 50+ metrics in our docs; only 7 of them are optional right now and not logged by default, so there isn't a risk of logging too many metrics.
Our biggest concern is usually cardinality - we can document our concerns and warnings well, and we can also include warnings in the docstrings of HiveConfig, or maybe in the logs (though I'm not sure anyone checks those warnings in logs).
As far as the prediction of the impact of a config change goes, no we can't - not to the extent you're asking. Real concerns come into play only when a hive instance maintains too many clusters and/or you really want to enforce a per-cluster label on metrics - which simply isn't a good design and not what prometheus or metrics are meant for.
Since we're offering avenues where there could potentially be concerns, we're also offering the countermeasures of applying a minimum threshold or label selectors - or the option of not publishing the metrics you do not need - so we can lessen the chances of the metrics blowing up.
- Make all hive metrics optional
- Instead of throwing a panic, update hiveconfig ready condition to false.
- Add LabelSelector support for clusterDeployment related metrics
- Expand support of minimumDuration and additionalClusterDeploymentLabels for all eligible metrics
As a Hive user interested in consuming the metrics it publishes, I would like to be able to configure the metrics as per my needs.
This includes the ability to choose which metrics are published, reduce the amount of observations for certain metrics unless configured conditions are met, and the ability to add labels to the reported metrics for grouping and filtering purposes.

## Motivation
Excellent update here, thanks!
/lgtm We may find little things to tweak/add as we implement, but I think this is ready to land. Nice work!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: 2uasimojo, suhanime
xref: https://issues.redhat.com/browse/HIVE-2344
/assign @2uasimojo