-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MON-3752: expose metric-denylist for KSM #2283
base: master
Are you sure you want to change the base?
Conversation
@rexagod: This pull request references MON-3752 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rexagod The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@rexagod: This pull request references MON-3752 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/jira refresh |
@rexagod: This pull request references MON-3752 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
9ba7f76
to
c7e7acc
Compare
/test e2e-agnostic-operator |
The description shows the current behavior, which is to replace the deny-list with the one is specified by the user (CMO configuration deny-list takes precedence over the default one). However, owing to this, users could very well find themselves (a) denying certain default metrics that various alerts and dashboards depend upon, or (b) enabling certain default metrics that could cause cardinality issues, |
Adding high cardinality metrics is a concern, but I'm more concerned about users removing metrics that we rely on in dashboards and alerts. |
1bcecea
to
4cdeaf7
Compare
pkg/tasks/kubestatemetrics.go
Outdated
@@ -95,6 +95,21 @@ func (t *KubeStateMetricsTask) Run(ctx context.Context) error { | |||
return fmt.Errorf("reconciling kube-state-metrics Deployment failed: %w", err) | |||
} | |||
|
|||
prometheusK8sTokenSecret, err := t.client.GetTokenSecret(ctx, dep.Namespace, "prometheus-k8s") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've used the prometheus-k8s-token-...
secret to query the /metrics
endpoint, since that was allowed by KRP.
pkg/manifests/manifests.go
Outdated
} | ||
|
||
// Query the endpoint. | ||
t := time.NewTimer(5 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm usually not a huge fan of hardcoding times within the code, but I see the point. How about setting a var for this in any case? I know it's only used there but it'd be cool from a readability standpoint and for a latter change if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add to this, is a Timer the right tool here? Wouldn't the worst case here be a hot loop creating http requests for 5 seconds?
Wouldn't a Poll variant with timeout and maybe backoff make sense here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL.
pkg/manifests/manifests.go
Outdated
Path: "/metrics", | ||
} | ||
client := &http.Client{ | ||
Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it too complicated here to find the right TLS artifacts?
The metrics-client-certs
secret should also have mTLS artifacts, this code wouldn't have to retrieve a bearer token for auth I think. Though there might be some fault tolerance implications when using mTLS only. @simonpasquier wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it too complicated here to find the right TLS artifacts?
I missed incorporating this, but I think I might be able to utilize metrics-client-certs
for the TLS certificates, as you mentioned, and perhaps metrics-client-ca
for the CA. I'll take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a side note, I believe we should drop the following from the TLS cipher suite (for all components).
W0416 13:58:20.519009 1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0416 13:58:20.519041 1 secure_serving.go:69] Use of insecure cipher TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind raising a bug for us to address this?
pkg/manifests/manifests.go
Outdated
// KubeStateMetricsDenylistBoundsCheck ensures that the user is: | ||
// * able to enable metrics that are denied by default, and, | ||
// * unable to disable metrics that are enabled by default. | ||
func (f *Factory) KubeStateMetricsDenylistBoundsCheck(deployment *appsv1.Deployment, service *v1.Service, secret *v1.Secret) (*appsv1.Deployment, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever idea :) but I have some concerns:
- Making an operator depend on one of its operands doesn't seem that common and natural to me. Here, for example, if CMO cannot reach out to KSM's
/metrics
, it'll abort its task and not try to "help it". This could be fixed, but maybe there will be other deadlocks like this. - Does KSM always expose all the metrics that it can expose? I mean, does it initialize all the metrics? If not,
/metrics
will not really be a source of truth: A user could drop a not-there-yet metric. - The code seems a little bit hacky as we're re-implementing some parts of a scraper, and I'm concerned about its maintainability.
I don't have a better alternative but I see that for the majority of metrics in the current --metric-denylist
we specify the exact metrics. How about expanding the ones with xxx_.+_yyy
and just allowing users to select from that list? But maybe it'll be harder to maintain. (we can still have tests with xxx_.+_yyy
to be sure we take new metrics we want to drop into account)
Maybe there is also something to be done in connection with collection profiles...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making an operator depend on one of its operands doesn't seem that common and natural to me. Here, for example, if CMO cannot reach out to KSM's /metrics, it'll abort its task and not try to "help it". This could be fixed, but maybe there will be other deadlocks like this.
Ah, I believe CMO itself won't lean on KSM anymore than it did before this patch. A bounds-check for an intrinsic property (--metric-denylist
) of the operand is nothing more than its own behavior being handled by CMO, same as any other non-repairable (by choice or by trial) error it'd exhibit (for instance, an invalid CRS config). We are failing KSM reconciliation and throwing that error without making any assumptions or implicit conversions to be explicit towards the user on CMO's resolution of their supplied deny-list. PLMK if I missed something.
Does KSM always expose all the metrics that it can expose? I mean, does it initialize all the metrics? If not, /metrics will not really be a source of truth: A user could drop a not-there-yet metric.
Since help-texts will always be present in the exposition data, so we should be safe.
[...] re-implementing some parts of a scraper [...]
Do we already have helpers for this in CMO? The single-scrape logic essentially crafts a request and interprets it, while pinging the /metrics
endpoint with the necessary auth permissions. It seemed like a straightforward implementation, but feel free to drop any suggestions (in addition to Jan's comment above, as I've made those changes locally but I want to solve the auth issue before I push again) that may abstract this and reduce maintainability costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I believe CMO itself won't lean on KSM anymore than it did before this patch. A bounds-check for an intrinsic property (--metric-denylist) of the operand is nothing more than its own behavior being handled by CMO, same as any other non-repairable (by choice or by trial) error it'd exhibit (for instance, an invalid CRS config). We are failing KSM reconciliation and throwing that error without making any assumptions or implicit conversions to be explicit towards the user on CMO's resolution of their supplied deny-list. PLMK if I missed something.
I see, but in this case, it seems to me that if KSM itself is broken (or CMO thinks it’s broken), CMO will not sync KSM, which isn’t the behavior we expect from an operator. For example, if one makes an unintentional change to the KSM Deployment that broke it, CMO will abandon it instead of fixing it as expected.
CMO being unable to proceed because of some third-party dependency that CMO cannot even control/adjust (CR, a resource managed by another operator e.g.) is okay, but CMO not being able to fix an operand because the broken operand isn’t responding to CMO, well, that’s a circular dependency :)
Again, this is just an example, maybe there are others.
Since help-texts will always be present in the exposition data, so we should be safe.
Ok, then KSM initializes all the metrics, (even the empty ones), good to know.
Do we already have helpers for this in CMO? The single-scrape logic essentially crafts a request and interprets it, while pinging the /metrics endpoint with the necessary auth permissions. It seemed like a straightforward implementation, but feel free to drop any suggestions (in addition to Jan's comment above, as I've made those changes locally but I want to solve the auth issue before I push again) that may abstract this and reduce maintainability costs.
I don't think there are any existing utils for this (the only similar ones I see are used for tests), it'll be a new "trait/responsibility" that I'm not sure we should be adding to CMO. But the alternative I'm proposing (to expand thexxx_.+_yyy
) isn't without its flaws.
Of course, if we agree to merge this, maybe we should consider moving that new logic from manifests.go
and adding some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this boils down to a question of trading-off between:
- failing on an invalid deny-list, albeit at the expense of degrading a healthy KSM instance, or,
- subtly emitting or logging the issue, and relying on the cluster-admin/user to diagnose the issue.
I'm okay with wherever the team lands on this, just wanted to point this out.
Ok, then KSM initializes all the metrics, (even the empty ones), good to know.
Ah, sorry, I believe I should've been more explicit. So KSM has a list of default resources that it builds metric stores for from the get-go, but these have to be present in the cluster in order for the informers to trigger the metric family population. If so, these are initialized, and I believe this set should be a safe assumption for all OpenShift variants (I've tested this on hypershift
, but not all of them individually). LMK if you think otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ayoub for raising this, I think its well worth thinking carefully about the failure scenarios here.
Afaiu Ayoubs concern is that an erroneous user supplied deny list will cause unrelated tasks to fail. In the current iteration this is true. If KubeStateMetricsDenylistBoundsCheck
returns an error the KSM PrometheusRule and ServiceMonitor will not be reconciled. Also the config sharing task (in its own task group) will not run.
Maybe the application of the user supplied deny list should run in its own task in the last task group. If that fails at least the remaining stack is already rolled out completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems I missed the updates on this thread.
I wasn't suggesting dynamically changing CPs, but using the minimal CP as the source of truth: make KubeStateMetricsDenylistBoundsCheck rely on it instead of the /metrics output. If we had a minimal CP for KSM, CMO could only accept kubeStateMetrics.metricDenylist that doesn't drop metrics from the minimal CP.
Again, I'm just thinking out loud, maybe this needs more discussion later or somewhere else.
That is doable if the metric expressions within KSM's minimal service monitor are not regexes themselves (at the moment, these are deterministic values surrounded by |
, which can be used to statically validate the user-defined list). However,
- if they are, we'll need to either (a) switch to minimal profile every time we want to validate the denylist (to compile the regex, same problem we solve here by querying
/metrics
), or (b) have a complex regex implementation that verifies if a particular regex is a subset of the other, and additionally, - this will also significantly increase the number of metrics that can be denied by the user in a non-minimal environment, which other components may depend upon (the number of metrics in KSM's
minimal
SMkeep
is much lesser than the metrics not in the default deny-list, giving the user much more control).
I guess this boils down to whether making KSM's minimal
CP the source of truth for all non-minimal
CP environments can be guaranteed to safeguard all metrics that will ever be relied on by any internal component, which IIUC, sounds unsafe. Note that while CP may be moving to GA, components will still need time to adapt to the minimal
environment as there is no such concept being enforced currently (this is exactly what CPV aims to help developers with).
A KSM-specific deny-list will allow users to change their CP environment while making sure a certain set of metrics are always denied, irrespective of the CP being enforced at that moment. The minimal
analogy above isn't applicable in a full
profile (as components expect complete metrics availability), and IIUC, the user will need to rely on the deny-list option in that case to be able to deny any allowed KSM metric.
Actually, I was suggesting to pass the user input to KSM as is, without any validation in CMO, like we do for the node-exporter net interfaces I shared above.
By "validation", I'm assuming you mean the default-denylist subset-check that we do here, in which case, not doing so would give the user way too much control and break the "flexibility-stability" bounds this PR rests on. Not sure I follow the idea here, but feel free to correct me in case I misunderstood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Friendly ping. @machine424 PLMK if you'd prefer for me to schedule a call to discuss this thread, since reaching a consensus here in time may help us get this in 4.16. I deferred scheduling a call till now as I thought the discussion may end here and not need a call at all, but PLMK if you think otherwise, and if that would help accelerate things. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Pranshu for your answer.
(I don't think we need to switch to minimal
, as CMO always knows about the SM manifest as it's in its assets/, we could also use the current profile instead of the minimal one)
As I mentioned, I'm not trying/aiming to block this PR, even though I don’t 100% agree with the approach ;) If the other colleague are okay with it, let's merge it.
Let's resume the profiles related discussion once they're GA or during their transition to GA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to switch to minimal, as CMO always knows about the SM manifest as it's in its assets/, we could also use the current profile instead of the minimal one.
It's not possible to statically compile regexes (PTAL at my first point above), or atleast without a lot of computational (and maintenance) complexity.
As I mentioned, I'm not trying/aiming to block this PR, even though I don’t 100% agree with the approach ;) If the other colleague are okay with it, let's merge it.
I'm not sure I see the cause for the incorporation of collection profiles here at the moment, but I could very well be missing something. Nonetheless, your review has been constructive and critical to this patch, and not at all blocking IMHO. I'd much rather defer this PR to after 4.16, than rush it right now against your acumen. Besides, we have little to gain from merging this PR in 4.16, as there are no such commitments that would tie us to doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll setup a call later this week to continue this discussion. That should make it easier to arrive at a consensus. :)
Squashed all |
/retest-required |
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Introduce peripheral tasks and make deny-list bounds checking more KSM-compliant by interpreting the regexes in the same manner, i.e., instead of matching against a line, match against a metric. Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggesting some small changes to make it a little easier to read.
Documentation/api.md
Outdated
@@ -172,6 +172,7 @@ The `KubeStateMetricsConfig` resource defines settings for the `kube-state-metri | |||
|
|||
| Property | Type | Description | | |||
| -------- | ---- | ----------- | | |||
| metricDenylist | []string | Comma-separated list of metrics not to be enabled. This list comprises exact metric names and/or regex patterns. CMO has a default deny-list that forms the overall scope of the set of metrics that are allowed to be enabled. However, metrics that are not in the default deny-list cannot be disabled by the user, since various OpenShift components rely on them. Doing so will cause the operator to go into a degraded state, until a valid (or empty) list is provided by the user. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| metricDenylist | []string | Comma-separated list of metrics not to be enabled. This list comprises exact metric names and/or regex patterns. CMO has a default deny-list that forms the overall scope of the set of metrics that are allowed to be enabled. However, metrics that are not in the default deny-list cannot be disabled by the user, since various OpenShift components rely on them. Doing so will cause the operator to go into a degraded state, until a valid (or empty) list is provided by the user. | | |
| metricDenylist | []string | A comma-separated list of metrics that are disabled by default. This list comprises exact metric names and/or regex patterns. You can enable the metrics from the CMO default deny-list. However, you cannot disable metrics that are not in the default deny-list, because various OpenShift components rely on them. Doing so causes the Operator to go into a degraded state, until a valid (or empty) list is provided by the user. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
/label docs-approved |
/retest |
re-test PR with cluster-bot, LGTM |
/hold Until ongoing discussions are done with. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@@ -157,3 +162,28 @@ func getSecurityContextRestrictedProfile() *v1.SecurityContext { | |||
}, | |||
} | |||
} | |||
|
|||
func getOrCreateCMOConfig(t *testing.T) (*v1.ConfigMap, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't have a look at all the recent changes, but this drew my attention.
are the existing BuildCMOConfigMap
and MustCreateOrUpdateConfigMap
not sufficient for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we can use BuildCMOConfigMap
to construct the ConfigMap here. 👍🏼
… docs Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
@rexagod: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Signed-off-by: Pranshu Srivastava rexagod@gmail.com
CMO Config (with denied metrics)
KSM Logs