MON-3752: expose metric-denylist for KSM #2283

Open
wants to merge 11 commits into
base: master

Conversation

rexagod
Member

@rexagod rexagod commented Mar 12, 2024

Signed-off-by: Pranshu Srivastava rexagod@gmail.com

CMO Config (with denied metrics)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    kubeStateMetrics:
      metricDenylist:
        - ^kube_.+_created$
        - ^kube_.+_annotations$
KSM Logs
I0312 21:31:50.753546       1 wrapper.go:120] "Starting kube-state-metrics"
W0312 21:31:50.753856       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0312 21:31:50.754060       1 server.go:192] "Used default resources"
I0312 21:31:50.754100       1 types.go:184] "Using all namespaces"
I0312 21:31:50.754164       1 server.go:225] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: ^kube_.+_annotations$, ^kube_.+_created$"
W0312 21:31:50.754199       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0312 21:31:50.754825       1 utils.go:70] "Tested communication with server"
I0312 21:31:50.771593       1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28" gitVersion="v1.28.7+6e2789b" gitTreeState="clean" gitCommit="dfd36a72760e09d4971f8b49ed335f5522dab8af" platform="linux/amd64"
I0312 21:31:50.771662       1 utils.go:76] "Communication with server successful"
I0312 21:31:50.772149       1 server.go:347] "Started metrics server" metricsServerAddress="127.0.0.1:8081"
I0312 21:31:50.772315       1 metrics_handler.go:99] "Autosharding disabled"
I0312 21:31:50.774043       1 server.go:336] "Started kube-state-metrics self metrics server" telemetryAddress="127.0.0.1:8082"
I0312 21:31:50.776631       1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8082
I0312 21:31:50.776699       1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8082
I0312 21:31:50.778155       1 builder.go:271] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments"
I0312 21:31:50.778438       1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8081
I0312 21:31:50.778514       1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8081                                    

  • I added CHANGELOG entry for this change.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 12, 2024

@rexagod: This pull request references MON-3752 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

Signed-off-by: Pranshu Srivastava rexagod@gmail.com


  • I added CHANGELOG entry for this change.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 12, 2024
Contributor

openshift-ci bot commented Mar 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2024
@rexagod
Member Author

rexagod commented Mar 12, 2024

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 12, 2024

@rexagod: This pull request references MON-3752 which is a valid jira issue.

In response to this:

/jira refresh

@rexagod rexagod force-pushed the 3752 branch 2 times, most recently from 9ba7f76 to c7e7acc Compare March 12, 2024 21:48
@rexagod
Member Author

rexagod commented Mar 13, 2024

/test e2e-agnostic-operator

@rexagod
Member Author

rexagod commented Mar 14, 2024

The description shows the current behavior, which is to replace the deny-list with the one specified by the user (the CMO configuration deny-list takes precedence over the default one).

However, owing to this, users could very well find themselves (a) denying certain default metrics that various alerts and dashboards depend upon, or (b) enabling certain default metrics that could cause cardinality issues, ^kube_.+_annotations$, for instance.

@jan--f
Contributor

jan--f commented Mar 15, 2024

Adding high cardinality metrics is a concern, but I'm more concerned about users removing metrics that we rely on in dashboards and alerts.
The feature request around this is specifically to add previously denied metrics back. Giving users free rein over the denylist strikes me as quite dangerous, as they could now easily render the stack mostly useless.

@rexagod rexagod marked this pull request as draft March 28, 2024 00:29
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 28, 2024
@rexagod rexagod force-pushed the 3752 branch 2 times, most recently from 1bcecea to 4cdeaf7 Compare April 3, 2024 00:41
@rexagod rexagod marked this pull request as ready for review April 3, 2024 00:42
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2024
@@ -95,6 +95,21 @@ func (t *KubeStateMetricsTask) Run(ctx context.Context) error {
return fmt.Errorf("reconciling kube-state-metrics Deployment failed: %w", err)
}

prometheusK8sTokenSecret, err := t.client.GetTokenSecret(ctx, dep.Namespace, "prometheus-k8s")
Member Author

I've used the prometheus-k8s-token-... secret to query the /metrics endpoint, since that was allowed by KRP.

}

// Query the endpoint.
t := time.NewTimer(5 * time.Second)
Contributor

I'm usually not a huge fan of hardcoding times within the code, but I see the point. How about setting a var for this in any case? I know it's only used there, but it'd be cool from a readability standpoint and for a later change if needed.

Contributor

To add to this, is a Timer the right tool here? Wouldn't the worst case here be a hot loop creating http requests for 5 seconds?
Wouldn't a Poll variant with timeout and maybe backoff make sense here?

Member Author

Done, PTAL.

Path: "/metrics",
}
client := &http.Client{
Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
Contributor

Is it too complicated here to find the right TLS artifacts?
The metrics-client-certs secret should also have mTLS artifacts, this code wouldn't have to retrieve a bearer token for auth I think. Though there might be some fault tolerance implications when using mTLS only. @simonpasquier wdyt?

Member Author

@rexagod rexagod Apr 15, 2024

Is it too complicated here to find the right TLS artifacts?

I missed incorporating this, but I think I might be able to utilize metrics-client-certs for the TLS certificates, as you mentioned, and perhaps metrics-client-ca for the CA. I'll take a look.
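
As a rough sketch of what replacing `InsecureSkipVerify: true` could look like, the snippet below builds an `http.Client` from PEM-encoded CA, client certificate, and key bytes. It assumes those bytes have already been read from the relevant secrets; the helper name `newMTLSClient`, the server name, and which secret holds which artifact are assumptions for illustration, not something this PR confirms.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
)

// newMTLSClient builds an HTTP client that verifies the server against caPEM
// and presents the certPEM/keyPEM pair as its client certificate.
// (Hypothetical helper, for illustration only.)
func newMTLSClient(caPEM, certPEM, keyPEM []byte, serverName string) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in PEM data")
	}
	pair, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("loading client key pair: %w", err)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,
				Certificates: []tls.Certificate{pair},
				ServerName:   serverName,
				MinVersion:   tls.VersionTLS12,
			},
		},
	}, nil
}

func main() {
	// Placeholder inputs: real PEM data would come from the cluster secrets.
	_, err := newMTLSClient([]byte("not a PEM"), nil, nil, "example.invalid")
	fmt.Println(err)
}
```

With a client certificate in place, the request would not need a bearer token, subject to the fault-tolerance question raised above.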

Member Author

As a side note, I believe we should drop the following from the TLS cipher suite (for all components).

W0416 13:58:20.519009       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0416 13:58:20.519041       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.

Contributor

Mind raising a bug for us to address this?

// KubeStateMetricsDenylistBoundsCheck ensures that the user is:
// * able to enable metrics that are denied by default, and,
// * unable to disable metrics that are enabled by default.
func (f *Factory) KubeStateMetricsDenylistBoundsCheck(deployment *appsv1.Deployment, service *v1.Service, secret *v1.Secret) (*appsv1.Deployment, error) {
Contributor

Clever idea :) but I have some concerns:

  • Making an operator depend on one of its operands doesn't seem that common and natural to me. Here, for example, if CMO cannot reach out to KSM's /metrics, it'll abort its task and not try to "help it". This could be fixed, but maybe there will be other deadlocks like this.
  • Does KSM always expose all the metrics that it can expose? I mean, does it initialize all the metrics? If not, /metrics will not really be a source of truth: A user could drop a not-there-yet metric.
  • The code seems a little bit hacky as we're re-implementing some parts of a scraper, and I'm concerned about its maintainability.

I don't have a better alternative but I see that for the majority of metrics in the current --metric-denylist we specify the exact metrics. How about expanding the ones with xxx_.+_yyy and just allowing users to select from that list? But maybe it'll be harder to maintain. (we can still have tests with xxx_.+_yyy to be sure we take new metrics we want to drop into account)

Maybe there is also something to be done in connection with collection profiles...

Member Author

Making an operator depend on one of its operands doesn't seem that common and natural to me. Here, for example, if CMO cannot reach out to KSM's /metrics, it'll abort its task and not try to "help it". This could be fixed, but maybe there will be other deadlocks like this.

Ah, I believe CMO itself won't lean on KSM any more than it did before this patch. A bounds-check for an intrinsic property (--metric-denylist) of the operand is nothing more than its own behavior being handled by CMO, same as any other non-repairable (by choice or by trial) error it'd exhibit (for instance, an invalid CRS config). We are failing KSM reconciliation and throwing that error without making any assumptions or implicit conversions to be explicit towards the user on CMO's resolution of their supplied deny-list. PLMK if I missed something.

Does KSM always expose all the metrics that it can expose? I mean, does it initialize all the metrics? If not, /metrics will not really be a source of truth: A user could drop a not-there-yet metric.

Since help-texts will always be present in the exposition data, we should be safe.

[...] re-implementing some parts of a scraper [...]

Do we already have helpers for this in CMO? The single-scrape logic essentially crafts a request and interprets it, while pinging the /metrics endpoint with the necessary auth permissions. It seemed like a straightforward implementation, but feel free to drop any suggestions (in addition to Jan's comment above, as I've made those changes locally but I want to solve the auth issue before I push again) that may abstract this and reduce maintainability costs.

Contributor

Ah, I believe CMO itself won't lean on KSM any more than it did before this patch. A bounds-check for an intrinsic property (--metric-denylist) of the operand is nothing more than its own behavior being handled by CMO, same as any other non-repairable (by choice or by trial) error it'd exhibit (for instance, an invalid CRS config). We are failing KSM reconciliation and throwing that error without making any assumptions or implicit conversions to be explicit towards the user on CMO's resolution of their supplied deny-list. PLMK if I missed something.

I see, but in this case, it seems to me that if KSM itself is broken (or CMO thinks it’s broken), CMO will not sync KSM, which isn’t the behavior we expect from an operator. For example, if one makes an unintentional change to the KSM Deployment that broke it, CMO will abandon it instead of fixing it as expected.
CMO being unable to proceed because of some third-party dependency that CMO cannot even control/adjust (CR, a resource managed by another operator e.g.) is okay, but CMO not being able to fix an operand because the broken operand isn’t responding to CMO, well, that’s a circular dependency :)
Again, this is just an example, maybe there are others.

Since help-texts will always be present in the exposition data, we should be safe.

Ok, then KSM initializes all the metrics, (even the empty ones), good to know.

Do we already have helpers for this in CMO? The single-scrape logic essentially crafts a request and interprets it, while pinging the /metrics endpoint with the necessary auth permissions. It seemed like a straightforward implementation, but feel free to drop any suggestions (in addition to Jan's comment above, as I've made those changes locally but I want to solve the auth issue before I push again) that may abstract this and reduce maintainability costs.

I don't think there are any existing utils for this (the only similar ones I see are used for tests); it'll be a new "trait/responsibility" that I'm not sure we should be adding to CMO. But the alternative I'm proposing (to expand the xxx_.+_yyy) isn't without its flaws.
Of course, if we agree to merge this, maybe we should consider moving that new logic from manifests.go and adding some tests.

Member Author

@rexagod rexagod Apr 16, 2024

IMHO this boils down to a trade-off between:

  • failing on an invalid deny-list, albeit at the expense of degrading a healthy KSM instance, or,
  • subtly emitting or logging the issue, and relying on the cluster-admin/user to diagnose the issue.

I'm okay with wherever the team lands on this, just wanted to point this out.

Ok, then KSM initializes all the metrics, (even the empty ones), good to know.

Ah, sorry, I believe I should've been more explicit. So KSM has a list of default resources that it builds metric stores for from the get-go, but these have to be present in the cluster in order for the informers to trigger the metric family population. If so, these are initialized, and I believe this set should be a safe assumption for all OpenShift variants (I've tested this on hypershift, but not all of them individually). LMK if you think otherwise.

Contributor

Thanks Ayoub for raising this, I think it's well worth thinking carefully about the failure scenarios here.
AFAIU, Ayoub's concern is that an erroneous user-supplied deny list will cause unrelated tasks to fail. In the current iteration this is true. If KubeStateMetricsDenylistBoundsCheck returns an error, the KSM PrometheusRule and ServiceMonitor will not be reconciled. Also, the config sharing task (in its own task group) will not run.

Maybe the application of the user supplied deny list should run in its own task in the last task group. If that fails at least the remaining stack is already rolled out completely.

Member Author

@rexagod rexagod May 1, 2024

It seems I missed the updates on this thread.

I wasn't suggesting dynamically changing CPs, but using the minimal CP as the source of truth: make KubeStateMetricsDenylistBoundsCheck rely on it instead of the /metrics output. If we had a minimal CP for KSM, CMO could only accept kubeStateMetrics.metricDenylist that doesn't drop metrics from the minimal CP.
Again, I'm just thinking out loud, maybe this needs more discussion later or somewhere else.

That is doable if the metric expressions within KSM's minimal service monitor are not regexes themselves (at the moment, these are deterministic values surrounded by |, which can be used to statically validate the user-defined list). However,

  • if they are, we'll need to either (a) switch to minimal profile every time we want to validate the denylist (to compile the regex, same problem we solve here by querying /metrics), or (b) have a complex regex implementation that verifies if a particular regex is a subset of the other, and additionally,
  • this will also significantly increase the number of metrics that can be denied by the user in a non-minimal environment, which other components may depend upon (the number of metrics in KSM's minimal SM keep list is much smaller than the number of metrics not in the default deny-list, giving the user much more control).
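
For the static case (literal metric names joined by |), the validation could be sketched roughly as follows. The function name `deniedKeepMetrics`, the sample keep expression, and the patterns are assumptions for illustration; the sketch only works while the keep expression contains no regex syntax beyond optional surrounding parentheses.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// deniedKeepMetrics splits a keep expression made of literal metric names
// joined by "|" and returns the names that any user deny pattern matches.
// (Hypothetical helper, for illustration only.)
func deniedKeepMetrics(keepExpr string, userDeny []string) ([]string, error) {
	var patterns []*regexp.Regexp
	for _, expr := range userDeny {
		re, err := regexp.Compile(expr)
		if err != nil {
			return nil, fmt.Errorf("invalid deny pattern %q: %w", expr, err)
		}
		patterns = append(patterns, re)
	}
	var hits []string
	// Strip optional surrounding parentheses, then treat each alternative
	// as an exact metric name.
	for _, name := range strings.Split(strings.Trim(keepExpr, "()"), "|") {
		for _, re := range patterns {
			if re.MatchString(name) {
				hits = append(hits, name)
				break
			}
		}
	}
	return hits, nil
}

func main() {
	keep := "(kube_pod_info|kube_node_status_condition|kube_job_failed)"
	hits, err := deniedKeepMetrics(keep, []string{`^kube_.+_created$`, `^kube_node_.+$`})
	fmt.Println(hits, err) // prints [kube_node_status_condition] <nil>
}
```

Once the keep expression itself contains regexes, this name-by-name check no longer applies, which is the first point in the list above.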

I guess this boils down to whether making KSM's minimal CP the source of truth for all non-minimal CP environments can be guaranteed to safeguard all metrics that will ever be relied on by any internal component, which IIUC, sounds unsafe. Note that while CP may be moving to GA, components will still need time to adapt to the minimal environment as there is no such concept being enforced currently (this is exactly what CPV aims to help developers with).

A KSM-specific deny-list will allow users to change their CP environment while making sure a certain set of metrics are always denied, irrespective of the CP being enforced at that moment. The minimal analogy above isn't applicable in a full profile (as components expect complete metrics availability), and IIUC, the user will need to rely on the deny-list option in that case to be able to deny any allowed KSM metric.

Actually, I was suggesting to pass the user input to KSM as is, without any validation in CMO, like we do for the node-exporter net interfaces I shared above.

By "validation", I'm assuming you mean the default-denylist subset-check that we do here, in which case, not doing so would give the user way too much control and break the "flexibility-stability" bounds this PR rests on. Not sure I follow the idea here, but feel free to correct me in case I misunderstood.

Member Author

Friendly ping. @machine424 PLMK if you'd prefer for me to schedule a call to discuss this thread, since reaching a consensus here in time may help us get this in 4.16. I deferred scheduling a call till now as I thought the discussion may end here and not need a call at all, but PLMK if you think otherwise, and if that would help accelerate things. Thanks!

Contributor

Thanks Pranshu for your answer.
(I don't think we need to switch to minimal, as CMO always knows about the SM manifest as it's in its assets/, we could also use the current profile instead of the minimal one)
As I mentioned, I'm not trying/aiming to block this PR, even though I don’t 100% agree with the approach ;) If the other colleagues are okay with it, let's merge it.
Let's resume the profiles related discussion once they're GA or during their transition to GA.

Member Author

@rexagod rexagod May 13, 2024

I don't think we need to switch to minimal, as CMO always knows about the SM manifest as it's in its assets/, we could also use the current profile instead of the minimal one.

It's not possible to statically compile regexes (PTAL at my first point above), or at least not without a lot of computational (and maintenance) complexity.


As I mentioned, I'm not trying/aiming to block this PR, even though I don’t 100% agree with the approach ;) If the other colleagues are okay with it, let's merge it.

I'm not sure I see the cause for the incorporation of collection profiles here at the moment, but I could very well be missing something. Nonetheless, your review has been constructive and critical to this patch, and not at all blocking IMHO. I'd much rather defer this PR to after 4.16, than rush it right now against your acumen. Besides, we have little to gain from merging this PR in 4.16, as there are no such commitments that would tie us to doing so.

Member Author

I'll set up a call later this week to continue this discussion. That should make it easier to arrive at a consensus. :)

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
@rexagod
Member Author

rexagod commented Apr 23, 2024

Squashed all fixup!s.

@rexagod
Member Author

rexagod commented Apr 24, 2024

/retest-required

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2024
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Introduce peripheral tasks and make deny-list bounds checking more
KSM-compliant by interpreting the regexes in the same manner, i.e.,
instead of matching against a line, match against a metric.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2024
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
@eromanova97 eromanova97 left a comment

Suggesting some small changes to make it a little easier to read.

@@ -172,6 +172,7 @@ The `KubeStateMetricsConfig` resource defines settings for the `kube-state-metri

| Property | Type | Description |
| -------- | ---- | ----------- |
| metricDenylist | []string | Comma-separated list of metrics not to be enabled. This list comprises exact metric names and/or regex patterns. CMO has a default deny-list that forms the overall scope of the set of metrics that are allowed to be enabled. However, metrics that are not in the default deny-list cannot be disabled by the user, since various OpenShift components rely on them. Doing so will cause the operator to go into a degraded state, until a valid (or empty) list is provided by the user. |

Suggested change
| metricDenylist | []string | Comma-separated list of metrics not to be enabled. This list comprises exact metric names and/or regex patterns. CMO has a default deny-list that forms the overall scope of the set of metrics that are allowed to be enabled. However, metrics that are not in the default deny-list cannot be disabled by the user, since various OpenShift components rely on them. Doing so will cause the operator to go into a degraded state, until a valid (or empty) list is provided by the user. |
| metricDenylist | []string | A comma-separated list of metrics that are disabled by default. This list comprises exact metric names and/or regex patterns. You can enable the metrics from the CMO default deny-list. However, you cannot disable metrics that are not in the default deny-list, because various OpenShift components rely on them. Doing so causes the Operator to go into a degraded state, until a valid (or empty) list is provided by the user. |

Member Author

Done, PTAL.

LGTM

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
@eromanova97

/label docs-approved

@openshift-ci openshift-ci bot added the docs-approved Signifies that Docs has signed off on this PR label Apr 24, 2024
@jan--f
Contributor

jan--f commented Apr 24, 2024

/retest

@Tai-RedHat

re-test PR with cluster-bot, LGTM

@rexagod
Member Author

rexagod commented May 6, 2024

/hold

Until ongoing discussions are done with.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2024
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@@ -157,3 +162,28 @@ func getSecurityContextRestrictedProfile() *v1.SecurityContext {
},
}
}

func getOrCreateCMOConfig(t *testing.T) (*v1.ConfigMap, error) {
Contributor

I didn't have a look at all the recent changes, but this drew my attention.
Are the existing BuildCMOConfigMap and MustCreateOrUpdateConfigMap not sufficient for this?

Member Author

Right, we can use BuildCMOConfigMap to construct the ConfigMap here. 👍🏼

… docs

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Contributor

openshift-ci bot commented May 13, 2024

@rexagod: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --------- | ------ | ------- | -------- | ------------- |
| ci/prow/go-fmt | e3ae58a | link | true | /test go-fmt |
| ci/prow/shellcheck | e3ae58a | link | true | /test shellcheck |
| ci/prow/e2e-aws-ovn-techpreview | e3ae58a | link | true | /test e2e-aws-ovn-techpreview |
| ci/prow/unit | e3ae58a | link | true | /test unit |
| ci/prow/verify | e3ae58a | link | true | /test verify |
| ci/prow/rules | e3ae58a | link | true | /test rules |
| ci/prow/versions | e3ae58a | link | false | /test versions |
| ci/prow/e2e-aws-ovn | e3ae58a | link | true | /test e2e-aws-ovn |
| ci/prow/images | e3ae58a | link | true | /test images |
| ci/prow/jsonnet-fmt | e3ae58a | link | true | /test jsonnet-fmt |
| ci/prow/golangci-lint | e3ae58a | link | true | /test golangci-lint |
| ci/prow/vendor | e3ae58a | link | true | /test vendor |
| ci/prow/e2e-aws-ovn-single-node | e3ae58a | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-agnostic-operator | e3ae58a | link | true | /test e2e-agnostic-operator |
| ci/prow/generate | e3ae58a | link | true | /test generate |
| ci/prow/e2e-aws-ovn-upgrade | e3ae58a | link | true | /test e2e-aws-ovn-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • docs-approved: Signifies that Docs has signed off on this PR
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • qe-approved: Signifies that QE has signed off on this PR
9 participants