Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-13153: Limit the value of GOMAXPROCS on node-exporter. #1996

Merged
merged 1 commit into from Jun 27, 2023

Conversation

machine424
Copy link
Contributor

@machine424 machine424 commented Jun 12, 2023

xxxxx

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
  • Add to jsonnet
  • Add tests
  • Adjust commit message

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 12, 2023
@openshift-ci-robot
Copy link
Contributor

@machine424: This pull request references Jira Issue OCPBUGS-13153, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

xxxxx

  • I added CHANGELOG entry for this change.

  • No user facing changes, so no entry in CHANGELOG was needed.

  • Add to jsonnet

  • Add tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@machine424: This pull request references Jira Issue OCPBUGS-13153, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

xxxxx

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
  • Add to jsonnet
  • Add tests
  • Adjust commit message

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@machine424
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot
Copy link
Contributor

@machine424: This pull request references Jira Issue OCPBUGS-13153, which is invalid:

  • expected the bug to target only the "4.14.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@machine424 machine424 force-pushed the node-maxprocs branch 2 times, most recently from 7d40749 to 08aa416 Compare June 12, 2023 12:38
@machine424
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 12, 2023
@openshift-ci-robot
Copy link
Contributor

@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

xxxxx

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
  • Add to jsonnet
  • Add tests
  • Adjust commit message

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

xxxxx

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
  • Add to jsonnet
  • Add tests
  • Adjust commit message

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

xxxxx

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
  • Add to jsonnet
  • Add tests
  • Adjust commit message

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@machine424
Copy link
Contributor Author

machine424 commented Jun 12, 2023

I'll add a test that assures --runtime.gomaxproc precedence in another commit to ease backporting.

this will be done in a separate PR #1998

machine424 added a commit to machine424/cluster-monitoring-operator that referenced this pull request Jun 12, 2023
…terGoMaxProcs tests

Do it in a seperate PR to ease backporitng
openshift#1996
as the maxprocs config wasn't present in < 4.14

Signed-off-by: Ayoub Mrini <amrini@redhat.com>
t, 5*time.Minute, `max(go_sched_gomaxprocs_threads{job="node-exporter"})`,
func(v float64) error {
if v > 4 {
return fmt.Errorf(`expecting max(go_sched_gomaxprocs_threads{job="node-exporter"}) <= 4 but got %v.`, v)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go_sched_gomaxprocs_threads seems to only be present on some prom_go versions? go versions >=1.19?), I'll adjust the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was removed by default in recent versions of prom go client (see prometheus/client_golang#1102), and node-exporter does not seem to add them (they didn't add them back since, see https://github.com/prometheus/node_exporter/blame/bdc430af2b63e0556b591625f86cff60ff88c242/node_exporter.go#L64 ), and we'll not ask for them just for a test.

I remove the test, we can add one that check the "Set GOMAXPROCS=N" in logs, I cannot think of another that wouldn't make the patch more complex.

@jan--f
Copy link
Contributor

jan--f commented Jun 13, 2023

This seems to be a recurring issue for golang based containers in k8s. While this fixes the customer issue reported I wonder if we should think about a different solution, one that is easier to test and that fixes this once and for all.

Reading about this one often finds go.uber.org/automaxproc mentioned and prometheus itself has made use of that. Maybe node-exporter could benefit from a similar solution?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2023
@machine424
Copy link
Contributor Author

This seems to be a recurring issue for golang/go#33803 based prometheus-operator/prometheus-operator#501 in k8s. While this fixes the customer issue reported I wonder if we should think about a different solution, one that is easier to test and that fixes this once and for all.

Reading about this one often finds go.uber.org/automaxproc mentioned and https://github.com/prometheus/prometheus/pull/10498has made use of that. Maybe node-exporter could benefit from a similar solution?

Yes, I still don't understand why go.uber.org/automaxproc hasn't been integrated into Go yet, I've always found limiting GOMAXPROCS handy to calm down some programs when running them on big machines (it helped me once calm down Prometheus before they add automaxproc).

We think about implementing the same for node-exporter, but it wouldn't help in this case as we don't set CPU limits, maybe we should think about setting them...

(Actually, the CU runs a single node cluster (with 4 CPU reserved to platform components) and normally, the default value of GOMAXPROCS takes CPU affinity into account https://github.com/golang/go/blob/bce7aec3cdca8580585095007e9b7cea11a8812f/src/runtime/os_linux.go#L107, but the affinity doesn't seem to be set for node-exporter as its CPU requests is not an integer I think)

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2023
@jan--f
Copy link
Contributor

jan--f commented Jun 13, 2023

We think about implementing the same for node-exporter, but it wouldn't help in this case as we don't set CPU limits, maybe we should think about setting them...

We are thinking about that already ;)
https://issues.redhat.com/browse/MON-3168

but the affinity doesn't seem to be set for node-exporter as its CPU requests is not an integer I think)

Not sure I understand this part. I'm a bit torn, I feel like we can improve our understanding of what's going on here but at the same time I don't have anything constructive and will be out for the rest of the week. Maybe @openshift/openshift-team-monitoring can chime in here?

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the desire to fix it once for all in node_exporter and the automaxproc library + ability to configure CPU limits in CMO could be the right solution (btw we don't even need the automaxprocs lib as we could inject the CPU limits via the downward API). At the same time, this patch provides a tactical solution that should be relatively easy to backport...

jsonnet/components/node-exporter.libsonnet Outdated Show resolved Hide resolved
jsonnet/components/node-exporter.libsonnet Outdated Show resolved Hide resolved
@raptorsun
Copy link
Contributor

raptorsun commented Jun 19, 2023

We need to check the env var GOMAXPROCS is not overwriting the settings nodeExporter.maxProcs just added by https://issues.redhat.com//browse/MON-2894

According to the comment in the runtime code, the main function of node exporter should be called after setting the process count by the env var GOMAXPROCS. It should not overwrite the setting of GOMAXPROCS in node exporter.

But I suggest to give it a test to be sure :)

On nodes with multiple CPU cores, this should help avoid lock contentions
without having any side effects, see the ticket for more details.

node-exporter versions that have "--runtime.gomaxprocs" can still
override this: the flag has precedence over this automatic setting.

Make the entrypoint set the env var before runnning the process.
Get the CPU count from /proc/cpuinfo without taking CPU affinity (via cpuset)
into account as the container's CPU requests is not an integer, thus no
affinity is applied.

Signed-off-by: Ayoub Mrini <amrini@redhat.com>
@machine424
Copy link
Contributor Author

We need to check the env var GOMAXPROCS is not overwriting the settings nodeExporter.maxProcs just added by https://issues.redhat.com//browse/MON-2894

According to the comment in the runtime code, the main function of node exporter should be called after setting the process count by the env var GOMAXPROCS. It should not overwrite the setting of GOMAXPROCS in node exporter.

But I suggest to give it a test to be sure :)

Yes, you're right, it's clear that the flag has precedence prometheus/node_exporter@fb26b9f, but I had done some manual tests and the flag does have precedence in >= 4.14

I even wanted to add the check to the flag e2e test https://github.com/openshift/cluster-monitoring-operator/pull/1998/files#diff-28d0ddad61ad014c06899788cff47f72864db5df662c49235c59cee301e74b57R243 but removed it as the go_sched_gomaxprocs_threads metric is not available in latest node exporter versions #1996 (comment) (maybe we can see with if they will expose the metric again or want to expose gomaxprocs and other params on a /path as prometheus does and …we can add the check back in the future)

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 20, 2023

@machine424: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/versions 58997b8 link false /test versions

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@simonpasquier
Copy link
Contributor

/skip

@jan--f
Copy link
Contributor

jan--f commented Jun 27, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, machine424

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2023
@openshift-merge-robot openshift-merge-robot merged commit 2bb7887 into openshift:master Jun 27, 2023
16 checks passed
@openshift-ci-robot
Copy link
Contributor

@machine424: Jira Issue OCPBUGS-13153: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13153 has been moved to the MODIFIED state.

In response to this:

xxxxx

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
  • Add to jsonnet
  • Add tests
  • Adjust commit message

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@machine424
Copy link
Contributor Author

/cherrypick release-4.13
/cherrypick release-4.12

@openshift-cherrypick-robot

@machine424: #1996 failed to apply on top of branch "release-4.13":

Applying: OCPBUGS-13153: Limit the value of GOMAXPROCS on node-exporter to 4.
Using index info to reconstruct a base tree...
M	assets/node-exporter/daemonset.yaml
M	jsonnet/components/node-exporter.libsonnet
Falling back to patching base and 3-way merge...
Auto-merging jsonnet/components/node-exporter.libsonnet
Auto-merging assets/node-exporter/daemonset.yaml
CONFLICT (content): Merge conflict in assets/node-exporter/daemonset.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OCPBUGS-13153: Limit the value of GOMAXPROCS on node-exporter to 4.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-4.13
/cherrypick release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants