Fix metrics AlreadyRegisteredError on TestRecordOperation and TestGetHistogramVecFromGatherer unit test #106289

Merged
merged 9 commits into from
Nov 17, 2021

Conversation

CatherineF-dev
Contributor

Test:

make test KUBE_RACE=-race KUBE_TIMEOUT=--timeout=600s GOFLAGS=-count=10 WHAT=./staging/src/k8s.io/component-base/metrics/testutil

make test KUBE_RACE=-race KUBE_TIMEOUT=--timeout=600s GOFLAGS=-count=10 WHAT=./pkg/kubelet/kuberuntime/

Fixes #104940

It takes over #105809

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 10, 2021
@k8s-ci-robot
Contributor

Hi @CatherineF-dev. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 10, 2021
@CatherineF-dev
Contributor Author

cc @MikeSpreitzer

pkg/kubelet/kuberuntime/instrumented_services_test.go Outdated
@@ -61,6 +69,8 @@ func TestRecordOperation(t *testing.T) {
assert.HTTPBodyContains(t, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
mux.ServeHTTP(w, r)
}), "GET", prometheusURL, nil, runtimeOperationsDurationExpected)

registry.Reset()
Member

I don't think this is necessary since the test will terminate after evaluating this expression so we will not be using the registry anymore.

Contributor Author

Without it, the test failed with make test KUBE_RACE=-race KUBE_TIMEOUT=--timeout=600s GOFLAGS=-count=10 WHAT=./pkg/kubelet/kuberuntime/. This run is a little unusual: it uses -count=10, so the test executes ten times.

#104940 (comment)

Member

Even if it is run 10 times, all of the tests should have independent registries which shouldn't collide with one another.

A potential reason why you were still seeing failures might be because you are using the prometheus.DefaultRegisterer in the handler which is shared between the tests. Although the library might be protecting against that, I haven't checked. But it might be worth checking again with my suggestion from above: https://github.com/kubernetes/kubernetes/pull/106289/files#r746633125
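
For readers following the registry discussion, here is a minimal sketch of why a shared registry can collide across repeated runs while a fresh local one does not. It uses a toy `registry` type and a made-up metric name, both assumptions for illustration only; it is not the Prometheus or component-base API, and whether the real libraries behave this way in this test was not verified in the thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyRegistered is a hypothetical stand-in for the duplicate-registration
// error that a metrics registry reports.
var errAlreadyRegistered = errors.New("duplicate metrics collector registration attempted")

// registry is a toy model of a metrics registry keyed by metric name.
type registry struct {
	names map[string]bool
}

func newRegistry() *registry { return &registry{names: map[string]bool{}} }

// register rejects a name that this registry has already seen.
func (r *registry) register(name string) error {
	if r.names[name] {
		return errAlreadyRegistered
	}
	r.names[name] = true
	return nil
}

// shared plays the role of a process-wide default registry.
var shared = newRegistry()

// runWithShared models one test iteration that registers into the shared registry.
func runWithShared() error { return shared.register("example_operations_total") }

// runWithLocal models one test iteration that builds its own registry first.
func runWithLocal() error { return newRegistry().register("example_operations_total") }

func main() {
	// Two iterations in one process, similar in spirit to a repeated test run.
	for i := 1; i <= 2; i++ {
		fmt.Printf("run %d: shared=%v local=%v\n", i, runWithShared(), runWithLocal())
	}
}
```

In this model only the shared registry reports a duplicate on the second iteration, which is the kind of collision the comment above speculates about.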

Contributor Author

Hi Damien, I think registry.Reset() is needed.

Even though the registry is local, the metrics RuntimeOperations and RuntimeOperationsDuration are global.

I tested that adding metrics.RuntimeOperations.Reset() also works when registry.Reset() is removed.
CatherineF-dev@a8692c6
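
The point about a local registry wrapping global metrics can be sketched as follows. The `counter` and `registry` types here are toy stand-ins written for this illustration, not the component-base types, and the assumption that `Reset` on the registry resets every registered collector mirrors what the thread describes rather than a verified implementation detail.

```go
package main

import "fmt"

// counter is a toy stand-in for a package-level metric such as
// metrics.RuntimeOperations; it is not the component-base type.
type counter struct{ value int }

func (c *counter) Inc()   { c.value++ }
func (c *counter) Reset() { c.value = 0 }

// operations is global, so it outlives any single test invocation.
var operations = &counter{}

// registry is a toy local registry that only remembers what was registered.
type registry struct{ collectors []*counter }

func (r *registry) MustRegister(c *counter) { r.collectors = append(r.collectors, c) }

// Reset clears every registered collector (assumed behavior for this sketch).
func (r *registry) Reset() {
	for _, c := range r.collectors {
		c.Reset()
	}
}

// recordOnce models one run of the test body: a fresh registry is created,
// one operation is recorded, and the observed count is returned.
func recordOnce(resetFirst bool) int {
	r := &registry{}
	r.MustRegister(operations)
	if resetFirst {
		r.Reset()
	}
	operations.Inc()
	return operations.value
}

func main() {
	// Without a reset, the second run sees state left over from the first.
	fmt.Println(recordOnce(false), recordOnce(false))
	// With a reset, each run starts from zero.
	fmt.Println(recordOnce(true), recordOnce(true))
}
```

A new registry per run does not help here because the counter itself lives at package scope; only resetting the counter (directly or through the registry) restores a clean starting value.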

Member

Makes sense, thank you for looking into that @CatherineF-dev 🙂

@wgahnagl wgahnagl added this to Triage in SIG Node PR Triage Nov 10, 2021
@wgahnagl
Contributor

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 10, 2021
@wgahnagl wgahnagl moved this from Triage to Needs Reviewer in SIG Node PR Triage Nov 10, 2021
@wgahnagl wgahnagl moved this from Needs Reviewer to Waiting on Author in SIG Node PR Triage Nov 10, 2021
@CatherineF-dev
Contributor Author

/retest

@pacoxu
Member

pacoxu commented Nov 15, 2021

/kind failing-test
/lgtm

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 15, 2021
@pacoxu pacoxu moved this from Waiting on Author to Needs Approver in SIG Node PR Triage Nov 15, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 15, 2021
// Use local registry
var registry = compbasemetrics.NewKubeRegistry()
var gather compbasemetrics.Gatherer = registry
defer registry.Reset()
Member

Why do we want to Reset at the end rather than the beginning? It seems to me that what this test func needs is for the count to be zero at the start; it does not care about the count at the end.

Member

@pacoxu pacoxu Nov 15, 2021

Both approaches (deferring the Reset to the end, or resetting at the beginning) seem OK.
I prefer deferring it to the end because we want to fix an issue that occurs when the test case runs multiple times.
A Reset at the end of the test case clears the metrics and environment after the case has run.
It is also a very clear approach.

Member

Doing the Reset at the end works if every test does a Reset at the end. Doing a Reset at the start works regardless of what other tests do. A local condition is better than a global one.

Contributor Author

Both ways are okay.

I prefer defer, because:

  1. Other test files won't affect this file, because the RuntimeOperations metrics are supposed to be tested only in this file.
  2. It requires the tests in this file to Reset at the end. We already do that, since metrics registration appears only once. If metrics registration appeared many times, defer would also keep the code style more consistent.

Member

Both ways will work if, in a given process, no other function adds data to metrics.RuntimeOperations, metrics.RuntimeOperationsDuration, or metrics.RuntimeOperationsErrors before TestRecordOperation runs. That is a global condition. Note that metrics.RuntimeOperations is registered in legacyregistry in some other code invoked by another test. Other test files can affect this one, if the Reset is done at the end in this one. Local conditions are better than global ones.

Contributor

> Both ways will work if, in a given process, no other function adds data to metrics.RuntimeOperations, metrics.RuntimeOperationsDuration, or metrics.RuntimeOperationsErrors before TestRecordOperation runs. That is a global condition. Note that metrics.RuntimeOperations is registered in legacyregistry in some other code invoked by another test. Other test files can affect this one, if the Reset is done at the end in this one. Local conditions are better than global ones.

I find this argument persuasive. Globals present numerous problems, but clearing the state before starting a test keeps the scope of control inside this Test instead of requiring every other test to "be perfect".
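
The local-versus-global argument can be made concrete with a small sketch. Everything here is hypothetical: `hits` stands in for a shared metric, and `pollute` stands in for any other test in the same process that records data and never cleans up.

```go
package main

import "fmt"

// hits is a toy global counter standing in for a shared metric.
var hits int

// pollute models some other test that records data and never cleans up.
func pollute() { hits += 5 }

// checkResetAtEnd assumes the counter is already zero and cleans up afterwards.
// It reports whether the count it observed matched its expectation.
func checkResetAtEnd() bool {
	defer func() { hits = 0 }()
	hits++
	return hits == 1
}

// checkResetAtStart zeroes the counter itself before recording anything.
func checkResetAtStart() bool {
	hits = 0
	hits++
	return hits == 1
}

func main() {
	pollute()
	fmt.Println("reset at end passes:", checkResetAtEnd())
	pollute()
	fmt.Println("reset at start passes:", checkResetAtStart())
}
```

Resetting at the end only helps the next caller, so it is correct only if every earlier test was equally tidy. Resetting at the start makes the check depend on nothing outside the function.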

Member

@aojea aojea Nov 16, 2021

> clearing the state before starting a test keeps the scope of control inside this Test instead of requiring every other test to "be perfect".

yeah, we can't rely on other tests "to do the right thing", but we also need to clean up the state once the test ends, the same as we do with the listeners, for example

defer l.Close()

I feel that we need both

diff --git a/pkg/kubelet/kuberuntime/instrumented_services_test.go b/pkg/kubelet/kuberuntime/instrumented_services_test.go
index e95d6bdb74a..1801f995d5f 100644
--- a/pkg/kubelet/kuberuntime/instrumented_services_test.go
+++ b/pkg/kubelet/kuberuntime/instrumented_services_test.go
@@ -37,6 +37,7 @@ func TestRecordOperation(t *testing.T) {
        registry.MustRegister(metrics.RuntimeOperations)
        registry.MustRegister(metrics.RuntimeOperationsDuration)
        registry.MustRegister(metrics.RuntimeOperationsErrors)
+       registry.Reset()
 
        l, err := net.Listen("tcp", "127.0.0.1:0")

Member

I do not follow the analogy to calling Close. The Close method is about releasing expensive resources that an active connection holds. Reset is not analogous; it does not release expensive resources. I mean, there may be some internal side-effects of resetting some metrics, but it is nothing like open network connections that can accumulate and cause problems.

Member

You are right, my bad, I misinterpreted the reset on metrics.

I agree with you and David: the test has to clear the state before starting and must not depend on other tests doing the same after they finish.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2021
@CatherineF-dev
Contributor Author

Thanks everyone! I have changed it to Reset at the beginning.

@MikeSpreitzer
Member

@CatherineF-dev : thank you for caring and seeing this through to completion!

Member

@MikeSpreitzer MikeSpreitzer left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 16, 2021
@deads2k
Contributor

deads2k commented Nov 16, 2021

/approve

@CatherineF-dev
Contributor Author

/retest

Member

@logicalhan logicalhan left a comment

/lgtm
/approve

@MikeSpreitzer
Member

/assign @derekwaynecarr

@CatherineF-dev
Contributor Author

/assign @derekwaynecarr

Thanks Mike!

@thockin
Member

thockin commented Nov 16, 2021

Approving to allow others to rebase on it.

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CatherineF-dev, deads2k, logicalhan, MikeSpreitzer, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2021
@k8s-ci-robot k8s-ci-robot merged commit 42d8b2f into kubernetes:master Nov 17, 2021
SIG Node CI/Test Board automation moved this from Triage to Done Nov 17, 2021
SIG Node PR Triage automation moved this from Needs Approver to Done Nov 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 17, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-none Denotes a PR that doesn't merit a release note. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

UT failure: panic: duplicate metrics collector registration attempted