
Goroutine leakage with TestFrameworkHandler_IterateOverWaitingPods #123707

Open

kerthcet opened this issue Mar 5, 2024 · 5 comments
Labels
kind/cleanup    Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/flake      Categorizes issue or PR as related to a flaky test.
needs-triage    Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/scheduling  Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@kerthcet
Member

kerthcet commented Mar 5, 2024

Which jobs are failing?

Reproduce with go test -v -race ./pkg/scheduler -run TestFrameworkHandler_IterateOverWaitingPods -count=1

Copied from #123686 (comment)

There is still a chance of receiving the leaked-goroutine error below on my Mac, even though the test succeeds:

    --- PASS: TestFrameworkHandler_IterateOverWaitingPods/pods_with_different_profiles_are_waiting_on_permit_stage (0.10s)
E0305 15:13:34.760997   76753 framework.go:1500] "WARNING: test kept at least one goroutine running after test completion" logger="TestFrameworkHandler_IterateOverWaitingPods/pods_with_different_profiles_are_waiting_on_permit_stage leaked goroutine" pod="pod4" callstack=<
	goroutine 359 [running]:
	k8s.io/klog/v2/internal/dbg.Stacks(0x0)
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/k8s.io/klog/v2@v2.120.1/internal/dbg/dbg.go:35 +0x8c
	k8s.io/klog/v2/ktesting.tlogger.fallbackLogger({0x140006b4480, {0x0, 0x0}, {0x140004a0880, 0x2, 0x2}})
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/k8s.io/klog/v2@v2.120.1/ktesting/testinglogger.go:253 +0x1b4
	k8s.io/klog/v2/ktesting.tlogger.Info({0x140006b4480, {0x0, 0x0}, {0x140004a0880, 0x2, 0x2}}, 0x4, {0x1040a5591, 0x15}, {0x140007b9d60, ...})
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/k8s.io/klog/v2@v2.120.1/ktesting/testinglogger.go:277 +0x11c
	github.com/go-logr/logr.Logger.Info({{0x104a7b8a8?, 0x14000694360?}, 0x0?}, {0x1040a5591, 0x15}, {0x140007b9d60, 0x2, 0x2})
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/github.com/go-logr/logr@v1.4.1/logr.go:280 +0xe8
	k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).WaitOnPermit(0x14000292fc8, {0x104a75f20, 0x140007ac280}, 0x14000406d88)
		/Users/archelurong/go/src/kubernetes/pkg/scheduler/framework/runtime/framework.go:1500 +0x1a8
	k8s.io/kubernetes/pkg/scheduler.(*Scheduler).bindingCycle(0x140005a38c0, {0x104a75f20, 0x140007ac280}, 0x14000992180, {0x104a9a9c0, 0x14000292fc8}, {{0x104089a73, 0x5}, 0x1, 0x1, ...}, ...)
		/Users/archelurong/go/src/kubernetes/pkg/scheduler/schedule_one.go:278 +0xa8
	k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne.func1()
		/Users/archelurong/go/src/kubernetes/pkg/scheduler/schedule_one.go:125 +0x194
	created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne in goroutine 270
		/Users/archelurong/go/src/kubernetes/pkg/scheduler/schedule_one.go:118 +0x620
 >
PASS

Each pod waits in the binding cycle in its own goroutine; is it possible that these goroutines leak? (A simplified sketch of the leak pattern follows the second log further below.)
So at the end of the test I allowed these pods to pass the permit stage (using the code below), but I also got new errors:

		// Allow every pod waiting on permit in profile 1 so the binding-cycle
		// goroutines can finish.
		fwk1, ok := scheduler.Profiles[testSchedulerProfile1]
		if !ok {
			t.Fatalf("unable to get profile")
		}
		fwk1.IterateOverWaitingPods(func(pod framework.WaitingPod) {
			pod.Allow(fakePermit)
		})

		// Wait until every pod has been reported as scheduled before the
		// test returns.
		wg.Add(waitSchedulingPodNumber)
		stopFn2, err := eventBroadcaster.StartEventWatcher(func(obj runtime.Object) {
			e, ok := obj.(*eventsv1.Event)
			if !ok || (e.Reason != "Scheduled") {
				return
			}
			wg.Done()
		})
		if err != nil {
			t.Fatal(err)
		}
		defer stopFn2()

		wg.Wait()

The new leak error is as follows:

=== RUN   TestFrameworkHandler_IterateOverWaitingPods/pods_with_different_profiles_are_waiting_on_permit_stage
E0305 15:10:07.033132   74647 scheduling_queue.go:890] "WARNING: test kept at least one goroutine running after test completion" logger="TestFrameworkHandler_IterateOverWaitingPods/pods_with_same_profile_are_waiting_on_permit_stage leaked goroutine" callstack=<
	goroutine 290 [running]:
	k8s.io/klog/v2/internal/dbg.Stacks(0x0)
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/k8s.io/klog/v2@v2.120.1/internal/dbg/dbg.go:35 +0x8c
	k8s.io/klog/v2/ktesting.tlogger.fallbackLogger({0x140007947e0, {0x0, 0x0}, {0x0, 0x0, 0x0}})
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/k8s.io/klog/v2@v2.120.1/ktesting/testinglogger.go:253 +0x1b4
	k8s.io/klog/v2/ktesting.tlogger.Info({0x140007947e0, {0x0, 0x0}, {0x0, 0x0, 0x0}}, 0x2, {0x103ecea58, 0x1a}, {0x0, ...})
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/k8s.io/klog/v2@v2.120.1/ktesting/testinglogger.go:277 +0x11c
	github.com/go-logr/logr.Logger.Info({{0x10489b8a8?, 0x140002eab70?}, 0x14000b219d8?}, {0x103ecea58, 0x1a}, {0x0, 0x0, 0x0})
		/Users/archelurong/.gvm/pkgsets/go1.20/global/pkg/mod/github.com/go-logr/logr@v1.4.1/logr.go:280 +0xe8
	k8s.io/kubernetes/pkg/scheduler/internal/queue.(*PriorityQueue).Pop(0x14000674000, {{0x10489b8a8?, 0x140002eab70?}, 0x140003da050?})
		/Users/archelurong/go/src/kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go:890 +0xec
	k8s.io/kubernetes/pkg/scheduler.(*Scheduler).scheduleOne(0x140006d6180, {0x104895f20, 0x140003da050})
		/Users/archelurong/go/src/kubernetes/pkg/scheduler/schedule_one.go:68 +0x5c
	k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
		/Users/archelurong/go/src/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x2c
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x14000b21ec8?)
		/Users/archelurong/go/src/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x40
	k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x14000b21f68, {0x104870a48, 0x14000a98420}, 0x1, 0x14000176900)
		/Users/archelurong/go/src/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0x90
	k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x14000a8c768, 0x0, 0x0, 0x1, 0x14000176900)
		/Users/archelurong/go/src/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x80
	k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x104895f20, 0x140003da050}, 0x1400061c1f0, 0x0, 0x0, 0x1)
		/Users/archelurong/go/src/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x80
	k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
		/Users/archelurong/go/src/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/backoff.go:170
	created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run in goroutine 218
		/Users/archelurong/go/src/kubernetes/pkg/scheduler/scheduler.go:418 +0x104
 >
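
As a simplified, hypothetical illustration of the leak pattern (not the actual scheduler or test code; all names below are made up for the sketch): the test hands its ktesting logger to a goroutine that blocks waiting on a permit, and whether that goroutine's final log call happens before or after the test returns is a race.

	package leaksketch

	import (
		"testing"
		"time"

		"k8s.io/klog/v2/ktesting"
	)

	func TestPermitGoroutineOutlivesTest(t *testing.T) {
		// ktesting routes log output to t while the test runs and falls back
		// to klog (printing the leak warning) once the test has completed.
		logger, _ := ktesting.NewTestContext(t)

		allow := make(chan struct{})

		// Stand-in for the binding cycle spawned by scheduleOne: it blocks in
		// a WaitOnPermit-like select and logs once the pod is allowed.
		go func() {
			select {
			case <-allow:
				logger.Info("pod allowed") // may run after the test has returned
			case <-time.After(time.Minute):
				logger.Info("permit timed out")
			}
		}()

		// Allow the pod but return without waiting for the goroutine; whether
		// the Info call above happens before or after test completion is a
		// race, which matches the flakiness. Waiting on a sync.WaitGroup
		// before returning would remove the leak.
		close(allow)
	}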

Which tests are failing?

None

Since when has it been failing?

None

Testgrid link

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Relevant SIG(s)

/sig scheduling
cc @NoicFank

@kerthcet kerthcet added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Mar 5, 2024
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 5, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kerthcet
Member Author

kerthcet commented Mar 5, 2024

/remove-kind failing-test
/kind cleanup

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Mar 5, 2024
@aojea
Member

aojea commented Mar 5, 2024

/kind flake
/cc @pohly

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Mar 5, 2024
@pohly
Contributor

pohly commented Mar 5, 2024

WARNING: test kept at least one goroutine running after test completion

This is pretty normal in Kubernetes because very few of our objects support waiting for completion of their goroutines. I probably need to tone down this warning a bit further to avoid confusion (a rough sketch of the idea follows the list):

  • Don't log it as an error.
  • Only print the warning if the log message that triggers it is emitted long after test completion - 10 seconds?
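
A rough sketch of that idea (hypothetical names and parameters, not the actual klog/ktesting code):

	package ktestingsketch

	import (
		"time"

		"k8s.io/klog/v2"
	)

	// Hypothetical grace period: goroutines that log within this window after
	// the test finishes are treated as normal cleanup and stay quiet.
	const leakGracePeriod = 10 * time.Second

	// fallbackWarn sketches the proposed behaviour: only warn (not error) when
	// a leaked goroutine logs well after the test has completed. The parameters
	// are placeholders, not the real tlogger fields.
	func fallbackWarn(testFinishedAt time.Time, testName, msg string) {
		elapsed := time.Since(testFinishedAt)
		if elapsed < leakGracePeriod {
			// Expected in Kubernetes: most objects do not wait for their
			// goroutines, so a brief overlap is not worth a warning.
			return
		}
		klog.InfoS("WARNING: test kept at least one goroutine running after test completion",
			"test", testName, "elapsedSinceCompletion", elapsed, "msg", msg)
	}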

@pohly
Contributor

pohly commented Mar 11, 2024

See kubernetes/klog#401
