Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ktesting: tone down warning about leaked test goroutine #401

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
28 changes: 24 additions & 4 deletions ktesting/testinglogger.go
Expand Up @@ -217,8 +217,19 @@ type tloggerShared struct {
// t gets cleared when the test is completed.
t TL

// We warn once when a leaked goroutine is detected because
// it logs after test completion.
// The time when the test completed.
stopTime time.Time

// We warn once when a leaked goroutine logs after test completion.
//
// Not terminating immediately is fairly normal: many controllers
// log "terminating" messages while they shut down, which often is
// right after test completion, if that is when the test cancels the
// context and then doesn't wait for goroutines (which is often
// not possible).
//
// Therefore there is the [stopGracePeriod] during which messages get
// logged to the global logger without the warning.
goroutineWarningDone bool

formatter serialize.Formatter
Expand All @@ -228,10 +239,15 @@ type tloggerShared struct {
callDepth int
}

// Log output of a leaked goroutine during this period after test completion
// does not trigger the warning.
const stopGracePeriod = 10 * time.Second

func (ls *tloggerShared) stop() {
ls.mutex.Lock()
defer ls.mutex.Unlock()
ls.t = nil
ls.stopTime = time.Now()
}

// tlogger is the actual LogSink implementation.
Expand All @@ -250,8 +266,12 @@ func (l tlogger) fallbackLogger() logr.Logger {
logger = logger.WithCallDepth(l.shared.callDepth + 1)

if !l.shared.goroutineWarningDone {
logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false)))
l.shared.goroutineWarningDone = true
duration := time.Since(l.shared.stopTime)
if duration > stopGracePeriod {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems will bury the real goroutine leakage if the duration is less than stopGracePeriod, do I understand right here?

Another point I can't get here is if the test completes, why will it wait for another 10s?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the "leaked" goroutine stops after less than stopGracePeriod, then it wasn't really a leak. It was "just" a test that didn't block and wait for goroutines at the end. This is very common.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally misunderstood the meaning here. I'm not quite understand how it works but I still have a silly question like: Will we wait and detect the goroutine leakage after test completes? Or why the duration from stop will be greater than stopGracePeriod.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't wait. That is something that uber/goleak does, but not this code. This code is merely reporting on leaked goroutines if and when it detects them, which is when they keep logging.

The stopGracePeriod is the grace period that goroutines have before their log output triggers this warning. I can rename it if you have a better suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recalled goleak has maxRetries, then can't we just configure the goleak? It's configurable now. Both the retry times and sleep duration. I think somehow the same as stopGracePeriod. Will this lead to other problems?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is klog. We cannot depend on goleak here, nor do I want to re-implement it.

The reason why this warning exists is because the per-test logger has to do something. It cannot call testing.T.Log after the test that testing.T belongs to has terminated, it would crash - I know, because that has happened initially.

When it detects that a test has stopped, it handles output differently and writes to stderr. Issuing this warning seemed usedful, but printing it immediately has confused Kubernetes developers because the goroutine hasn't really leaked. This PR is an attempt to keep the warning for goroutines that are more likely to be real leaks while not printing it for goroutines which terminate quickly after test completion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the warning for goroutines that are more likely to be real leaks while not printing it for goroutines which terminate quickly after test completion.

Definitely agree.

We cannot depend on goleak here

Yes for klog... I mean for callers, like test framework. I suggested in this way is just because the goroutine leakage detection duration is mostly decided by goleak, the maxRetries(default to 20), maxSleep(default to 100ms) specifically, the duration ~= maxRetries * (maxSleep + filterStacks), have no idea whether 10s fits here. However, the feature is reverted... uber-go/goleak#116, so no way to configure now.

As a user of klog, happy with either way to avoid confusion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree.

So LGTM?

I mean for callers, like test framework.

Leak detection with goleak only works when tests run sequentially or in TestMain. I'm not sure how useful it is in an E2E test suite, because the actual code under test runs elsewhere. For integration and unit tests, individual test owners have to integrate goleak. Some apiserver-based tests have it.

have no idea whether 10s fits here

So this is about how klog will interact with goleak. I don't have a strong opinion here. I picked 10 seconds because it should be long enough to no produce "false positives" for the "I am shutting down after test completion" goroutines and short enough to give feedback for others. Because it's not causing a test failure, it can be lower than the goleak timeouts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm


logger.WithCallDepth(1).Info("WARNING: test kept at least one goroutine running after test completion", "timeSinceCompletion", duration.Round(time.Second), "callstack", string(dbg.Stacks(false)))
l.shared.goroutineWarningDone = true
}
}
return logger
}
Expand Down
15 changes: 10 additions & 5 deletions ktesting/testinglogger_test.go
Expand Up @@ -16,6 +16,7 @@ import (
"runtime"
"sync"
"testing"
"time"

"k8s.io/klog/v2"
"k8s.io/klog/v2/internal/test/require"
Expand Down Expand Up @@ -198,6 +199,7 @@ func TestStop(t *testing.T) {
_, _, line, _ = runtime.Caller(0)
logger.Info("simple info message")
logger.Error(nil, "error message")
time.Sleep(15 * time.Second)
logger.WithName("me").WithValues("completed", true).Info("complex info message", "anotherValue", 1)
}()
})
Expand All @@ -212,11 +214,14 @@ func TestStop(t *testing.T) {
// Strip time and pid prefix.
actual = regexp.MustCompile(`(?m)^.* testinglogger_test.go:`).ReplaceAllString(actual, `testinglogger_test.go:`)

// Strip duration.
actual = regexp.MustCompile(`timeSinceCompletion="\d+s"`).ReplaceAllString(actual, `timeSinceCompletion="<...>s"`)

// All lines from the callstack get stripped. We can be sure that it was non-empty because otherwise we wouldn't
// have the < > markers.
//
// Full output:
// testinglogger_test.go:194] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine" callstack=<
// testinglogger_test.go:194] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine.me" completed=true timeSinceCompletion="15s" callstack=<
// goroutine 23 [running]:
// k8s.io/klog/v2/internal/dbg.Stacks(0x0)
// /nvme/gopath/src/k8s.io/klog/internal/dbg/dbg.go:34 +0x8a
Expand All @@ -233,13 +238,13 @@ func TestStop(t *testing.T) {
// >
actual = regexp.MustCompile(`(?m)^\t.*?\n`).ReplaceAllString(actual, ``)

expected := fmt.Sprintf(`testinglogger_test.go:%d] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine" callstack=<
>
testinglogger_test.go:%d] "simple info message" logger="TestStop/Sub leaked goroutine"
expected := fmt.Sprintf(`testinglogger_test.go:%d] "simple info message" logger="TestStop/Sub leaked goroutine"
testinglogger_test.go:%d] "error message" logger="TestStop/Sub leaked goroutine"
testinglogger_test.go:%d] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine.me" completed=true timeSinceCompletion="<...>s" callstack=<
>
testinglogger_test.go:%d] "complex info message" logger="TestStop/Sub leaked goroutine.me" completed=true anotherValue=1
`,
line+1, line+1, line+2, line+3)
line+1, line+2, line+4, line+4)
if actual != expected {
t.Errorf("Output does not match. Expected:\n%s\nActual:\n%s\n", expected, actual)
}
Expand Down