
kube-controller-manager yields whole stack trace when it loses leaderelection #107665

Closed
ialidzhikov opened this issue Jan 20, 2022 · 10 comments · Fixed by #107724
Labels: kind/bug, sig/api-machinery, triage/accepted

Comments

@ialidzhikov
Contributor

What happened?

kube-controller-manager yields the whole stack trace when it loses leaderelection.

What did you expect to happen?

kube-controller-manager to only log `leaderelection lost` and exit (without logging ~17K lines of stack trace that are not useful to anyone).

How can we reproduce it (as minimally and precisely as possible)?

E0117 14:49:11.977837       1 leaderelection.go:361] Failed to update lock: resource name may not be empty
I0117 14:49:11.977913       1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F0117 14:49:11.978122       1 controllermanager.go:293] leaderelection lost
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000130001, 0xc03a3fef20, 0x4c, 0x143)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x6a66080, 0xc000000003, 0x0, 0x0, 0xc03b46c7e0, 0x68dd348, 0x14, 0x125, 0x0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printf(0x6a66080, 0x3, 0x0, 0x0, 0x445b035, 0x13, 0x0, 0x0, 0x0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:733 +0x17a
k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatalf(...)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1456
k8s.io/kubernetes/cmd/kube-controller-manager/app.Run.func2()
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go:293 +0x73
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc00032b8c0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0x29
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc00032b8c0, 0x4a824a0, 0xc000499100)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:210 +0x15d
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x4a824e0, 0xc00012e018, 0x4ab8720, 0xc001352f80, 0x37e11d600, 0x2540be400, 0x77359400, 0xc001352f60, 0x45c4378, 0x0, ...)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:222 +0x9c
k8s.io/kubernetes/cmd/kube-controller-manager/app.Run(0xc000131528, 0xc0001140c0, 0xc00054b850, 0xc00018c530)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go:285 +0x979
k8s.io/kubernetes/cmd/kube-controller-manager/app.NewControllerManagerCommand.func2(0xc00004cb00, 0xc000d39680, 0x0, 0x28)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go:124 +0x2b7
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc00004cb00, 0xc000136010, 0x28, 0x2b, 0xc00004cb00, 0xc000136010)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:846 +0x2c2
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00004cb00, 0x16cb1631aab94aa9, 0x6a65ce0, 0x406525)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:950 +0x375
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:887
main.main()
	_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-controller-manager/controller-manager.go:46 +0xe5

goroutine 18 [chan receive]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).flushDaemon(0x6a66080)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1131 +0x8b
created by k8s.io/kubernetes/vendor/k8s.io/klog/v2.init.0
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:416 +0xd8

goroutine 107 [chan receive]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates.(*ConfigMapCAController).Run(0xc0007394a0, 0x1, 0xc0001140c0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:222 +0x365
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates.unionCAContent.Run
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/union_content.go:104 +0xcb

goroutine 126 [select]:
k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc0006623c0)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:231 +0x405
created by k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue.newDelayingQueue
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:68 +0x185

goroutine 130 [select]:
k8s.io/kubernetes/vendor/go.opencensus.io/stats/view.(*worker).start(0xc000354f00)
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/go.opencensus.io/stats/view/worker.go:154 +0x105
created by k8s.io/kubernetes/vendor/go.opencensus.io/stats/view.init.0
	/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/go.opencensus.io/stats/view/worker.go:32 +0x57

kube-controller-manager yields around 17K lines of stack trace (omitted in the above example).

Anything else we need to know?

No response

Kubernetes version

v1.20.12

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@ialidzhikov added the kind/bug label on Jan 20, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels on Jan 20, 2022
@neolit123
Member

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label on Jan 20, 2022
@fedebongio
Contributor

/assign @jiahuif @cheftako
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jan 20, 2022
@liggitt
Member

liggitt commented Jan 22, 2022

has Fatalf always triggered a stack dump of all goroutines?

@kkkkun
Member

kkkkun commented Jan 23, 2022

/cc

@MadhavJivrajani
Contributor

MadhavJivrajani commented Jan 24, 2022

@liggitt
Seems like klog.Fatalf(...) does a dump of all goroutines:

// Fatalf logs to the FATAL, ERROR, WARNING, and INFO logs,
// including a stack trace of all running goroutines, then calls os.Exit(255).
// Arguments are handled in the manner of fmt.Printf; a newline is appended if missing.
func Fatalf(format string, args ...interface{}) {
	logging.printf(fatalLog, logging.logr, logging.filter, format, args...)
}

Could we use klog.ErrorS(...) and then os.Exit(255) instead, maybe?
/cc @pohly in case you have thoughts about this
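
For illustration, a minimal sketch of that suggestion, assuming only the public klog API (the onStoppedLeading wrapper and the nil error are placeholders; only the "leaderelection lost" message comes from the existing call site):

package main

import (
	"os"

	"k8s.io/klog/v2"
)

// onStoppedLeading sketches the proposed replacement for klog.Fatalf:
// log the event structurally, flush buffered log lines, then exit with
// the same code Fatalf uses, without dumping every goroutine.
func onStoppedLeading() {
	klog.ErrorS(nil, "leaderelection lost")
	klog.Flush()
	os.Exit(255)
}

func main() {
	onStoppedLeading()
}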

@pohly
Contributor

pohly commented Jan 24, 2022

has Fatalf always triggered a stack dump of all goroutines?

If I read the original code right, it only printed the current goroutine to stderr. Dumping everything seems to have been added later.

ErrorS + os.Exit make sense to me.

@pohly
Contributor

pohly commented Jan 24, 2022

Dumping of all goroutines was added in kubernetes/klog#79.

@kkkkun
Member

kkkkun commented Jan 24, 2022

I had the same question. I'd like to fix it.
/assign

@mohag

mohag commented Mar 24, 2022

kube-scheduler seems to have a similar issue (but with a shorter stack trace).

@pohly
Contributor

pohly commented Mar 24, 2022

When replacing klog.Fatal (worth doing because klog.Fatal doesn't do structured logging, not just because of the stack trace problem), remember to use klog.Error + the new klog.FlushAndExit. Updated migration instructions are pending in kubernetes/community#6560.

There's also kubernetes/klog#316 - reverting the klog change might be a faster way to address this.
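
A rough sketch of that migration pattern, assuming klog v2's FlushAndExit helper and ExitFlushTimeout default (ErrorS is used here as the structured counterpart of Error; the onStoppedLeading wrapper is again a placeholder):

package main

import (
	"k8s.io/klog/v2"
)

// onStoppedLeading sketches the structured-logging replacement: report the
// error, then let klog flush and exit instead of calling klog.Fatal, so no
// goroutine dump is written.
func onStoppedLeading() {
	klog.ErrorS(nil, "leaderelection lost")
	klog.FlushAndExit(klog.ExitFlushTimeout, 1)
}

func main() {
	onStoppedLeading()
}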
