
Klog panic lead to controller WaitForNamedCacheSync never returned #106534

Closed
astraw99 opened this issue Nov 18, 2021 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation.

Comments

@astraw99
Member

astraw99 commented Nov 18, 2021

What happened?

In our K8s cluster, we set a custom logger via klog.SetLogger(selfLogger). We then observed our controller pod running, but its log kept repeating Waiting for caches to sync for... and never progressed.
After debugging, we found that a panic in the klog Error function (implemented by our logger) left klog's logging mutex locked, so every subsequent log call blocks waiting for it.
https://github.com/kubernetes/kubernetes/blob/b8af116327cd5d8e5411cbac04e7d4d11d22485d/vendor/k8s.io/klog/v2/klog.go#L931

A Go pprof dump shows the goroutine blocked on the logging mutex, which is still waiting to be unlocked:

goroutine 142 [semacquire, 15 minutes]:                                                                                       
sync.runtime_SemacquireMutex(0x2d7c2dc, 0x0, 0x1)                                                                                     
        runtime/sema.go:71 +0x47                                                                                                      
sync.(*Mutex).lockSlow(0x2d7c2d8)                                                                            
        sync/mutex.go:138 +0x105                                                                                                      
sync.(*Mutex).Lock(...)                                                                                                
        sync/mutex.go:81                                                                                                              
k8s.io/klog/v2.(*loggingT).output(0x2d7c2c0, 0xc000000000, 0x20a5b00, 0xc000103f08, 0xc0003e02a0, 0x24aceff, 0x12, 0xf7, 0x0)
        k8s.io/klog/v2@v2.4.0/klog.go:908 +0x870                                                                       
k8s.io/klog/v2.(*loggingT).printf(0x2d7c2c0, 0xc000000000, 0x20a5b00, 0xc000103f08, 0x0, 0x0, 0x1e104ba, 0x19, 0xc0005c4360, 0x1, ...)
        k8s.io/klog/v2@v2.4.0/klog.go:750 +0x191                                                                       
k8s.io/klog/v2.Infof(...)                                                                             
        k8s.io/klog/v2@v2.4.0/klog.go:1402                                                                   
k8s.io/client-go/tools/cache.WaitForNamedCacheSync(0xc0000512c0, 0x4a, 0xc0000a6660, 0xc0005de760, 0x1, 0x1, 0x1dfc329)
        k8s.io/client-go@v11.0.0+incompatible/tools/cache/shared_informer.go:247 +0x206                                      
k8s.io/apiserver/pkg/server/dynamiccertificates.(*ConfigMapCAController).Run(0xc0004c8dc0, 0x1, 0xc0000a6660)
        k8s.io/apiserver@v0.19.8/pkg/server/dynamiccertificates/configmap_cafile_content.go:209 +0x2b4                                
created by k8s.io/apiserver/pkg/server/dynamiccertificates.unionCAContent.Run            
        k8s.io/apiserver@v0.19.8/pkg/server/dynamiccertificates/union_content.go:104 +0xcb

What did you expect to happen?

klog should handle a panic in the logger and release the lock promptly, instead of leaving it held forever.

How can we reproduce it (as minimally and precisely as possible)?

Set a custom logger via klog.SetLogger(selfLogger) whose Error func panics; the hang described above then occurs.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.4-tke.3", GitCommit:"f8797eb035aca9c928af02ea9a74d232cfd41bc6", GitTreeState:"clean", BuildDate:"2020-10-12T03:32:01Z", GoVersion:"go1.14.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.18) exceeds the supported minor version skew of +/-1
@astraw99 astraw99 added the kind/bug Categorizes issue or PR as related to a bug. label Nov 18, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 18, 2021
@k8s-ci-robot
Contributor

@astraw99: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 18, 2021
@BenTheElder
Member

This issue probably belongs in https://github.com/kubernetes/klog/ instead

@astraw99
Member Author

This issue probably belongs in https://github.com/kubernetes/klog/ instead

Got it, new PR: kubernetes/klog#272

@pohly
Contributor

pohly commented Nov 19, 2021

After debugging, we found that a panic in the klog Error function (implemented by our logger) left klog's logging mutex locked, so every subsequent log call blocks waiting for it.

If the logger panicked, why did that not kill the process?

IMHO a logger implementation should never panic, and if it does, that's so fatal that the code using it doesn't need to handle it gracefully.

@astraw99
Member Author

After debugging, we found that a panic in the klog Error function (implemented by our logger) left klog's logging mutex locked, so every subsequent log call blocks waiting for it.

If the logger panicked, why did that not kill the process?

IMHO a logger implementation should never panic, and if it does, that's so fatal that the code using it doesn't need to handle it gracefully.

Yes, updated the PR to abort the process.
The problem here is that the panicking goroutine never releases the lock, so we need to make sure it is unlocked (or the process aborted) safely.

@neolit123
Member

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.4-tke.3", GitCommit:"f8797eb035aca9c928af02ea9a74d232cfd41bc6", GitTreeState:"clean", BuildDate:"2020-10-12T03:32:01Z", GoVersion:"go1.14.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.18) exceeds the supported minor version skew of +/-1

Side note: as the warning above says, this is an unsupported version skew.

/sig instrumentation

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 22, 2021
@ehashman
Member

ehashman commented Dec 2, 2021

The server version is unsupported (1.18), and this seems like a klog bug rather than a Kubernetes bug, as this is a custom controller. Closing.

/close

@k8s-ci-robot
Contributor

@ehashman: Closing this issue.

In response to this:

Server version is unsupported (1.18) and this seems like a klog bug rather than Kubernetes bug as this is a custom controller. Closing.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
