Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dumping stacks during klog.Fatal #316

Closed
pohly opened this issue Mar 23, 2022 · 18 comments
Closed

dumping stacks during klog.Fatal #316

pohly opened this issue Mar 23, 2022 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@pohly
Copy link

pohly commented Mar 23, 2022

/kind bug

What steps did you take and what happened:

#79 changed the behavior of klog.Fatal so that it prints all goroutines when writing to stderr (the default). Depending on the program, this generates a huge amount of output which often is harmful (console scrollback buffer overlows, large log files in Kubernetes Prow jobs).

What did you expect to happen:

Only the current goroutine should be printed. This is the original behavior that several users of klog are expecting.

Anything else you would like to add:

A simple revert of #79 fixes this.

However, we can do better than that. In Kubernetes, we already recommend to replace klog.Fatal with a corresponding structured logging calls like klog.ErrorS plus klog.FlushAndExit. That replacement no longer dumps any stack trace. Often this is the right choice, for example after a lost leader election (one of the places where the program has to exit).

But sometimes, a developer might want to dump the current backtrace or all backtraces in the log entry. We could add a helper function for this:

// Backtrace returns an object that represents the stack backtrace of the calling
// goroutine. That object can be used as value in a structured logging call.
// It supports printing as string or as structured output via logr.MarshalLog.
func Backtrace() interface{}

// BacktraceDepth is like Backtrace except that it skips some stack levels.
// BacktraceDepth(0) is equivalent to Backtrace().
func BacktraceDepth(depth int) interface{}

// BacktraceAll is a variant of Backtrace that returns backtraces for all goroutines.
// Beware that this can produce a lot of output, depending on the number
// of goroutines in a program!
func BacktraceAll() interface{}

// BacktraceAllDepth is like BacktraceAll except that it skips some stack levels.
// BacktraceAllDepth(0) is equivalent to BacktraceAll().
func BacktraceAllDepth() interface{}
@pohly
Copy link
Author

pohly commented Mar 23, 2022

Whether structured output is possible remains to be seen. We also would need to decide how to structure it.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 23, 2022
@pchan
Copy link

pchan commented Mar 24, 2022

But sometimes, a developer might want to dump the current backtrace or all backtraces in the log entry. We could add a helper function for this:

// Backtrace returns an object that represents the stack backtrace of the calling
// goroutine. That object can be used as value in a structured logging call.
// It supports printing as string or as structured output via logr.MarshalLog.
func Backtrace() interface{}

I took a cursory look and found that we maybe missing a "line" version. This is present in the other APIs.

For example :
Infoln
Errorln

Do we need such a version for this too ?

@pohly
Copy link
Author

pohly commented Mar 24, 2022

The "line" variants are applicable to functions which do unstructured output formatting. What would be the effect here?

@pohly
Copy link
Author

pohly commented Mar 24, 2022

A better API is probably to use functional options:

func Backtrace(opts ...BacktraceOption) interface{}
func BacktraceSkip(depth int) BacktraceOption
func BacktraceSize(numFrames int) BacktraceOption
func BacktraceAll(all bool) BacktraceOption

This avoids the proliferation of different variants with different semantic and allows us to extend the API in the future.

@pchan
Copy link

pchan commented Mar 30, 2022

A better API is probably to use functional options:

func Backtrace(opts ...BacktraceOption) interface{}
func BacktraceSkip(depth int) BacktraceOption
func BacktraceSize(numFrames int) BacktraceOption
func BacktraceAll(all bool) BacktraceOption

This avoids the proliferation of different variants with different semantic and allows us to extend the API in the future.

I have a couple of questions.

  1. Does skip imply "shave off the top n frames" ?
  2. Will the size do the opposite i.e. return the top numFrames ?
  3. What does "all" bool parameter mean in BackTraceAll ? Is it a toggle for "All". The API itself implies "All" is on right ?

While looking at the implementation specifics, the BacktraceAll (Depth) presents a problem. The current BackTrace goroutine stack versions of depth/numframes size can be accomplished using runtime.callers API, which returns program counters of the calling goroutine stack. But I couldn't locate similar API (like runtime.callers) for all BacktraceAll (depth). The only API which returns all stack info is runtime.stacks(true) which doesn't have a depth option. We need to explore other API, package or roll our own text processing to keep things standardized if we need BacktraceAllDepth.

@pohly
Copy link
Author

pohly commented Mar 30, 2022

Does skip imply "shave off the top n frames" ?

With skip = 0, the backtrace starts at the direct caller. With skip = 1, one level is skipped, etc.

Will the size do the opposite i.e. return the top numFrames ?

No, it will limit how far unwinding goes. It will still start at the usual stack frame. For numFrames = 1, only a single stack frame gets recorded.

This will be useful for testing: the direct caller and (depending on the test code) a few levels above it are fixed, but go too far and we end up in the Go standard library with unknown source code locations. In production it may be useful to avoid excessively large stack backtraces.

What does "all" bool parameter mean in BackTraceAll ?

All goroutines, instead of just the current one. We can rename to "BacktraceAllGoroutines" to make this clearer.

@pchan
Copy link

pchan commented Mar 31, 2022 via email

@pohly
Copy link
Author

pohly commented Apr 1, 2022

BacktraceAllGoroutines has a parameter because is sometimes convenient to write code like this:

allGoroutines := <some expression>
klog.ErrorS(err, "fatal error", "callstack", klog.Backtrace(klog.BacktraceAllGoroutines(allGoroutines)))

This is open for debate. gRPC doesn't use a parameter (for example, WithDisableRetry). In zapr, I used boolean parameters and Tim was fine with that.

@pchan
Copy link

pchan commented Apr 1, 2022 via email

@pchan
Copy link

pchan commented May 7, 2022

I modified the BacktraceAll API signature to reflect the fact that it uses a different mechamism. Also the skip/size will not apply to it. I added a simple implementation as reference in #323 to take this forward.

func Backtrace(opts ...BacktraceOption) interface{}
func BacktraceSkip(depth int) BacktraceOption
func BacktraceSize(numFrames int) BacktraceOption
func BacktraceAll(all bool) interface {}

@pohly
Copy link
Author

pohly commented May 8, 2022

Now BacktraceAll(false) is identical to Backtrace(), right? I find this "more than one way of doing things" unnecessarily complex and prefer to have one function which gets parameterized. That it does different things under the hood is fine and not a reason to have different APIs where the return type is the same.

@pchan
Copy link

pchan commented May 9, 2022

Now BacktraceAll(false) is identical to Backtrace(), right? I find this "more than one way of doing things" unnecessarily complex and prefer to have one function which gets parameterized. That it does different things under the hood is fine and not a reason to have different APIs where the return type is the same.

One of the reasons why I chose this API setup is because "All" is incompatible with skip and size. So we now need to educate the users to not use "All" in conjunction with skip/size, define the behavior if they do so and implement validation if such a combination is used. The following should raise an error:

klog.Backtrace(klog.BacktraceSkip(1), klog.BacktraceAll(true))

Error: Can't use All parameter with Skip or Size`

I found it cleaner to communciate the difference to the user by virtue of the API itself, rather than using documentation and validation. If indeed, BacktraceAll(false) is used it will simply call Backtrace() in the background.

The only caveat is if we the "All" APIs can change to support skip/size in the future, but I am not sure if that makes sense at all. Anyways, I am fine with both approaches. If we decide to use BacktraceAll as an option, we need to decide if we should return an error, when the user exercises a wrong combination.

@pohly
Copy link
Author

pohly commented May 9, 2022

So we now need to educate the users to not use "All" in conjunction with skip/size, define the behavior if they do so and implement validation if such a combination is used.

We can simply define that skip/size don't have an effect when combined with All.

The only caveat is if we the "All" APIs can change to support skip/size in the future

It's not just that. BacktraceAll doesn't support any additional options, should we ever find a need for them. We would have to add BacktraceAllWithOptions.

I still prefer a single, extensible API function.

@pchan
Copy link

pchan commented May 9, 2022

We can simply define that skip/size don't have an effect when combined with All.

Do we need to issue a warning to the user that skip/size is ignored ? Maybe mention that in the head of the log message.

It's not just that. BacktraceAll doesn't support any additional options, should we ever find a need for them. We would have to add BacktraceAllWithOptions.

I still prefer a single, extensible API function.

Sure, I will make it a single API.

@pohly
Copy link
Author

pohly commented May 9, 2022

Do we need to issue a warning to the user that skip/size is ignored ? Maybe mention that in the head of the log message

I wouldn't mention it. I someone intentionally does it (for example, because "all" gets added based on some flag), then such a warning would just be noise. If it was unintentional, then it will be obvious from the output that they were skipped.

@pchan
Copy link

pchan commented May 11, 2022

I have updated the APIs to reflect the discussion. I will still need to add unit-tests, will be doing it in the next patch.

@pohly
Copy link
Author

pohly commented Jun 22, 2022

@k8s-ci-robot
Copy link

@pohly: Closing this issue.

In response to this:

/close

Resolved by:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants