Optimize AddCaller+AddStacktrace by sharing stacks #1052

Merged
merged 7 commits into master from abg/share-stack-trace on Jan 27, 2022

Conversation

@abhinav (Collaborator) commented Jan 25, 2022

(Supersedes #808.)

If we use the AddCaller(true) option for a logger with
AddStacktrace(level), which captures a stack trace for the specified
level or higher, Zap currently inspects the stack twice:

  • getCallerFrame retrieves information about the immediate caller of
    the log entry
  • StackSkip calls takeStacktrace to capture the entire stack trace
    and build a string from it

For the case where both caller and stack trace are requested,
we can avoid redundant work by sharing information about the call stack
between the logic for the two.

To accomplish this, the following high-level pieces were added:

type stacktrace
  captureStacktrace(skip int, depth stacktraceDepth) *stacktrace
  func (*stacktrace) Next() (runtime.Frame, bool)

type stackFormatter
  func newStackFormatter(*buffer.Buffer)
  func (*stackFormatter) FormatStack(*stacktrace)
  func (*stackFormatter) FormatFrame(runtime.Frame)

captureStacktrace captures stack traces (either just one frame or the
entire stack) that can then be inspected manually (to extract caller
information) and formatted into a string representation either
one-frame-at-a-time or wholesale with stackFormatter.

This allows us to reorganize the logic that applies the AddCaller and
AddStacktrace options (sketched below) so that it can:

  • capture as much of the stack as it needs
  • extract caller information from the first frame if we need it
  • format that frame and the remaining frames into a stack trace
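
To make the flow concrete, here is a rough sketch (not the exact code in this PR) of how the option-application path can share one captured stack. captureStacktrace, Next, newStackFormatter, FormatFrame, and FormatStack are the pieces outlined above; stacktraceFirst, stacktraceFull, callerSkipOffset, Free, and bufferpool are assumed helper names.

```go
// Sketch only: shares a single captured stack between caller extraction
// and stack-trace formatting. Helper names marked below are assumptions.
func addCallerAndStack(ent *zapcore.Entry, addCaller, addStack bool) {
	stackDepth := stacktraceFirst // assumed constant: capture only one frame
	if addStack {
		stackDepth = stacktraceFull // assumed constant: capture the whole stack
	}

	stack := captureStacktrace(callerSkipOffset, stackDepth) // callerSkipOffset is assumed
	defer stack.Free()                                       // assumed: captured stacks are pooled

	// The first frame is the immediate caller of the log entry.
	frame, more := stack.Next()
	if addCaller {
		ent.Caller = zapcore.NewEntryCaller(frame.PC, frame.File, frame.Line, true)
	}

	if addStack {
		buf := bufferpool.Get() // assumed: zap's internal buffer pool
		defer buf.Free()

		sf := newStackFormatter(buf)
		sf.FormatFrame(frame) // reuse the frame we already decoded for the caller
		if more {
			sf.FormatStack(stack) // format the remaining frames
		}
		ent.Stack = buf.String()
	}
}
```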

Benchmarking:
I ran the new benchmark included in this PR, as well as the existing
stack-trace and caller benchmarks, 10 times for 10 seconds each.

% go test -run '^$' -bench 'Caller|Stack' -benchmem -benchtime 10s -count 10
[...]
name                       old time/op    new time/op    delta
StackField-96                 834ns ± 8%     819ns ±14%     ~     (p=0.971 n=10+10)
AddCallerHook-96              466ns ± 7%     480ns ± 1%     ~     (p=0.118 n=10+6)
AddCallerAndStacktrace-96    1.16µs ± 4%    0.78µs ± 8%  -32.60%  (p=0.000 n=9+10)
TakeStacktrace-96            2.66µs ± 2%    2.74µs ± 2%   +3.01%  (p=0.000 n=9+10)

name                       old alloc/op   new alloc/op   delta
StackField-96                  542B ± 0%      544B ± 0%   +0.33%  (p=0.000 n=10+10)
AddCallerHook-96               254B ± 0%      248B ± 0%   -2.36%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      652B ± 0%      429B ± 0%  -34.20%  (p=0.000 n=9+6)
TakeStacktrace-96              420B ± 0%      421B ± 0%   +0.24%  (p=0.000 n=10+10)

name                       old allocs/op  new allocs/op  delta
StackField-96                  3.00 ± 0%      3.00 ± 0%     ~     (all equal)
AddCallerHook-96               3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)
TakeStacktrace-96              2.00 ± 0%      2.00 ± 0%     ~     (all equal)

Summary:

  • AddCaller+AddStacktrace takes 2 fewer allocations and is ~30% faster
  • AddCaller costs one allocation less
  • takeStacktrace costs 3% more CPU

I think that the overall improvement is worth the slightly more expensive stack traces.

Refs GO-1181

Adds a benchmark that tests the performance of a Zap logger with both the
`AddCaller` and `AddStacktrace` options enabled.
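
The actual benchmark added in this PR may differ, but a minimal version of such a benchmark, writing to `io.Discard`, could look like this:

```go
package benchmarks

import (
	"io"
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func BenchmarkAddCallerAndStacktrace(b *testing.B) {
	// Logger that records the caller for every entry and a full stack
	// trace for entries at Warn level or above.
	logger := zap.New(
		zapcore.NewCore(
			zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
			zapcore.AddSync(io.Discard),
			zapcore.InfoLevel,
		),
		zap.AddCaller(),
		zap.AddStacktrace(zapcore.WarnLevel),
	)
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// Warn triggers both the caller and the stack-trace paths.
			logger.Warn("warn message with caller and stacktrace")
		}
	})
}
```
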
If we use the `AddCaller(true)` option for a logger with
`AddStacktrace(level)`, which captures a stack trace for the specified
level or higher, Zap currently inspects the stack twice:

- `getCallerFrame` retrieves information about the immediate caller of
  the log entry
- `StackSkip` calls `takeStacktrace` to capture the entire stack trace
  and build a string from it

For the case where both caller and stack trace are requested,
we can avoid redundant work by sharing information about the call stack
between the logic for the two.

To accomplish this, the following high-level pieces were added:

    type stacktrace
      captureStacktrace(skip int, depth stacktraceDepth) *stacktrace
      func (*stacktrace) Next() (runtime.Frame, bool)

    type stackFormatter
      func newStackFormatter(*buffer.Buffer)
      func (*stackFormatter) FormatStack(*stacktrace)
      func (*stackFormatter) FormatFrame(runtime.Frame)

`captureStacktrace` captures stack traces (either just one frame or the
entire stack) that can then be inspected manually (to extract caller
information) and formatted into a string representation either
one-frame-at-a-time or wholesale with `stackFormatter`.
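
For illustration, a captured stack can be walked frame by frame with `Next`, following the same convention as `runtime.Frames` (the returned bool reports whether more frames remain). This is only a sketch that would live alongside these helpers inside zap; `stacktraceFull` is an assumed name for the full-depth value of `stacktraceDepth`.

```go
// Sketch only: walk every frame of a captured stack.
stack := captureStacktrace(0 /* skip */, stacktraceFull)
for {
	frame, more := stack.Next()
	fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
	if !more {
		break
	}
}
```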

This allows us to reorganize the logic that applies the AddCaller and
AddStacktrace options so that it can:

- capture as much of the stack as it needs
- extract caller information from the first frame if we need it
- format that frame and the remaining frames into a stack trace

---

Benchmarking:
I ran the included benchmark before and after the patch with the
following flags and compared the results with benchstat.

```
% go test -run '^$' -bench AddCallerAndStack -benchmem -count 10
[...]
% benchstat before.txt after.txt
name                      old time/op    new time/op    delta
AddCallerAndStacktrace-2    4.68µs ± 7%    3.63µs ± 7%  -22.35%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
AddCallerAndStacktrace-2      632B ± 0%      416B ± 0%  -34.18%  (p=0.000 n=10+10)

name                      old allocs/op  new allocs/op  delta
AddCallerAndStacktrace-2      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)
```

Allocations for this code path are down from 5 to 3, and CPU is down by
~20%.

To check for regressions, I also ran all existing benchmarks with
"Caller" or "Stack" in their names:

```
% go test -run '^$' -bench 'Caller|Stack' -benchmem -count 10
[...]
% benchstat before.txt after.txt
name              old time/op    new time/op    delta
StackField-2        3.28µs ± 2%    3.49µs ± 2%   +6.38%  (p=0.000 n=9+9)
AddCallerHook-2     1.89µs ± 2%    1.92µs ± 3%     ~     (p=0.055 n=10+9)
TakeStacktrace-2    3.17µs ± 1%    3.60µs ± 8%  +13.63%  (p=0.000 n=8+9)

name              old alloc/op   new alloc/op   delta
StackField-2          528B ± 0%      528B ± 0%     ~     (all equal)
AddCallerHook-2       248B ± 0%      240B ± 0%   -3.23%  (p=0.000 n=10+10)
TakeStacktrace-2      416B ± 0%      416B ± 0%     ~     (all equal)

name              old allocs/op  new allocs/op  delta
StackField-2          3.00 ± 0%      3.00 ± 0%     ~     (all equal)
AddCallerHook-2       3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=10+10)
TakeStacktrace-2      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

AddCaller costs one less allocation, but the cost of capturing a stack
trace is *slightly* higher. I think that may be worth it for the 20%+
improvement on Caller+Stack and reduced allocations.
codecov bot commented Jan 25, 2022

Codecov Report

Merging #1052 (4de4fc5) into master (3f20106) will increase coverage by 0.07%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1052      +/-   ##
==========================================
+ Coverage   98.22%   98.29%   +0.07%     
==========================================
  Files          48       48              
  Lines        2080     2111      +31     
==========================================
+ Hits         2043     2075      +32     
+ Misses         29       28       -1     
  Partials        8        8              
Impacted Files Coverage Δ
logger.go 96.52% <100.00%> (+0.29%) ⬆️
stacktrace.go 100.00% <100.00%> (+4.00%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@prashantv (Collaborator) left a comment


Change looks good, and it seems like not reusing CallersFrames actually helps bring performance back to master levels for TakeStacktrace; see #1053.

Comment on lines +102 to +103
} else {
stack.pcs = stack.pcs[:numFrames]
Collaborator


do we need to branch here in the stackTraceFirst case, the loop will just be a no-op right? (or is there a performance difference)

Collaborator Author


Responded offline: Yes, we need the branch here for when numFrames=0

The `runtime.CallersFrames` method doesn't do much other than return a struct pointer wrapping the passed-in `pcs`. Calling this method twice simplifies some of the logic (we don't have to format the first frame separately from the remaining ones) and ends up bringing the performance of `TakeStacktrace` back in line with `master`.

```
# benchstat comparing 3 runs of master vs this branch, each run using benchtime=10s
name            old time/op    new time/op    delta
TakeStacktrace    1.47µs ± 2%    1.48µs ± 2%   ~     (p=1.000 n=3+3)

name            old alloc/op   new alloc/op   delta
TakeStacktrace      496B ± 0%      496B ± 0%   ~     (all equal)

name            old allocs/op  new allocs/op  delta
TakeStacktrace      2.00 ± 0%      2.00 ± 0%   ~     (all equal)
```
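
For reference, a minimal standalone sketch of the two-pass pattern described above: constructing `runtime.CallersFrames` twice over the same `pcs` slice, once to peek at the immediate caller and once to walk the whole stack.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	pcs := make([]uintptr, 64)
	n := runtime.Callers(1, pcs) // skip the runtime.Callers frame itself
	pcs = pcs[:n]

	// First pass: peek at the immediate caller only.
	caller, _ := runtime.CallersFrames(pcs).Next()
	fmt.Println("caller:", caller.Function)

	// Second pass: walk the entire stack. Constructing the iterator again
	// is cheap; CallersFrames only wraps the pcs slice in a *runtime.Frames.
	frames := runtime.CallersFrames(pcs)
	for {
		frame, more := frames.Next()
		fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
}
```
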
@abhinav (Collaborator, Author) commented Jan 26, 2022

@prashantv I've updated the numbers above after incorporating #1053.
Are we sure that's better than the version with sharing of CallersFrames?
Here is how that performs on my work laptop:

name                      old time/op    new time/op    delta
StackField-8                 709ns ± 2%     722ns ± 1%   +1.80%  (p=0.002 n=10+9)
AddCallerHook-8              349ns ± 1%     359ns ± 1%   +2.72%  (p=0.000 n=10+9)
AddCallerAndStacktrace-8     872ns ± 1%     684ns ± 1%  -21.49%  (p=0.000 n=8+10)
TakeStacktrace-8            1.11µs ± 2%    1.21µs ± 0%   +8.83%  (p=0.000 n=10+10)

name                      old alloc/op   new alloc/op   delta
StackField-8                  561B ± 0%      561B ± 0%     ~     (all equal)
AddCallerHook-8               248B ± 0%      240B ± 0%   -3.23%  (p=0.000 n=10+10)
AddCallerAndStacktrace-8      665B ± 0%      449B ± 0%  -32.48%  (p=0.000 n=10+10)
TakeStacktrace-8              496B ± 0%      496B ± 0%     ~     (all equal)

name                      old allocs/op  new allocs/op  delta
StackField-8                  3.00 ± 0%      3.00 ± 0%     ~     (all equal)
AddCallerHook-8               3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=10+10)
AddCallerAndStacktrace-8      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)
TakeStacktrace-8              2.00 ± 0%      2.00 ± 0%     ~     (all equal)

And on my dev server with -count 10 and -benchtime 10s

name                       old time/op    new time/op    delta
StackField-96                 834ns ± 8%     819ns ±14%     ~     (p=0.971 n=10+10)
AddCallerHook-96              466ns ± 7%     480ns ± 1%     ~     (p=0.118 n=10+6)
AddCallerAndStacktrace-96    1.16µs ± 4%    0.78µs ± 8%  -32.60%  (p=0.000 n=9+10)
TakeStacktrace-96            2.66µs ± 2%    2.74µs ± 2%   +3.01%  (p=0.000 n=9+10)

name                       old alloc/op   new alloc/op   delta
StackField-96                  542B ± 0%      544B ± 0%   +0.33%  (p=0.000 n=10+10)
AddCallerHook-96               254B ± 0%      248B ± 0%   -2.36%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      652B ± 0%      429B ± 0%  -34.20%  (p=0.000 n=9+6)
TakeStacktrace-96              420B ± 0%      421B ± 0%   +0.24%  (p=0.000 n=10+10)

name                       old allocs/op  new allocs/op  delta
StackField-96                  3.00 ± 0%      3.00 ± 0%     ~     (all equal)
AddCallerHook-96               3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=10+10)
AddCallerAndStacktrace-96      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)
TakeStacktrace-96              2.00 ± 0%      2.00 ± 0%     ~     (all equal)

@abhinav (Collaborator, Author) commented Jan 26, 2022

Reverting #1053. It does not appear to be an improvement after all.

@sywhang (Contributor) left a comment


LGTM, minus a couple of nits.

stacktrace.go Outdated
pcs []uintptr
// stackFormatter formats a stack trace into a readable string representation.
type stackFormatter struct {
i int // number of frames already written
Contributor


It looks like this can be a bool since we only use this for checking if there were already frames written or not in FormatFrame.

Collaborator Author


replacing with a bool
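
A rough sketch of what the bool-based formatter might look like after this change (the field names and the exact frame layout are illustrative, not necessarily the final code):

```go
// stackFormatter formats a stack trace into a readable string representation.
type stackFormatter struct {
	b        *buffer.Buffer
	nonEmpty bool // whether we have already written at least one frame
}

// FormatFrame formats the given frame, prefixing it with a newline
// when at least one frame has already been written to the buffer.
func (sf *stackFormatter) FormatFrame(frame runtime.Frame) {
	if sf.nonEmpty {
		sf.b.AppendByte('\n')
	}
	sf.nonEmpty = true
	sf.b.AppendString(frame.Function)
	sf.b.AppendByte('\n')
	sf.b.AppendByte('\t')
	sf.b.AppendString(frame.File)
	sf.b.AppendByte(':')
	sf.b.AppendInt(int64(frame.Line))
}
```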

abhinav and others added 2 commits January 27, 2022 08:47
Co-authored-by: Sung Yoon Whang <sungyoon@uber.com>
Track whether we've written at least one frame,
and use that to decide if we need the prefix.
@abhinav abhinav merged commit a1155d9 into master Jan 27, 2022
@abhinav abhinav deleted the abg/share-stack-trace branch January 27, 2022 16:54