If we use the `AddCaller(true)` option for a logger together with
`AddStacktrace(level)`, which captures a stack trace for the specified
level or higher, Zap currently inspects the call stack twice:
- `getCallerFrame` retrieves information about the immediate caller of
the log entry
- `StackSkip` calls `takeStacktrace` to capture the entire stack trace
and build a string from it
When both the caller and the stack trace are requested,
we can avoid this redundant work by sharing the captured call stack
between the two code paths.
To accomplish this, the following high-level pieces were added:
```
type stacktrace
captureStacktrace(skip int, depth stacktraceDepth) *stacktrace
func (*stacktrace) Next() (runtime.Frame, bool)

type stackFormatter
func newStackFormatter(*buffer.Buffer)
func (*stackFormatter) FormatStack(*stacktrace)
func (*stackFormatter) FormatFrame(runtime.Frame)
```
`captureStacktrace` captures a stack trace (either just one frame or the
entire stack). The result can then be inspected manually (to extract
caller information) and formatted into a string representation with
`stackFormatter`, either one frame at a time or wholesale.
This allows us to reorganize the logic that applies the `AddCaller` and
`AddStacktrace` options so that it can:
- capture as much of the stack as it needs
- extract caller information from the first frame if we need it
- format that frame and the remaining frames into a stack trace
---
Benchmarking:
I ran the included benchmark before and after the patch with the
following flags and compared the results with benchstat.
```
% go test -run '^$' -bench AddCallerAndStack -benchmem -count 10
[...]
% benchstat before.txt after.txt
name old time/op new time/op delta
AddCallerAndStacktrace-2 4.68µs ± 7% 3.63µs ± 7% -22.35% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
AddCallerAndStacktrace-2 632B ± 0% 416B ± 0% -34.18% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
AddCallerAndStacktrace-2 5.00 ± 0% 3.00 ± 0% -40.00% (p=0.000 n=10+10)
```
Allocations for this code path are down from 5 to 3, and CPU time is
down by ~22%.
To check for regressions, I also ran all existing benchmarks with
"Caller" or "Stack" in their names:
```
% go test -run '^$' -bench 'Caller|Stack' -benchmem -count 10
[...]
% benchstat before.txt after.txt
name old time/op new time/op delta
StackField-2 3.28µs ± 2% 3.49µs ± 2% +6.38% (p=0.000 n=9+9)
AddCallerHook-2 1.89µs ± 2% 1.92µs ± 3% ~ (p=0.055 n=10+9)
TakeStacktrace-2 3.17µs ± 1% 3.60µs ± 8% +13.63% (p=0.000 n=8+9)
name old alloc/op new alloc/op delta
StackField-2 528B ± 0% 528B ± 0% ~ (all equal)
AddCallerHook-2 248B ± 0% 240B ± 0% -3.23% (p=0.000 n=10+10)
TakeStacktrace-2 416B ± 0% 416B ± 0% ~ (all equal)
name old allocs/op new allocs/op delta
StackField-2 3.00 ± 0% 3.00 ± 0% ~ (all equal)
AddCallerHook-2 3.00 ± 0% 2.00 ± 0% -33.33% (p=0.000 n=10+10)
TakeStacktrace-2 2.00 ± 0% 2.00 ± 0% ~ (all equal)
```
AddCaller costs one fewer allocation, but the cost of capturing a full
stack trace is slightly higher. I think that trade-off is worth it for
the 20%+ CPU improvement and reduced allocations on the combined
Caller+Stack path.