
Add Unmarshal perf API to read without allocation #650

Closed

Conversation

brycekahle
Contributor

before:

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf/perf
BenchmarkReader
BenchmarkReader-4   	  293677	      3626 ns/op	     409 B/op	       6 allocs/op

after:

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf/perf
BenchmarkReader
BenchmarkReader-4              	  311409	      3507 ns/op	     384 B/op	       2 allocs/op
BenchmarkUnmarshalSlice
BenchmarkUnmarshalSlice-4      	  310951	      3667 ns/op	     288 B/op	       1 allocs/op
BenchmarkUnmarshalerSimple
BenchmarkUnmarshalerSimple-4   	  304730	      3669 ns/op	     288 B/op	       1 allocs/op

The one remaining alloc is from prog.Test and is only present during the benchmark.

@ti-mo
Collaborator

ti-mo commented May 2, 2022

Hey Bryce, thanks for giving this another spin. I'm a bit concerned about the bigger picture here. I see 210 extra lines of code to maintain for a 0.3% execution speed improvement, with total alloc size improvements in the same order of magnitude. Golfing total allocs/op should not be the goal; I think we need something more impactful than this to consider it worth it.

A few thoughts:

  • Using a sync.Pool for storing integers seems rather counterproductive. I suspect this undoes the gains from adding custom unmarshalers.
  • Eliminating calls to binary.Read in favour of implementing BinaryUnmarshalers should result in execution speed gains, but only noticeably on complex types, since those require reflection. binary.Read is relatively fast on builtins, so implementing unmarshalers for those might not be worth the LoC. Ideally, this should be benchmarked in isolation to make sure (a sketch of such an unmarshaler follows this list).
  • Could the benchmarks and changes to tests be split off into separate commits?
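For illustration, here is a minimal sketch of the kind of hand-written unmarshaler being discussed, which skips the reflection path binary.Read takes for struct types. The event type and its fields are invented for this example and are not part of the PR:

package main

import (
	"encoding/binary"
	"io"
)

// sampleEvent is an illustrative fixed-layout event, not a type from this PR.
type sampleEvent struct {
	PID   uint32
	Bytes uint64
}

// UnmarshalBinary decodes the event by hand, satisfying encoding.BinaryUnmarshaler
// without the per-field reflection that binary.Read performs on structs.
func (e *sampleEvent) UnmarshalBinary(data []byte) error {
	if len(data) < 12 {
		return io.ErrUnexpectedEOF
	}
	e.PID = binary.LittleEndian.Uint32(data[0:4])
	e.Bytes = binary.LittleEndian.Uint64(data[4:12])
	return nil
}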

@brycekahle
Contributor Author

Golfing total allocs/op should not be the goal

It might be hard to show the gains from this at a library level. If you want, I could incorporate this change into our agent and put it through our load testing environment.

with total alloc size improvements in the same order of magnitude.

BenchmarkReader is maintaining the previous behavior, which still does a copy and alloc in this library. Using Unmarshal results in 0 allocs/op from the library itself, which is a gain of 96 bytes/op and 5 allocs/op.

The main benefit is that the user of the library has total control of the allocation behavior.

Using a sync.Pool for storing integers seems rather counterproductive. I suspect this undoes the gains gotten by adding custom unmarshalers.

I'm using the sync.Pool for the slice that is necessary to read the uint32. This is because it can span the end of the circular buffer, so there is not an easy stack-based way to read the value (my other PR did this, but didn't account for the end of the buffer).
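For context, a minimal sketch of that pattern, assuming the ring is exposed as a plain []byte; the names are illustrative and this is not the PR's actual code:

package main

import (
	"encoding/binary"
	"sync"
)

// scratchPool hands out reusable 4-byte scratch buffers.
var scratchPool = sync.Pool{
	New: func() interface{} { b := make([]byte, 4); return &b },
}

// readRingUint32 reads a little-endian uint32 starting at off in the ring,
// copying byte-by-byte through a pooled buffer because the four bytes may
// wrap around the end of the circular buffer.
func readRingUint32(ring []byte, off int) uint32 {
	bp := scratchPool.Get().(*[]byte)
	defer scratchPool.Put(bp)
	buf := *bp
	for i := range buf {
		buf[i] = ring[(off+i)%len(ring)]
	}
	return binary.LittleEndian.Uint32(buf)
}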

Eliminating calls to binary.Read in favour of implementing BinaryUnmarshalers should result in execution speed gains

The goal here was reducing allocs, not necessarily speed gains.

Could the benchmarks and changes to tests be split off into separate commits?

Sure!

@brycekahle force-pushed the bryce.kahle/perf-unmarshal-api branch from 6f299b1 to a382158 on May 2, 2022 16:32
@brycekahle
Contributor Author

If you want, I could incorporate this change into our agent and put it through our load testing environment.

I went ahead and did this. By giving us control of allocations, we are able to save >15% of our RSS/working set memory.

@lmb
Collaborator

lmb commented May 3, 2022

I agree with Timo that the change is pretty intrusive for what is superficially a small gain. Seems like my idea wasn't particularly good, sorry. I have some questions to help me understand what's going on:

  • What does your load testing look like? Is it "max throughput" with given resources, or does it simulate a prod deployment?
  • Is there a way for us to reproduce your load testing?
  • At which point did you take RSS, some steady state? Did you average it or something similar?
  • Are you seeing CPU usage go down as well? Any changes to GC pause metrics, or how frequently the GC runs?
  • Do you tweak GOGC or the like?

@brycekahle
Contributor Author

Because we are processing data on a continuous basis, excess allocations on the hot path cause the heap size to increase between GC runs. This causes the GC to increase the NextGC value, and ultimately leads to a higher steady-state RSS for our service. Reducing allocations has a significant impact, and we have already seen improvements from other strategies like string interning, using the new stack-based IP address type netaddr.IP, using sync.Pools, and others.
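As an aside, a minimal sketch of the string-interning strategy mentioned above (illustrative only, not our agent's actual code):

// byteInterner returns a canonical string for each distinct byte sequence,
// so repeated values parsed on the hot path share a single allocation.
type byteInterner struct {
	seen map[string]string
}

func newByteInterner() *byteInterner {
	return &byteInterner{seen: make(map[string]string)}
}

func (bi *byteInterner) Intern(b []byte) string {
	// The compiler recognizes map lookups keyed by string(b) and avoids
	// allocating a temporary string for the conversion.
	if s, ok := bi.seen[string(b)]; ok {
		return s
	}
	s := string(b)
	bi.seen[s] = s
	return s
}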

The allocations from the perf Read are not the biggest slice of the pie, but right now we have no ability to control the allocation behavior (without forking, something we'd like to avoid, if possible).

  • What does your load testing look like? Is it "max throughput" with given resources, or does it simulate a prod deployment?

It is a continuous environment where we are sending 3000 HTTP requests/sec at a host. That host is running multiple versions of our ebpf-based agent, so we can compare apples to apples with regard to the network traffic that they see.

  • Is there a way for us to reproduce your load testing?

It would be difficult.

  • At which point did you take RSS, some steady state? Did you average it or something similar?

We measure it every 20s, and plot it. It is not a flat line, but seems to oscillate around a value. We run the experiment for a long time (> 1 hour) to make sure spurious behaviors do not influence the analysis.

  • Are you seeing CPU usage go down as well?

No significant changes to CPU for us.

Any changes to GC pause metrics, or how frequently the GC runs?

I can take a look, but GC pauses generally aren't the problem for us. This isn't a web server where a GC pause will cause the 99th percentile latency to spike.

  • Do you tweak GOGC or the like?

We haven't yet, but this is something we will likely do very soon.

@lmb
Collaborator

lmb commented May 4, 2022

I took another look, and I think my idea to use an "unmarshal into interface{}" API was bad. Sorry for sending you on a wild goose chase. Another thing I noticed is that the benchmark is dominated by submitting samples into the perf ring. This explains why the gains look small, when in reality they are quite sizeable for the work we actually care about.

Can you please give https://github.com/lmb/ebpf/tree/reader-read-buffer a try in your test environment and see if it has a similar effect? It has the same basic optimizations you made, except that I avoided sync.Pool in favor of a small preallocated slice in Reader. It also has a new method on Reader:

func (pr *Reader) ReadBuffer(buf []byte) (Record, error)
    ReadBuffer is like Read except that it allows reusing buffers.

    buf is used as Record.RawSample if it is large enough to hold the sample
    data. If buf is too small a new buffer will be allocated instead. It is
    valid to pass nil, in which case ReadBuffer behaves like Read.

There is an example that shows how to use it, if it's not clear. The benefit of this API is that we can return Record, error instead of cpu int, lost int, error like for Unmarshal. We also don't have to get into the business of unmarshaling things in the first place.
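For readers following the thread, a usage sketch assuming the ReadBuffer signature quoted above from that branch; handleSample is a placeholder consumer, not part of the library:

import (
	"errors"
	"log"

	"github.com/cilium/ebpf/perf"
)

// readLoop drains rd with a single reusable buffer. Record.RawSample points
// into buf whenever the sample fits, so each sample must be consumed (or
// copied) before the next ReadBuffer call overwrites it.
func readLoop(rd *perf.Reader, handleSample func([]byte)) {
	buf := make([]byte, 4096)
	for {
		record, err := rd.ReadBuffer(buf)
		if err != nil {
			if errors.Is(err, perf.ErrClosed) {
				return
			}
			log.Printf("reading from perf reader: %s", err)
			continue
		}
		handleSample(record.RawSample)
	}
}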

@brycekahle
Contributor Author

Can you please give https://github.com/lmb/ebpf/tree/reader-read-buffer a try in your test environment and see if it has a similar effect?

I'm testing this out today and will report back.

@brycekahle
Contributor Author

I tested out your branch, and we get similar gains from it. If you are more comfortable taking your approach, then let's go for it.

@lmb
Collaborator

lmb commented May 6, 2022

Thanks for being patient and explaining the problem to me. ReadBuffer is the way to go if it satisfies your needs, since it's less code overall. Please take a look at #663.

@lmb closed this on May 6, 2022