
Add Unmarshal perf API to read without allocation #650

Closed

Conversation

brycekahle
Contributor

before:

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf/perf
BenchmarkReader
BenchmarkReader-4   	  293677	      3626 ns/op	     409 B/op	       6 allocs/op

after:

goos: linux
goarch: arm64
pkg: github.com/cilium/ebpf/perf
BenchmarkReader
BenchmarkReader-4              	  311409	      3507 ns/op	     384 B/op	       2 allocs/op
BenchmarkUnmarshalSlice
BenchmarkUnmarshalSlice-4      	  310951	      3667 ns/op	     288 B/op	       1 allocs/op
BenchmarkUnmarshalerSimple
BenchmarkUnmarshalerSimple-4   	  304730	      3669 ns/op	     288 B/op	       1 allocs/op

The one remaining alloc is from prog.Test and is only present during the benchmark.

@ti-mo
Collaborator

ti-mo commented May 2, 2022

Hey Bryce, thanks for giving this another spin. I'm a bit concerned about the bigger picture here. I see 210 extra lines of code to maintain for a 0.3% execution speed improvement, with total alloc size improvements in the same order of magnitude. Golfing total allocs/op should not be the goal; I think we need something more impactful than this to consider it worth it.

A few thoughts:

  • Using a sync.Pool for storing integers seems rather counterproductive. I suspect this undoes the gains from adding custom unmarshalers.
  • Eliminating calls to binary.Read in favour of implementing BinaryUnmarshalers should result in execution speed gains, but only noticeably on complex types, since those require reflection. binary.Read is relatively fast on builtins, so implementing unmarshalers for those might not be worth the LoC. Ideally, this should be benchmarked in isolation to make sure (a sketch of such an unmarshaler follows this list).
  • Could the benchmarks and changes to tests be split off into separate commits?
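For illustration, here is a minimal sketch of the kind of hand-written unmarshaler being discussed, which skips the reflection path binary.Read takes for struct types. The event type and its fields are invented for this example and are not part of the PR:

package main

import (
	"encoding/binary"
	"io"
)

// sampleEvent is an illustrative fixed-layout event, not a type from this PR.
type sampleEvent struct {
	PID   uint32
	Bytes uint64
}

// UnmarshalBinary decodes the event by hand, satisfying encoding.BinaryUnmarshaler
// without the per-field reflection that binary.Read performs on structs.
func (e *sampleEvent) UnmarshalBinary(data []byte) error {
	if len(data) < 12 {
		return io.ErrUnexpectedEOF
	}
	e.PID = binary.LittleEndian.Uint32(data[0:4])
	e.Bytes = binary.LittleEndian.Uint64(data[4:12])
	return nil
}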

@brycekahle
Contributor Author

Golfing total allocs/op should not be the goal

It might be hard to show the gains from this at a library level. If you want, I could incorporate this change into our agent and put it through our load testing environment.

with total alloc size improvements in the same order of magnitude.

BenchmarkReader is maintaining the previous behavior, which still does a copy and alloc in this library. Using Unmarshal results in 0 allocs/op from the library itself, which is a gain of 96 bytes/op and 5 allocs/op.

The main benefit is that the user of the library has total control of the allocation behavior.

Using a sync.Pool for storing integers seems rather counterproductive. I suspect this undoes the gains gotten by adding custom unmarshalers.

I'm using the sync.Pool for the slice that is necessary to read the uint32. This is because it can span the end of the circular buffer, so there is not an easy stack-based way to read the value (my other PR did this, but didn't account for the end of the buffer).
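For context, a minimal sketch of that pattern, assuming the ring is exposed as a plain []byte; the names are illustrative and this is not the PR's actual code:

package main

import (
	"encoding/binary"
	"sync"
)

// scratchPool hands out reusable 4-byte scratch buffers.
var scratchPool = sync.Pool{
	New: func() interface{} { b := make([]byte, 4); return &b },
}

// readRingUint32 reads a little-endian uint32 starting at off in the ring,
// copying byte-by-byte through a pooled buffer because the four bytes may
// wrap around the end of the circular buffer.
func readRingUint32(ring []byte, off int) uint32 {
	bp := scratchPool.Get().(*[]byte)
	defer scratchPool.Put(bp)
	buf := *bp
	for i := range buf {
		buf[i] = ring[(off+i)%len(ring)]
	}
	return binary.LittleEndian.Uint32(buf)
}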

Eliminating calls to binary.Read in favour of implementing BinaryUnmarshalers should result in execution speed gains

The goal here was reducing allocs, not necessarily speed gains.

Could the benchmarks and changes to tests be split off into separate commits?

Sure!

@brycekahle force-pushed the bryce.kahle/perf-unmarshal-api branch from 6f299b1 to a382158 on May 2, 2022 16:32
@brycekahle
Contributor Author

If you want, I could incorporate this change into our agent and put it through our load testing environment.

I went ahead and did this. By giving us control of allocations, we are able to save >15% of our RSS/working set memory.

@lmb
Collaborator

lmb commented May 3, 2022

I agree with Timo that the change is pretty intrusive for what is superficially a small gain. Seems like my idea wasn't particularly good, sorry. I have some questions to help me understand what's going on:

  • What does your load testing look like? Is it "max throughput" with given resources, or does it simulate a prod deployment?
  • Is there a way for us to reproduce your load testing?
  • At which point did you take RSS, some steady state? Did you average it or something similar?
  • Are you seeing CPU usage go down as well? Any changes to GC pause metrics, or how frequently the GC runs?
  • Do you tweak GOGC or the like?

@brycekahle
Contributor Author

Because we are processing data on a continuous basis, excess allocations on the hot path cause the heap size to increase between GC runs. This causes the GC to increase the NextGC value, and ultimately leads to a higher steady-state RSS for our service. Reducing allocations has a significant impact, and we have already seen improvements from other strategies like string interning, using the new stack-based IP address type netaddr.IP, using sync.Pools, and others.
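As an aside, a minimal sketch of the string-interning strategy mentioned above (illustrative only, not our agent's actual code):

// byteInterner returns a canonical string for each distinct byte sequence,
// so repeated values parsed on the hot path share a single allocation.
type byteInterner struct {
	seen map[string]string
}

func newByteInterner() *byteInterner {
	return &byteInterner{seen: make(map[string]string)}
}

func (bi *byteInterner) Intern(b []byte) string {
	// The compiler recognizes map lookups keyed by string(b) and avoids
	// allocating a temporary string for the conversion.
	if s, ok := bi.seen[string(b)]; ok {
		return s
	}
	s := string(b)
	bi.seen[s] = s
	return s
}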

The allocations from the perf Read are not the biggest slice of the pie, but right now we have no ability to control the allocation behavior (without forking, something we'd like to avoid, if possible).

  • What does your load testing look like? Is it "max throughput" with given resources, or does it simulate a prod deployment?

It is a continuous environment where we are sending 3000 HTTP requests/sec at a host. That host is running multiple versions of our ebpf-based agent, so we can compare apples to apples with regard to the network traffic that they see.

  • Is there a way for us to reproduce your load testing?

It would be difficult.

  • At which point did you take RSS, some steady state? Did you average it or something similar?

We measure it every 20s, and plot it. It is not a flat line, but seems to oscillate around a value. We run the experiment for a long time (> 1 hour) to make sure spurious behaviors do not influence the analysis.

  • Are you seeing CPU usage go down as well?

No significant changes to CPU for us.

Any changes to GC pause metrics, or how frequently the GC runs?

I can take a look, but GC pauses generally aren't the problem for us. This isn't a web server where a GC pause will cause the 99th percentile latency to spike.

  • Do you tweak GOGC or the like?

We haven't yet, but this is something we will likely do very soon.

@lmb
Collaborator

lmb commented May 4, 2022

I took another look, and I think my idea to use an "unmarshal into interface{}" API was bad. Sorry for sending you on a wild goose chase. Another thing I noticed is that the benchmark is dominated by submitting samples into the perf ring. This explains why the gains look small, when in reality they are quite sizeable for the work we actually care about.

Can you please give https://github.com/lmb/ebpf/tree/reader-read-buffer a try in your test environment and see if it has a similar effect? It has the same basic optimizations you made, except that I avoided sync.Pool in favor of a small preallocated slice in Reader. It also has a new method on Reader:

func (pr *Reader) ReadBuffer(buf []byte) (Record, error)
    ReadBuffer is like Read except that it allows reusing buffers.

    buf is used as Record.RawSample if it is large enough to hold the sample
    data. If buf is too small a new buffer will be allocated instead. It is
    valid to pass nil, in which case ReadBuffer behaves like Read.

There is an example that shows how to use it, if it's not clear. The benefit of this API is that we can return Record, error instead of cpu int, lost int, error like for Unmarshal. We also don't have to get into the business of unmarshaling things in the first place.
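For readers following the thread, a usage sketch assuming the ReadBuffer signature quoted above from that branch; handleSample is a placeholder consumer, not part of the library:

import (
	"errors"
	"log"

	"github.com/cilium/ebpf/perf"
)

// readLoop drains rd with a single reusable buffer. Record.RawSample points
// into buf whenever the sample fits, so each sample must be consumed (or
// copied) before the next ReadBuffer call overwrites it.
func readLoop(rd *perf.Reader, handleSample func([]byte)) {
	buf := make([]byte, 4096)
	for {
		record, err := rd.ReadBuffer(buf)
		if err != nil {
			if errors.Is(err, perf.ErrClosed) {
				return
			}
			log.Printf("reading from perf reader: %s", err)
			continue
		}
		handleSample(record.RawSample)
	}
}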

@brycekahle
Contributor Author

Can you please give https://github.com/lmb/ebpf/tree/reader-read-buffer a try in your test environment and see if it has a similar effect?

I'm testing this out today and will report back.

@brycekahle
Contributor Author

I tested out your branch, and we get similar gains from it. If you are more comfortable taking your approach, then let's go for it.

@lmb
Collaborator

lmb commented May 6, 2022

Thanks for being patient and explaining the problem to me. ReadBuffer is the way to go if it satisfies your needs, since it's less code overall. Please take a look at #663.

@lmb closed this on May 6, 2022