Add Unmarshal perf API to read without allocation #650
Conversation
Hey Bryce, thanks for giving this another spin. I'm a bit concerned about the bigger picture here. I see 210 extra lines of code to maintain for a 0.3% execution speed improvement, with total alloc size improvements in the same order of magnitude. Golfing total allocs/op should not be the goal; I think we need something more impactful than this to consider it worth it. A few thoughts:
It might be hard to show the gains from this at a library level. If you want, I could incorporate this change into our agent and put it through our load testing environment.
The main benefit is that the user of the library has total control of the allocation behavior.
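The "caller controls allocation" idea can be sketched in isolation. The names below (`readInto`, the buffer handling) are hypothetical illustrations, not the PR's actual API; the point is that the caller supplies and reuses the buffer, so the library allocates nothing per read:

```go
package main

import (
	"fmt"
	"testing" // AllocsPerRun is usable outside test binaries
)

// readInto copies the next sample into buf, which the caller owns and
// may reuse across calls. Hypothetical sketch of the pattern under
// discussion, not the library's code.
func readInto(src []byte, buf []byte) int {
	return copy(buf, src)
}

func main() {
	sample := []byte("payload from the ring buffer")
	buf := make([]byte, 64) // allocated once, reused for every read

	allocs := testing.AllocsPerRun(100, func() {
		n := readInto(sample, buf)
		_ = buf[:n]
	})
	fmt.Println("allocs per read:", allocs)
}
```

Because the caller owns `buf`, it decides whether to reuse, pool, or grow it; `testing.AllocsPerRun` confirms the read itself performs zero heap allocations.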
I'm using the
The goal here was reducing allocs, not necessarily speed gains.
Sure!
Force-pushed from 6f299b1 to a382158.
I went ahead and did this. By giving us control of allocations, we are able to save >15% of our RSS/working set memory.
I agree with Timo that the change is pretty intrusive for what is superficially a small gain. Seems like my idea wasn't particularly good, sorry. I have some questions to help me understand what's going on:
Because we are processing data on a continuous basis, excess allocations on the hot path cause the heap size to grow between GC runs. This causes the GC to raise the NextGC target, which ultimately leads to a higher steady-state RSS for our service. Reducing allocations has a significant impact, and we have already seen improvements from other strategies like string interning and using the new stack-based IP address type. The allocations from the perf
It is a continuous environment in which we send 3000 HTTP requests/sec to a host. That host runs multiple versions of our eBPF-based agent, so we can compare apples to apples with regard to the network traffic they see.
It would be difficult.
We measure it every 20s, and plot it. It is not a flat line, but seems to oscillate around a value. We run the experiment for a long time (> 1 hour) to make sure spurious behaviors do not influence the analysis.
No significant changes to CPU for us.
I can take a look, but GC pauses generally aren't the problem for us. This isn't a web server where a GC pause will cause the 99th percentile latency to spike.
We haven't yet, but this is something we will likely do very soon.
I took another look; I think my idea to use an "unmarshal into interface{}" API was bad. Sorry for sending you on a wild goose chase. Another thing I noticed is that the benchmark is dominated by submitting samples into the perf ring. This explains why the gains look small, when in reality they are quite sizeable for the work we actually care about. Can you please give https://github.com/lmb/ebpf/tree/reader-read-buffer a try in your test environment and see if it has a similar effect? It's got the same basic optimizations you did, except that I avoided
There is an example that shows how to use it, if it's not clear. The benefit of this API is that we can return
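The read-into-a-reusable-record idea behind the branch can be sketched standalone. The types below only mirror the shape of the approach (a caller-owned record whose backing buffer is reused when capacity allows); they are not the library's actual implementation:

```go
package main

import "fmt"

// Record holds one sample; its RawSample backing array is reused
// across reads when it is large enough. Standalone sketch only.
type Record struct {
	RawSample []byte
}

// Reader serves canned samples standing in for a perf ring.
type Reader struct {
	ring [][]byte
	next int
}

// ReadInto copies the next sample into rec.RawSample, growing the
// buffer only when the sample does not fit in the existing capacity.
func (r *Reader) ReadInto(rec *Record) error {
	if r.next >= len(r.ring) {
		return fmt.Errorf("no more samples")
	}
	sample := r.ring[r.next]
	r.next++
	if cap(rec.RawSample) < len(sample) {
		rec.RawSample = make([]byte, len(sample))
	}
	rec.RawSample = rec.RawSample[:len(sample)]
	copy(rec.RawSample, sample)
	return nil
}

func main() {
	r := &Reader{ring: [][]byte{[]byte("first"), []byte("again")}}
	var rec Record // one Record, reused for every read
	for r.ReadInto(&rec) == nil {
		fmt.Printf("%s\n", rec.RawSample)
	}
}
```

Because the caller keeps the same `Record` across iterations, steady-state reads of similar-sized samples allocate nothing.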
I'm testing this out today and will report back.
Tested out your branch; we get similar gains from it. If you are more comfortable taking your approach, then let's go for it.
Thanks for being patient and explaining the problem to me.
before:
after:
The one alloc is from `prog.Test` and is only present during the benchmark.