Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-allocate response body using Content-Length #1565

Merged
merged 7 commits into from Jan 24, 2022

Conversation

tysonmote
Copy link
Contributor

@tysonmote tysonmote commented Jan 20, 2022

This saves us allocations when the response body is greater than 512 bytes. For large responses, like multi-megabyte Lambda Invoke responses, this can be a very significant number of allocations and slice copies because ioutil.ReadAll / io.ReadAll start with 512 bytes and follow append's growth rules:

I saw this when profiling some code that makes heavy use of the Lambda Invoke method:

147713651-3a352301-f9ec-4597-b9d3-43a7cabff75b

This saves us allocations when the response body is greater than 512
bytes. For large responses, like multi-megabyte Lambda Invoke responses,
this can be a very significant number of allocations and slice copies
because ioutil.ReadAll / io.ReadAll start with 512 bytes and follow
append's growth rules:

* https://cs.opensource.google/go/go/+/refs/tags/go1.17.6:src/io/io.go;drc=dc289d3dcb59f80b9e23c7e8f237628359d21d92;l=627
* https://cs.opensource.google/go/go/%20/master:src/runtime/slice.go;l=166?q=growslice
@tysonmote
Copy link
Contributor Author

I had trouble getting tests to run locally (I'm not a Java person and I couldn't figure out the right incantation of configuration/magic), so I'm hoping the build will surface any issues. :)

@jasdel
Copy link
Contributor

jasdel commented Jan 20, 2022

Thanks for putting this PR together @tysonmote, and adding the concept of pre-allocating buffer for byte slice and string payloads. We'll get this PR reviewed and submit feedback. I've also kicked off the GitHub CI tasks for testing.

Generating the API clients (and protocol test) is currently the best way to validate the codegen module's behavior.

@jasdel
Copy link
Contributor

jasdel commented Jan 20, 2022

Also, out of curiosity was this flamegraph on allocated objects, or time?

@tysonmote
Copy link
Contributor Author

tysonmote commented Jan 20, 2022

The flamegraph is CPU time. This is from a worker that calls Invoke thousands of times per second and then forwards results to another plain-vanilla HTTP API, so it's a bit of a stress test of the AWS SDK request path. In the cases where the Invoke response is large, this causes an outsized amount of work being done to grow and copy the io.ReadAll slice's underlying array. This also causes a lot of GC work, as well, to clean up all those intermediate arrays. To iteratively grow a slice from 512 bytes to 1 MB is roughly 34 array allocation and copy operations.

@jasdel
Copy link
Contributor

jasdel commented Jan 20, 2022

Updated PR to limit contentLength parameter to non-streaming blob or string deserialization functions, and added benchmark for byte slice, and string payloads. Comparing impact of change. Benchmarks used 12MB payloads, which are far larger than most of these operations will use, but exercised the impact proportionally.

Amazon S3 (service/s3) GetBucketPolicy benchmark

name                old time/op    new time/op    delta
GetBucketPolicy-12    12.7ms ± 2%     8.7ms ± 2%  -31.60%  (p=0.000 n=9+9)

name                old alloc/op   new alloc/op   delta
GetBucketPolicy-12    83.4MB ± 0%    50.4MB ± 0%  -39.60%  (p=0.000 n=10+10)

name                old allocs/op  new allocs/op  delta
GetBucketPolicy-12       256 ± 0%       216 ± 0%  -15.49%  (p=0.000 n=10+10)

Amazon EventBridge Schema Registry (service/schemas) GetCodeBindingSource benchmarks:

name                     old time/op    new time/op    delta
GetCodeBindingSource-12    10.8ms ± 3%     7.4ms ± 4%  -31.79%  (p=0.000 n=10+10)

name                     old alloc/op   new alloc/op   delta
GetCodeBindingSource-12    70.8MB ± 0%    37.8MB ± 0%  -46.64%  (p=0.000 n=10+10)

name                     old allocs/op  new allocs/op  delta
GetCodeBindingSource-12       241 ± 0%       200 ± 0%  -17.01%  (p=0.000 n=10+10)

S3 GetBucketPolicy benchmark:

name                old time/op    new time/op    delta
GetBucketPolicy-12    12.7ms ± 2%     8.7ms ± 2%  -31.60%  (p=0.000 n=9+9)

name                old alloc/op   new alloc/op   delta
GetBucketPolicy-12    83.4MB ± 0%    50.4MB ± 0%  -39.60%  (p=0.000 n=10+10)

name                old allocs/op  new allocs/op  delta
GetBucketPolicy-12       256 ± 0%       216 ± 0%  -15.49%  (p=0.000 n=10+10)

Schemas GetCodeBindingSource benchmarks:

name                     old time/op    new time/op    delta
GetCodeBindingSource-12    10.8ms ± 3%     7.4ms ± 4%  -31.79%  (p=0.000 n=10+10)

name                     old alloc/op   new alloc/op   delta
GetCodeBindingSource-12    70.8MB ± 0%    37.8MB ± 0%  -46.64%  (p=0.000 n=10+10)

name                     old allocs/op  new allocs/op  delta
GetCodeBindingSource-12       241 ± 0%       200 ± 0%  -17.01%  (p=0.000 n=10+10)
@jasdel jasdel merged commit a1bf7dd into aws:main Jan 24, 2022
jrichard8 pushed a commit to jrichard8/aws-sdk-go-v2 that referenced this pull request Feb 14, 2022
Updates SDK API client deserialization to pre-allocate byte slice and string
response payloads.

This saves us allocations when the response body is greater than 512
bytes. For large responses, like multi-megabyte Lambda Invoke responses,
this can be a very significant number of allocations and slice copies
because ioutil.ReadAll / io.ReadAll start with 512 bytes and follow
append's growth rules:

* https://cs.opensource.google/go/go/+/refs/tags/go1.17.6:src/io/io.go;drc=dc289d3dcb59f80b9e23c7e8f237628359d21d92;l=627
* https://cs.opensource.google/go/go/%20/master:src/runtime/slice.go;l=166?q=growslice

S3 GetBucketPolicy benchmark:

name                old time/op    new time/op    delta
GetBucketPolicy-12    12.7ms ± 2%     8.7ms ± 2%  -31.60%  (p=0.000 n=9+9)

name                old alloc/op   new alloc/op   delta
GetBucketPolicy-12    83.4MB ± 0%    50.4MB ± 0%  -39.60%  (p=0.000 n=10+10)

name                old allocs/op  new allocs/op  delta
GetBucketPolicy-12       256 ± 0%       216 ± 0%  -15.49%  (p=0.000 n=10+10)

Schemas GetCodeBindingSource benchmarks:

name                     old time/op    new time/op    delta
GetCodeBindingSource-12    10.8ms ± 3%     7.4ms ± 4%  -31.79%  (p=0.000 n=10+10)

name                     old alloc/op   new alloc/op   delta
GetCodeBindingSource-12    70.8MB ± 0%    37.8MB ± 0%  -46.64%  (p=0.000 n=10+10)

name                     old allocs/op  new allocs/op  delta
GetCodeBindingSource-12       241 ± 0%       200 ± 0%  -17.01%  (p=0.000 n=10+10)

* Add changelog

Co-authored-by: Jason Del Ponte <961963+jasdel@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants