New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slice: use reflect.Append
instead of pre-allocating whole slice.
#7
Conversation
@terorie pls take a look next week |
decoder.go
Outdated
|
||
func reflect_readArrayOfBytes(append bool, d *Decoder, l int, rv reflect.Value) error { | ||
for i := 0; i < l; i++ { | ||
n, err := d.ReadByte() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can save a lot of bounds checking here by
- doing a bounds check for the entire array
- allocating the entire slice
- then filling it
Same for all the other primitive arrays and slices
reflect.Append
is worse than pre-allocation when the slice size is known because the end result is the same (in terms of memory usage), but the former is less efficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more optimization ideas:
- On little endian machines without alignment requirements, no conversion is necessary for integer primitive types like uint16`. We could simply unsafely copy into the backing memory buffer of a slice.
- In all simple copy cases, allocating a slice isn't even necessary – The decoder could unsafely construct a slice that points to a place in the serialized buffer
Thanks! Before:
After:
(NOTE: the name poorly reflects what happens inside in the AFTER) |
Okay, so now for slices of uints:
|
Ooops, wrong type in the benchmark. |
Final benchmark: BEFORE optimizing uint slices (underlying use of reflect.Append)goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8 3636 326426 ns/op 131864 B/op 5137 allocs/op
Benchmark_uintSlice64_Decode_make-8 3724 308653 ns/op 131864 B/op 5137 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8 3643 302873 ns/op 132008 B/op 5143 allocs/op
Benchmark_uintSlice64_Decode_field_make-8 3644 303124 ns/op 132008 B/op 5143 allocs/op
Benchmark_uintSlice64_reflect_noMake-8 24592 50456 ns/op 16480 B/op 1028 allocs/op
Benchmark_uintSlice64_reflect_make-8 24235 49264 ns/op 16480 B/op 1028 allocs/op
Benchmark_uintSlice32_Decode_noMake-8 4402 246662 ns/op 107296 B/op 4113 allocs/op
Benchmark_uintSlice32_Decode_make-8 4700 251511 ns/op 107296 B/op 4113 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8 4592 253552 ns/op 107440 B/op 4119 allocs/op
Benchmark_uintSlice32_Decode_field_make-8 4332 256575 ns/op 107440 B/op 4119 allocs/op
Benchmark_uintSlice32_reflect_noMake-8 173953 6631 ns/op 4192 B/op 4 allocs/op
Benchmark_uintSlice32_reflect_make-8 165073 6599 ns/op 4192 B/op 4 allocs/op
PASS
ok github.com/gagliardetto/binary 15.178s BEFORE optimizing uint slices (underlying use of reflect.MakeSlice + rv.Index(i).Set)goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8 5041 244716 ns/op 90272 B/op 4102 allocs/op
Benchmark_uintSlice64_Decode_make-8 5509 216908 ns/op 90272 B/op 4102 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8 5554 217488 ns/op 90416 B/op 4108 allocs/op
Benchmark_uintSlice64_Decode_field_make-8 4992 227863 ns/op 90416 B/op 4108 allocs/op
Benchmark_uintSlice64_reflect_noMake-8 22930 53717 ns/op 16480 B/op 1028 allocs/op
Benchmark_uintSlice64_reflect_make-8 23380 49189 ns/op 16480 B/op 1028 allocs/op
Benchmark_uintSlice32_Decode_noMake-8 6961 166826 ns/op 73888 B/op 3078 allocs/op
Benchmark_uintSlice32_Decode_make-8 6466 169598 ns/op 73888 B/op 3078 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8 6379 166451 ns/op 74032 B/op 3084 allocs/op
Benchmark_uintSlice32_Decode_field_make-8 6475 170186 ns/op 74032 B/op 3084 allocs/op
Benchmark_uintSlice32_reflect_noMake-8 179046 6314 ns/op 4192 B/op 4 allocs/op
Benchmark_uintSlice32_reflect_make-8 169556 6537 ns/op 4192 B/op 4 allocs/op
PASS
ok github.com/gagliardetto/binary 18.773s AFTER optimizing for uint slices$ go test -bench "uintSlice" -benchmem
goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8 22210 52099 ns/op 16544 B/op 1030 allocs/op
Benchmark_uintSlice64_Decode_make-8 22062 54300 ns/op 16544 B/op 1030 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8 21668 54316 ns/op 16688 B/op 1036 allocs/op
Benchmark_uintSlice64_Decode_field_make-8 22428 54018 ns/op 16688 B/op 1036 allocs/op
Benchmark_uintSlice64_reflect_noMake-8 22785 53440 ns/op 16480 B/op 1028 allocs/op
Benchmark_uintSlice64_reflect_make-8 22976 52906 ns/op 16480 B/op 1028 allocs/op
Benchmark_uintSlice32_Decode_noMake-8 149244 7155 ns/op 4256 B/op 6 allocs/op
Benchmark_uintSlice32_Decode_make-8 160652 7149 ns/op 4256 B/op 6 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8 151970 7551 ns/op 4400 B/op 12 allocs/op
Benchmark_uintSlice32_Decode_field_make-8 151761 7708 ns/op 4400 B/op 12 allocs/op
Benchmark_uintSlice32_reflect_noMake-8 160108 7006 ns/op 4192 B/op 4 allocs/op
Benchmark_uintSlice32_reflect_make-8 154717 7048 ns/op 4192 B/op 4 allocs/op
PASS
ok github.com/gagliardetto/binary 17.686s |
If this looks good, I'll proceed tomorrow to merge. @terorie |
After simplifying goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8 109407 10240 ns/op 8352 B/op 6 allocs/op
Benchmark_uintSlice64_Decode_make-8 104725 10590 ns/op 8352 B/op 6 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8 106695 10673 ns/op 8496 B/op 12 allocs/op
Benchmark_uintSlice64_Decode_field_make-8 109975 11053 ns/op 8496 B/op 12 allocs/op
Benchmark_uintSlice64_readArray_noMake-8 109645 10126 ns/op 8288 B/op 4 allocs/op
Benchmark_uintSlice64_readArray_make-8 104791 10162 ns/op 8288 B/op 4 allocs/op
Benchmark_uintSlice64_Decode_field_withCustomDecoder-8 110866 10326 ns/op 8240 B/op 2 allocs/op
Benchmark_uintSlice32_Decode_noMake-8 162374 7324 ns/op 4256 B/op 6 allocs/op
Benchmark_uintSlice32_Decode_make-8 140947 7607 ns/op 4256 B/op 6 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8 130076 7981 ns/op 4400 B/op 12 allocs/op
Benchmark_uintSlice32_Decode_field_make-8 127812 8376 ns/op 4400 B/op 12 allocs/op
Benchmark_uintSlice32_readArray_noMake-8 150466 7251 ns/op 4192 B/op 4 allocs/op
Benchmark_uintSlice32_readArray_make-8 160896 6962 ns/op 4192 B/op 4 allocs/op
Benchmark_uintSlice32_Decode_field_withCustomDecoder-8 151675 7046 ns/op 4144 B/op 2 allocs/op |
No description provided.