Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slice: use reflect.Append instead of pre-allocating whole slice. #7

Merged
merged 15 commits into from Aug 28, 2022

Conversation

gagliardetto
Copy link
Owner

No description provided.

@gagliardetto
Copy link
Owner Author

@terorie pls take a look next week

decoder.go Outdated

func reflect_readArrayOfBytes(append bool, d *Decoder, l int, rv reflect.Value) error {
for i := 0; i < l; i++ {
n, err := d.ReadByte()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can save a lot of bounds checking here by

  • doing a bounds check for the entire array
  • allocating the entire slice
  • then filling it

Same for all the other primitive arrays and slices

reflect.Append is worse than pre-allocation when the slice size is known because the end result is the same (in terms of memory usage), but the former is less efficient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more optimization ideas:

  • On little endian machines without alignment requirements, no conversion is necessary for integer primitive types like uint16`. We could simply unsafely copy into the backing memory buffer of a slice.
  • In all simple copy cases, allocating a slice isn't even necessary – The decoder could unsafely construct a slice that points to a place in the serialized buffer

@gagliardetto
Copy link
Owner Author

gagliardetto commented Aug 22, 2022

Thanks!

Before:

Benchmark_uintSlices_Decode-8     	    6078	    179633 ns/op	   38688 B/op	    1297 allocs/op
Benchmark_uintSlices_append-8     	    3704	    302197 ns/op	   64144 B/op	    2829 allocs/op
Benchmark_uintSlices_allocate-8   	   10000	    103067 ns/op	   16408 B/op	    1026 allocs/op

After:

Benchmark_uintSlices_Decode-8        	   95216	     12386 ns/op	    4256 B/op	       6 allocs/op
Benchmark_uintSlices_append-8        	   13005	     92601 ns/op	   16432 B/op	    1027 allocs/op
Benchmark_uintSlices_preallocate-8   	   12608	     92314 ns/op	   24624 B/op	    1028 allocs/op

(NOTE: the name poorly reflects what happens inside in the AFTER)

@gagliardetto
Copy link
Owner Author

gagliardetto commented Aug 22, 2022

Okay, so now for slices of uints:

  • checking if remaining bytes are enough for the desired slice size.
  • slice of that specific type is pre-allocated.
  • no reflect is involved in the creation of the slice or the writing to it.
  • when the slice is ready, it's set upstream using reflect.

@gagliardetto
Copy link
Owner Author

Ooops, wrong type in the benchmark.

@gagliardetto
Copy link
Owner Author

gagliardetto commented Aug 22, 2022

Final benchmark:

BEFORE optimizing uint slices (underlying use of reflect.Append)

goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8         	    3636	    326426 ns/op	  131864 B/op	    5137 allocs/op
Benchmark_uintSlice64_Decode_make-8           	    3724	    308653 ns/op	  131864 B/op	    5137 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8   	    3643	    302873 ns/op	  132008 B/op	    5143 allocs/op
Benchmark_uintSlice64_Decode_field_make-8     	    3644	    303124 ns/op	  132008 B/op	    5143 allocs/op
Benchmark_uintSlice64_reflect_noMake-8        	   24592	     50456 ns/op	   16480 B/op	    1028 allocs/op
Benchmark_uintSlice64_reflect_make-8          	   24235	     49264 ns/op	   16480 B/op	    1028 allocs/op
Benchmark_uintSlice32_Decode_noMake-8         	    4402	    246662 ns/op	  107296 B/op	    4113 allocs/op
Benchmark_uintSlice32_Decode_make-8           	    4700	    251511 ns/op	  107296 B/op	    4113 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8   	    4592	    253552 ns/op	  107440 B/op	    4119 allocs/op
Benchmark_uintSlice32_Decode_field_make-8     	    4332	    256575 ns/op	  107440 B/op	    4119 allocs/op
Benchmark_uintSlice32_reflect_noMake-8        	  173953	      6631 ns/op	    4192 B/op	       4 allocs/op
Benchmark_uintSlice32_reflect_make-8          	  165073	      6599 ns/op	    4192 B/op	       4 allocs/op
PASS
ok  	github.com/gagliardetto/binary	15.178s

BEFORE optimizing uint slices (underlying use of reflect.MakeSlice + rv.Index(i).Set)

goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8         	    5041	    244716 ns/op	   90272 B/op	    4102 allocs/op
Benchmark_uintSlice64_Decode_make-8           	    5509	    216908 ns/op	   90272 B/op	    4102 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8   	    5554	    217488 ns/op	   90416 B/op	    4108 allocs/op
Benchmark_uintSlice64_Decode_field_make-8     	    4992	    227863 ns/op	   90416 B/op	    4108 allocs/op
Benchmark_uintSlice64_reflect_noMake-8        	   22930	     53717 ns/op	   16480 B/op	    1028 allocs/op
Benchmark_uintSlice64_reflect_make-8          	   23380	     49189 ns/op	   16480 B/op	    1028 allocs/op
Benchmark_uintSlice32_Decode_noMake-8         	    6961	    166826 ns/op	   73888 B/op	    3078 allocs/op
Benchmark_uintSlice32_Decode_make-8           	    6466	    169598 ns/op	   73888 B/op	    3078 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8   	    6379	    166451 ns/op	   74032 B/op	    3084 allocs/op
Benchmark_uintSlice32_Decode_field_make-8     	    6475	    170186 ns/op	   74032 B/op	    3084 allocs/op
Benchmark_uintSlice32_reflect_noMake-8        	  179046	      6314 ns/op	    4192 B/op	       4 allocs/op
Benchmark_uintSlice32_reflect_make-8          	  169556	      6537 ns/op	    4192 B/op	       4 allocs/op
PASS
ok  	github.com/gagliardetto/binary	18.773s

AFTER optimizing for uint slices

$ go test  -bench "uintSlice" -benchmem

goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8         	   22210	     52099 ns/op	   16544 B/op	    1030 allocs/op
Benchmark_uintSlice64_Decode_make-8           	   22062	     54300 ns/op	   16544 B/op	    1030 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8   	   21668	     54316 ns/op	   16688 B/op	    1036 allocs/op
Benchmark_uintSlice64_Decode_field_make-8     	   22428	     54018 ns/op	   16688 B/op	    1036 allocs/op
Benchmark_uintSlice64_reflect_noMake-8        	   22785	     53440 ns/op	   16480 B/op	    1028 allocs/op
Benchmark_uintSlice64_reflect_make-8          	   22976	     52906 ns/op	   16480 B/op	    1028 allocs/op
Benchmark_uintSlice32_Decode_noMake-8         	  149244	      7155 ns/op	    4256 B/op	       6 allocs/op
Benchmark_uintSlice32_Decode_make-8           	  160652	      7149 ns/op	    4256 B/op	       6 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8   	  151970	      7551 ns/op	    4400 B/op	      12 allocs/op
Benchmark_uintSlice32_Decode_field_make-8     	  151761	      7708 ns/op	    4400 B/op	      12 allocs/op
Benchmark_uintSlice32_reflect_noMake-8        	  160108	      7006 ns/op	    4192 B/op	       4 allocs/op
Benchmark_uintSlice32_reflect_make-8          	  154717	      7048 ns/op	    4192 B/op	       4 allocs/op
PASS
ok  	github.com/gagliardetto/binary	17.686s

@gagliardetto
Copy link
Owner Author

If this looks good, I'll proceed tomorrow to merge. @terorie

@gagliardetto gagliardetto merged commit 2b26380 into master Aug 28, 2022
@gagliardetto
Copy link
Owner Author

After simplifying readNBytes:

goos: linux
goarch: amd64
pkg: github.com/gagliardetto/binary
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark_uintSlice64_Decode_noMake-8                    	  109407	     10240 ns/op	    8352 B/op	       6 allocs/op
Benchmark_uintSlice64_Decode_make-8                      	  104725	     10590 ns/op	    8352 B/op	       6 allocs/op
Benchmark_uintSlice64_Decode_field_noMake-8              	  106695	     10673 ns/op	    8496 B/op	      12 allocs/op
Benchmark_uintSlice64_Decode_field_make-8                	  109975	     11053 ns/op	    8496 B/op	      12 allocs/op
Benchmark_uintSlice64_readArray_noMake-8                 	  109645	     10126 ns/op	    8288 B/op	       4 allocs/op
Benchmark_uintSlice64_readArray_make-8                   	  104791	     10162 ns/op	    8288 B/op	       4 allocs/op
Benchmark_uintSlice64_Decode_field_withCustomDecoder-8   	  110866	     10326 ns/op	    8240 B/op	       2 allocs/op
Benchmark_uintSlice32_Decode_noMake-8                    	  162374	      7324 ns/op	    4256 B/op	       6 allocs/op
Benchmark_uintSlice32_Decode_make-8                      	  140947	      7607 ns/op	    4256 B/op	       6 allocs/op
Benchmark_uintSlice32_Decode_field_noMake-8              	  130076	      7981 ns/op	    4400 B/op	      12 allocs/op
Benchmark_uintSlice32_Decode_field_make-8                	  127812	      8376 ns/op	    4400 B/op	      12 allocs/op
Benchmark_uintSlice32_readArray_noMake-8                 	  150466	      7251 ns/op	    4192 B/op	       4 allocs/op
Benchmark_uintSlice32_readArray_make-8                   	  160896	      6962 ns/op	    4192 B/op	       4 allocs/op
Benchmark_uintSlice32_Decode_field_withCustomDecoder-8   	  151675	      7046 ns/op	    4144 B/op	       2 allocs/op

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants