
optimise bitmaps #297

Merged
merged 1 commit into RoaringBitmap:master from seiflotfy:better-bitmap on Jan 25, 2021

Conversation

seiflotfy
Contributor

@seiflotfy seiflotfy commented Jan 19, 2021

Shifting bits and using binary operations instead of multiplication, division, and modulo yielded the following results.

name                                                                  old time/op    new time/op    delta
ShortIteratorAdvanceArray/init_size_1_shortIterator_advance-12          41.8ns ± 0%    40.3ns ± 0%   -3.48%  (p=0.016 n=4+5)
ShortIteratorAdvanceArray/init_size_650_shortIterator_advance-12        77.2ns ± 1%    76.1ns ± 2%     ~     (p=0.659 n=5+5)
ShortIteratorAdvanceArray/init_size_6500_shortIterator_advance-12       86.8ns ± 2%    89.4ns ± 5%     ~     (p=0.056 n=5+5)
ShortIteratorAdvanceArray/init_size_65535_shortIterator_advance-12      90.2ns ± 0%    90.0ns ± 2%     ~     (p=0.651 n=5+5)
ShortIteratorNextArray/init_size_1_shortIterator_next-12                37.9ns ± 1%    37.4ns ± 6%     ~     (p=0.135 n=5+5)
ShortIteratorNextArray/init_size_650_shortIterator_next-12              2.35µs ± 3%    2.23µs ± 1%   -5.29%  (p=0.008 n=5+5)
ShortIteratorNextArray/init_size_6500_shortIterator_next-12             22.2µs ± 1%    22.0µs ± 3%     ~     (p=0.690 n=5+5)
ShortIteratorNextArray/init_size_65535_shortIterator_next-12             226µs ± 1%     230µs ± 9%     ~     (p=0.841 n=5+5)
ShortIteratorAdvanceBitmap/init_size_1_shortIterator_advance-12         37.9ns ± 1%    36.2ns ± 1%   -4.49%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_650_shortIterator_advance-12       42.1ns ± 1%    38.9ns ± 0%   -7.74%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_6500_shortIterator_advance-12      44.1ns ± 3%    38.8ns ± 1%  -12.14%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_65535_shortIterator_advance-12     44.9ns ±17%    38.9ns ± 0%  -13.36%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12                582ns ± 2%     313ns ± 1%  -46.25%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12             5.14µs ± 1%    4.36µs ± 1%  -15.24%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12            45.8µs ± 2%    40.8µs ± 1%  -10.93%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12            455µs ± 0%     416µs ± 4%   -8.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_1_shortIterator_advance-12        43.8ns ± 1%    42.2ns ± 2%   -3.47%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_650_shortIterator_advance-12      49.2ns ± 1%    47.5ns ± 1%   -3.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_6500_shortIterator_advance-12     49.2ns ± 1%    47.6ns ± 1%   -3.21%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_65535_shortIterator_advance-12    49.0ns ± 1%    47.9ns ± 1%   -2.16%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_1_shortIterator_next-12              40.4ns ± 0%    39.2ns ± 1%   -2.97%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_650_shortIterator_next-12            3.60µs ± 2%    3.52µs ± 3%     ~     (p=0.095 n=5+5)
ShortIteratorNextRuntime/init_size_6500_shortIterator_next-12           35.1µs ± 1%    34.2µs ± 0%   -2.71%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_65535_shortIterator_next-12           352µs ± 1%     345µs ± 0%   -2.04%  (p=0.008 n=5+5)

@coveralls

coveralls commented Jan 19, 2021


Coverage remained the same at 86.986% when pulling b55300d on seiflotfy:better-bitmap into 6ec715d on RoaringBitmap:master.

bitmapcontainer.go
@@ -20,8 +22,8 @@ func (bc bitmapContainer) String() string {

 func newBitmapContainer() *bitmapContainer {
 	p := new(bitmapContainer)
-	size := (1 << 16) / 64
+	size := (1 << 16) >> log2WordSize
 	p.bitmap = make([]uint64, size, size)
Member

Do you expect this to be faster?

Contributor Author

Updated the benchmarks and the branch to show only the changes based on shifting. This might be a bit over the top, but I don't think the Go compiler is optimised for translating a division by a power of two.

Contributor Author

Seems like it's faster, right?

Member

Yes. Both size := (1 << 16) / 64 and size := (1 << 16) >> log2WordSize are strictly equivalent. We can check the assembly.

Member

Seems like it's faster, right?

I have not seriously looked at your benchmark numbers, but we need a rationale for why it is faster. Introducing a branch, for example, may make a benchmark appear to be faster, but that can be misleading.

There may be places where introducing a shift as opposed to a division by a compile-time power of two could be faster, mostly due to signed variables...

But the Go compiler most certainly can make these optimizations.

Contributor Author

I'll try to take a deeper look this week at why this might be the case. I'll see if there is a case for the signed variables; if so, I will change the code accordingly.

@seiflotfy seiflotfy force-pushed the better-bitmap branch 2 times, most recently from 5d805ec to b55300d on January 19, 2021 at 15:17
@lemire
Member

lemire commented Jan 19, 2021

Thanks for the PR.

I am going to take a critical stance here, however... because the Go compiler will definitely compile these functions to the same output:

func f1(x uint) uint {
    return x / 64
}

func f2(x uint) uint {
    return x >> 6
}

So we need to examine these changes with care.

-firstWord := firstOfRun / 64
-lastWord := lastOfRun / 64
+firstWord := firstOfRun >> log2WordSize
+lastWord := lastOfRun >> log2WordSize
Member

Here this will result in different compiled code because firstOfRun and lastOfRun are signed integers.

@@ -314,11 +316,11 @@ func (bc *bitmapContainer) iaddReturnMinimized(i uint16) container {

func (bc *bitmapContainer) iadd(i uint16) bool {
x := int(i)
Contributor Author

Why is this being converted into an int? Why not operate directly on the uint16, or make it a uint?

Member

It is entirely possible that it is an anti-pattern from a performance point of view and something that needs to be fixed.

Member

Changing it to x := uint(i) does not change the assembly in this case.

Bypassing the variable x seems to add an extra assembly instruction and is seemingly not cheaper.

Member

@lemire lemire left a comment


@seiflotfy

This is a well motivated PR and there are almost certainly some real gains. We want to go forward with this PR.

Before we do, please consider the following:

  1. I can do shift and masks in my head and know exactly what they mean, and presumably, you can too. For many other programmers, however, it makes the code hard to read. I submit to you that we should always try to favour code that is easy to read, unless we can demonstrate the benefits of our "hard to read code".

  2. The Go compiler is definitely smart enough to turn x/64 and x%64 into fast instructions when x is unsigned. When x is signed... well, your current PR changes the semantics by forcing the assumption that the variable is unsigned. We might as well go one step further and actually make the variable unsigned and then rely on the compiler. I submit to you that all cases except for my next point could be solved by turning the variables into unsigned integers. That is, I think you can achieve the very same result just by switching the signed integers to unsigned integers. If I am wrong, it would be interesting to know why and document it.

  3. At one precise point you turn a branchless routine into a branchy routine. I see you have reverted that. If you believe this is faster, then one should be really careful about the benchmark because if you use a benchmark where the branch is easily predicted, you will get artificially good results. You have to include a benchmark where the branch is really hard to predict... and in this case, it means that iadd should be sometimes adding a value that is already present, and sometimes not. You can look at the assembly of the current iadd function: https://godbolt.org/z/38qhjr As you can see, it is really, really cheap.

  4. Please have a look at the assembly of the functions: if the assembly is identical, then there is no need for benchmarking. And if the instruction count is reduced, then we also do not need benchmarking. That is, when tweaking purely computational code the way you do, I believe it is best to look at the assembly and just 'count the instructions'. That is, if you turn a division into a shift, you should be able to see it in the assembly, and then we don't need to benchmark: we know that a shift is cheaper. One can use godbolt or other more direct techniques to examine the assembly.

@seiflotfy
Contributor Author

On it boss :)
I'm really excited about this since we use it intensively...

@lemire
Member

lemire commented Jan 19, 2021

On it boss

Note that I am not trying to be difficult.

The approach you are taking is the one I took for my code contributions to https://github.com/willf/bitset/blob/master/bitset.go

So I wrote many of the same lines of code before!!!

The Java version of Roaring is also filled with shifts and masks.

But I have gained a bit of trust in the Go compiler since then, and I would write the code differently.

@seiflotfy
Contributor Author

@seiflotfy

This is a well motivated PR and there are almost certainly some real gains. We want to go forward with this PR.

Before we do, please consider the following:

  1. I can do shift and masks in my head and know exactly what they mean, and presumably, you can too. For many other programmers, however, it makes the code hard to read. I submit to you that we should always try to favour code that is easy to read, unless we can demonstrate the benefits of our "hard to read code".

I agree I prefer readable code over non-readable. So I will try to keep it that way.

  2. The Go compiler is definitely smart enough to turn x/64 and x%64 into fast instructions when x is unsigned. When x is signed... well, your current PR changes the semantics by forcing the assumption that the variable is unsigned. We might as well go one step further and actually make the variable unsigned and then rely on the compiler. I submit to you that all cases except for my next point could be solved by turning the variables into unsigned integers. That is, I think you can achieve the very same result just by switching the signed integers to unsigned integers. If I am wrong, it would be interesting to know why and document it.

I will try to switch to unsigned integers and will document the results.

  3. At one precise point you turn a branchless routine into a branchy routine. I see you have reverted that. If you believe this is faster, then one should be really careful about the benchmark because if you use a benchmark where the branch is easily predicted, you will get artificially good results. You have to include a benchmark where the branch is really hard to predict... and in this case, it means that iadd should be sometimes adding a value that is already present, and sometimes not. You can look at the assembly of the current iadd function: https://godbolt.org/z/38qhjr As you can see, it is really, really cheap.

I want to tackle the shifting problem before getting deeper into this so I will do this in a separate branch later :)

  4. Please have a look at the assembly of the functions: if the assembly is identical, then there is no need for benchmarking. And if the instruction count is reduced, then we also do not need benchmarking. That is, when tweaking purely computational code the way you do, I believe it is best to look at the assembly and just 'count the instructions'. That is, if you turn a division into a shift, you should be able to see it in the assembly, and then we don't need to benchmark: we know that a shift is cheaper. One can use godbolt or other more direct techniques to examine the assembly.

I will look into the assembly functions and try to document my findings as part of the PR :)

@seiflotfy
Contributor Author

Is there a possibility of the argument i in func (bc *bitmapContainer) NextSetBit(i int) int being negative? I don't think so.

@lemire
Member

lemire commented Jan 25, 2021

@seiflotfy

It can return a negative value:

func (bc *bitmapContainer) NextSetBit(i int) int {
	x := i / 64
	if x >= len(bc.bitmap) {
		return -1
	}
	w := bc.bitmap[x]
	w = w >> uint(i%64)
	if w != 0 {
		return i + countTrailingZeros(w)
	}
	x++
	for ; x < len(bc.bitmap); x++ {
		if bc.bitmap[x] != 0 {
			return (x * 64) + countTrailingZeros(bc.bitmap[x])
		}
	}
	return -1
}

But if you pass it a negative value (as an input) then you almost surely have a bug. (This could be explicitly documented.)

So yeah, the input could be made unsigned.

@seiflotfy
Contributor Author

With a little change:

func (bc *bitmapContainer) NextSetBit(i int) int {
	var (
		x      = uint(i) / 64
		length = uint(len(bc.bitmap))
	)

	if x >= length {
		return -1
	}

	w := bc.bitmap[x]
	w = w >> (uint(i) % 64)
	if w != 0 {
		return i + countTrailingZeros(w)
	}
	x++

	for ; x < length; x++ {
		if bc.bitmap[x] != 0 {
			return int(x*64) + countTrailingZeros(bc.bitmap[x])
		}
	}

	return -1
}

I was able to improve the iterator benchmarks:


name                                                           old time/op    new time/op    delta
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12         561ns ± 1%     312ns ± 0%  -44.39%  (p=0.002 n=10+4)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12      5.02µs ± 0%    4.35µs ± 1%  -13.36%  (p=0.001 n=9+5)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12     45.0µs ± 0%    40.6µs ± 1%   -9.81%  (p=0.001 n=10+5)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12     449µs ± 0%     409µs ± 1%   -8.91%  (p=0.001 n=9+5)

name                                                           old alloc/op   new alloc/op   delta
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12         16.0B ± 0%     16.0B ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12       16.0B ± 0%     16.0B ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12      16.0B ± 0%     16.0B ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12     16.0B ± 0%     16.0B ± 0%     ~     (all equal)

name                                                           old allocs/op  new allocs/op  delta
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12          1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

A cleaner way is to change the input to unsigned as you mentioned... Which one do you prefer?

@lemire
Member

lemire commented Jan 25, 2021

A cleaner way is to change the input to unsigned as you mentioned... Which one do you prefer?

I submit to you that your latest submitted code is more readable than the one with shifts.

If a variable is unsigned, the Go compiler is smart enough to replace a division by 64 with a shift.

For signed values, it may not be able to do so and still abide by the spec. Now, it could try to find out that the variable is always non-negative, but I think we agree that the Go compiler is not smart enough to do that.

@seiflotfy
Contributor Author

Updated the benchmarks:

name                                                                  old time/op    new time/op    delta
ShortIteratorAdvanceArray/init_size_1_shortIterator_advance-12          41.8ns ± 0%    40.3ns ± 0%   -3.48%  (p=0.016 n=4+5)
ShortIteratorAdvanceArray/init_size_650_shortIterator_advance-12        77.2ns ± 1%    76.1ns ± 2%     ~     (p=0.659 n=5+5)
ShortIteratorAdvanceArray/init_size_6500_shortIterator_advance-12       86.8ns ± 2%    89.4ns ± 5%     ~     (p=0.056 n=5+5)
ShortIteratorAdvanceArray/init_size_65535_shortIterator_advance-12      90.2ns ± 0%    90.0ns ± 2%     ~     (p=0.651 n=5+5)
ShortIteratorNextArray/init_size_1_shortIterator_next-12                37.9ns ± 1%    37.4ns ± 6%     ~     (p=0.135 n=5+5)
ShortIteratorNextArray/init_size_650_shortIterator_next-12              2.35µs ± 3%    2.23µs ± 1%   -5.29%  (p=0.008 n=5+5)
ShortIteratorNextArray/init_size_6500_shortIterator_next-12             22.2µs ± 1%    22.0µs ± 3%     ~     (p=0.690 n=5+5)
ShortIteratorNextArray/init_size_65535_shortIterator_next-12             226µs ± 1%     230µs ± 9%     ~     (p=0.841 n=5+5)
ShortIteratorAdvanceBitmap/init_size_1_shortIterator_advance-12         37.9ns ± 1%    36.2ns ± 1%   -4.49%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_650_shortIterator_advance-12       42.1ns ± 1%    38.9ns ± 0%   -7.74%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_6500_shortIterator_advance-12      44.1ns ± 3%    38.8ns ± 1%  -12.14%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_65535_shortIterator_advance-12     44.9ns ±17%    38.9ns ± 0%  -13.36%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12                582ns ± 2%     313ns ± 1%  -46.25%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12             5.14µs ± 1%    4.36µs ± 1%  -15.24%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12            45.8µs ± 2%    40.8µs ± 1%  -10.93%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12            455µs ± 0%     416µs ± 4%   -8.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_1_shortIterator_advance-12        43.8ns ± 1%    42.2ns ± 2%   -3.47%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_650_shortIterator_advance-12      49.2ns ± 1%    47.5ns ± 1%   -3.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_6500_shortIterator_advance-12     49.2ns ± 1%    47.6ns ± 1%   -3.21%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_65535_shortIterator_advance-12    49.0ns ± 1%    47.9ns ± 1%   -2.16%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_1_shortIterator_next-12              40.4ns ± 0%    39.2ns ± 1%   -2.97%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_650_shortIterator_next-12            3.60µs ± 2%    3.52µs ± 3%     ~     (p=0.095 n=5+5)
ShortIteratorNextRuntime/init_size_6500_shortIterator_next-12           35.1µs ± 1%    34.2µs ± 0%   -2.71%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_65535_shortIterator_next-12           352µs ± 1%     345µs ± 0%   -2.04%  (p=0.008 n=5+5)

It's a minimal change. I would suggest merging this, and I will continue looking for other places where things can be optimised because the compiler is not smart enough, making this initiative a series of small changes and improvements.

@lemire
Member

lemire commented Jan 25, 2021

I love it.

Please be aware that I wasn't nitpicking to be mean. :-)

Merging.

@lemire lemire merged commit 62f3e85 into RoaringBitmap:master Jan 25, 2021
@seiflotfy
Contributor Author

Hahaha
By all means, I never thought it was mean. I've been doing open source since before GitHub, so this was delightful and educational. I am happier with this result than with my first PR, and I'm already working on the next ones.
Cheers
Seif
