
optimise bitmaps #297

Merged
merged 1 commit into RoaringBitmap:master from seiflotfy:better-bitmap on Jan 25, 2021

Conversation

seiflotfy
Contributor

@seiflotfy seiflotfy commented Jan 19, 2021

Shifting bits and using binary operations instead of multiplication, division, and modulo yielded the following results.

name                                                                  old time/op    new time/op    delta
ShortIteratorAdvanceArray/init_size_1_shortIterator_advance-12          41.8ns ± 0%    40.3ns ± 0%   -3.48%  (p=0.016 n=4+5)
ShortIteratorAdvanceArray/init_size_650_shortIterator_advance-12        77.2ns ± 1%    76.1ns ± 2%     ~     (p=0.659 n=5+5)
ShortIteratorAdvanceArray/init_size_6500_shortIterator_advance-12       86.8ns ± 2%    89.4ns ± 5%     ~     (p=0.056 n=5+5)
ShortIteratorAdvanceArray/init_size_65535_shortIterator_advance-12      90.2ns ± 0%    90.0ns ± 2%     ~     (p=0.651 n=5+5)
ShortIteratorNextArray/init_size_1_shortIterator_next-12                37.9ns ± 1%    37.4ns ± 6%     ~     (p=0.135 n=5+5)
ShortIteratorNextArray/init_size_650_shortIterator_next-12              2.35µs ± 3%    2.23µs ± 1%   -5.29%  (p=0.008 n=5+5)
ShortIteratorNextArray/init_size_6500_shortIterator_next-12             22.2µs ± 1%    22.0µs ± 3%     ~     (p=0.690 n=5+5)
ShortIteratorNextArray/init_size_65535_shortIterator_next-12             226µs ± 1%     230µs ± 9%     ~     (p=0.841 n=5+5)
ShortIteratorAdvanceBitmap/init_size_1_shortIterator_advance-12         37.9ns ± 1%    36.2ns ± 1%   -4.49%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_650_shortIterator_advance-12       42.1ns ± 1%    38.9ns ± 0%   -7.74%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_6500_shortIterator_advance-12      44.1ns ± 3%    38.8ns ± 1%  -12.14%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_65535_shortIterator_advance-12     44.9ns ±17%    38.9ns ± 0%  -13.36%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12                582ns ± 2%     313ns ± 1%  -46.25%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12             5.14µs ± 1%    4.36µs ± 1%  -15.24%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12            45.8µs ± 2%    40.8µs ± 1%  -10.93%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12            455µs ± 0%     416µs ± 4%   -8.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_1_shortIterator_advance-12        43.8ns ± 1%    42.2ns ± 2%   -3.47%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_650_shortIterator_advance-12      49.2ns ± 1%    47.5ns ± 1%   -3.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_6500_shortIterator_advance-12     49.2ns ± 1%    47.6ns ± 1%   -3.21%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_65535_shortIterator_advance-12    49.0ns ± 1%    47.9ns ± 1%   -2.16%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_1_shortIterator_next-12              40.4ns ± 0%    39.2ns ± 1%   -2.97%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_650_shortIterator_next-12            3.60µs ± 2%    3.52µs ± 3%     ~     (p=0.095 n=5+5)
ShortIteratorNextRuntime/init_size_6500_shortIterator_next-12           35.1µs ± 1%    34.2µs ± 0%   -2.71%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_65535_shortIterator_next-12           352µs ± 1%     345µs ± 0%   -2.04%  (p=0.008 n=5+5)

@coveralls

coveralls commented Jan 19, 2021


Coverage remained the same at 86.986% when pulling b55300d on seiflotfy:better-bitmap into 6ec715d on RoaringBitmap:master.

bitmapcontainer.go
@@ -20,8 +22,8 @@ func (bc bitmapContainer) String() string {

 func newBitmapContainer() *bitmapContainer {
 	p := new(bitmapContainer)
-	size := (1 << 16) / 64
+	size := (1 << 16) >> log2WordSize
 	p.bitmap = make([]uint64, size, size)
Member

Do you expect this to be faster?

Contributor Author

Updated the benchmarks and the branch to show only the changes based on shifting. This might be a bit over the top, but I don't think the Go compiler is optimised for translating a division by a power of two.

Contributor Author

Seems like it's faster, right?

Member

Yes. Both size := (1 << 16) / 64 and size := (1 << 16) >> log2WordSize are strictly equivalent. We can check the assembly.

Member

Seems like it's faster, right?

I have not seriously looked at your benchmark numbers, but we need a rationale for why it is faster. Introducing a branch, for example, may make a benchmark appear to be faster, but that can be misleading.

There may be places where introducing a shift as opposed to a division by a compile-time power of two could be faster, mostly due to signed variables...

But the Go compiler most certainly can make these optimizations.

Contributor Author

I'll try to take a deeper look this week at why this might be the case. I'll see if there is a case for the signed variables; if so, I will change the code accordingly.

@seiflotfy seiflotfy force-pushed the better-bitmap branch 2 times, most recently from 5d805ec to b55300d on January 19, 2021 at 15:17
@lemire
Member

lemire commented Jan 19, 2021

Thanks for the PR.

I am going to take a critical stance here, however... because the Go compiler will definitely compile these functions to the same output:

func f1(x uint) uint {
    return x / 64
}

func f2(x uint) uint {
    return x >> 6
}

So we need to examine these changes with care.

-firstWord := firstOfRun / 64
-lastWord := lastOfRun / 64
+firstWord := firstOfRun >> log2WordSize
+lastWord := lastOfRun >> log2WordSize
Member

Here this will result in different compiled code because firstOfRun and lastOfRun are signed integers.

@@ -314,11 +316,11 @@ func (bc *bitmapContainer) iaddReturnMinimized(i uint16) container {

func (bc *bitmapContainer) iadd(i uint16) bool {
x := int(i)
Contributor Author

Why is this being converted into an int? Why not operate directly on the uint16, or make it a uint?

Member

It is entirely possible that it is an anti-pattern from a performance point of view and something that needs to be fixed.

Member

Changing it to x := uint(i) does not change the assembly in this case.

Bypassing the variable x seems to add an extra assembly instruction and is seemingly not cheaper.

Member

@lemire lemire left a comment


@seiflotfy

This is a well motivated PR and there are almost certainly some real gains. We want to go forward with this PR.

Before we do, please consider the following:

  1. I can do shift and masks in my head and know exactly what they mean, and presumably, you can too. For many other programmers, however, it makes the code hard to read. I submit to you that we should always try to favour code that is easy to read, unless we can demonstrate the benefits of our "hard to read code".

  2. The Go compiler is definitely smart enough to turn x/64 and x%64 into fast instructions when x is unsigned. When x is signed... well, your current PR changes the semantics by forcing the assumption that the variable is unsigned. We might as well go one step further and actually make the variable unsigned and then rely on the compiler. I submit to you that all cases except for my next point could be solved by turning the variables into unsigned integers. That is, I think you can achieve the very same result just by switching the signed integers to unsigned integers. If I am wrong, it would be interesting to know why and document it.

  3. At one precise point you turn a branchless routine into a branchy routine. I see you have reverted that. If you believe this is faster, then one should be really careful about the benchmark because if you use a benchmark where the branch is easily predicted, you will get artificially good results. You have to include a benchmark where the branch is really hard to predict... and in this case, it means that iadd should be sometimes adding a value that is already present, and sometimes not. You can look at the assembly of the current iadd function: https://godbolt.org/z/38qhjr As you can see, it is really, really cheap.

  4. Please have a look at the assembly of the functions: if the assembly is identical, then there is no need for benchmarking. And if the instruction count is reduced, then we also do not need benchmarking. That is, when tweaking purely computational code the way you do, I believe it is best to look at the assembly and just 'count the instructions'. That is, if you turn a division into a shift, you should be able to see it in the assembly, and then we don't need to benchmark: we know that a shift is cheaper. One can use godbolt or other more direct techniques to examine the assembly.

@seiflotfy
Contributor Author

On it boss :)
I'm really excited about this since we use it intensively...

@lemire
Member

lemire commented Jan 19, 2021

On it boss

Note that I am not trying to be difficult.

The approach you are taking is the one I took for my code contributions to https://github.com/willf/bitset/blob/master/bitset.go

So I wrote many of the same lines of code before!!!

The Java version of Roaring is also filled with shifts and masks.

But I have gained a bit of trust in the Go compiler since then, and I would write the code differently.

@seiflotfy
Contributor Author

@seiflotfy

This is a well motivated PR and there are almost certainly some real gains. We want to go forward with this PR.

Before we do, please consider the following:

  1. I can do shift and masks in my head and know exactly what they mean, and presumably, you can too. For many other programmers, however, it makes the code hard to read. I submit to you that we should always try to favour code that is easy to read, unless we can demonstrate the benefits of our "hard to read code".

I agree I prefer readable code over non-readable. So I will try to keep it that way.

  2. The Go compiler is definitely smart enough to turn x/64 and x%64 into fast instructions when x is unsigned. When x is signed... well, your current PR changes the semantics by forcing the assumption that the variable is unsigned. We might as well go one step further and actually make the variable unsigned and then rely on the compiler. I submit to you that all cases except for my next point could be solved by turning the variables into unsigned integers. That is, I think you can achieve the very same result just by switching the signed integers to unsigned integers. If I am wrong, it would be interesting to know why and document it.

I will try to switch to unsigned integers and will document the results.

  3. At one precise point you turn a branchless routine into a branchy routine. I see you have reverted that. If you believe this is faster, then one should be really careful about the benchmark because if you use a benchmark where the branch is easily predicted, you will get artificially good results. You have to include a benchmark where the branch is really hard to predict... and in this case, it means that iadd should be sometimes adding a value that is already present, and sometimes not. You can look at the assembly of the current iadd function: https://godbolt.org/z/38qhjr As you can see, it is really, really cheap.

I want to tackle the shifting problem before getting deeper into this so I will do this in a separate branch later :)

  4. Please have a look at the assembly of the functions: if the assembly is identical, then there is no need for benchmarking. And if the instruction count is reduced, then we also do not need benchmarking. That is, when tweaking purely computational code the way you do, I believe it is best to look at the assembly and just 'count the instructions'. That is, if you turn a division into a shift, you should be able to see it in the assembly, and then we don't need to benchmark: we know that a shift is cheaper. One can use godbolt or other more direct techniques to examine the assembly.

I will look into the assembly functions and try to document my findings as part of the PR :)

@seiflotfy
Contributor Author

Is there a possibility of the argument i in func (bc *bitmapContainer) NextSetBit(i int) int being negative? I don't think so.

@lemire
Member

lemire commented Jan 25, 2021

@seiflotfy

It can return a negative value:

func (bc *bitmapContainer) NextSetBit(i int) int {
	x := i / 64
	if x >= len(bc.bitmap) {
		return -1
	}
	w := bc.bitmap[x]
	w = w >> uint(i%64)
	if w != 0 {
		return i + countTrailingZeros(w)
	}
	x++
	for ; x < len(bc.bitmap); x++ {
		if bc.bitmap[x] != 0 {
			return (x * 64) + countTrailingZeros(bc.bitmap[x])
		}
	}
	return -1
}

But if you pass it a negative value (as an input) then you almost surely have a bug. (This could be explicitly documented.)

So yeah, the input could be made unsigned.

@seiflotfy
Contributor Author

With a little change:

func (bc *bitmapContainer) NextSetBit(i int) int {
	var (
		x      = uint(i) / 64
		length = uint(len(bc.bitmap))
	)

	if x >= length {
		return -1
	}

	w := bc.bitmap[x]
	w = w >> (uint(i) % 64)
	if w != 0 {
		return i + countTrailingZeros(w)
	}
	x++

	for ; x < length; x++ {
		if bc.bitmap[x] != 0 {
			return int(x*64) + countTrailingZeros(bc.bitmap[x])
		}
	}

	return -1
}

I was able to improve the iterator benchmarks:


name                                                           old time/op    new time/op    delta
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12         561ns ± 1%     312ns ± 0%  -44.39%  (p=0.002 n=10+4)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12      5.02µs ± 0%    4.35µs ± 1%  -13.36%  (p=0.001 n=9+5)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12     45.0µs ± 0%    40.6µs ± 1%   -9.81%  (p=0.001 n=10+5)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12     449µs ± 0%     409µs ± 1%   -8.91%  (p=0.001 n=9+5)

name                                                           old alloc/op   new alloc/op   delta
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12         16.0B ± 0%     16.0B ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12       16.0B ± 0%     16.0B ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12      16.0B ± 0%     16.0B ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12     16.0B ± 0%     16.0B ± 0%     ~     (all equal)

name                                                           old allocs/op  new allocs/op  delta
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12          1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

A cleaner way is to change the input to unsigned as you mentioned... Which one do you prefer?

@lemire
Member

lemire commented Jan 25, 2021

A cleaner way is to change the input to unsigned as you mentioned... Which one do you prefer?

I submit to you that your latest submitted code is more readable than the one with shifts.

If a variable is unsigned, the Go compiler is smart enough to replace a division by 64 with a shift.

For signed values, it may not be able to do so and still abide by the spec. Now, it could try to find out that the variable is always non-negative, but I think we agree that the Go compiler is not smart enough to do that.

@seiflotfy
Contributor Author

Updated the benchmarks:

name                                                                  old time/op    new time/op    delta
ShortIteratorAdvanceArray/init_size_1_shortIterator_advance-12          41.8ns ± 0%    40.3ns ± 0%   -3.48%  (p=0.016 n=4+5)
ShortIteratorAdvanceArray/init_size_650_shortIterator_advance-12        77.2ns ± 1%    76.1ns ± 2%     ~     (p=0.659 n=5+5)
ShortIteratorAdvanceArray/init_size_6500_shortIterator_advance-12       86.8ns ± 2%    89.4ns ± 5%     ~     (p=0.056 n=5+5)
ShortIteratorAdvanceArray/init_size_65535_shortIterator_advance-12      90.2ns ± 0%    90.0ns ± 2%     ~     (p=0.651 n=5+5)
ShortIteratorNextArray/init_size_1_shortIterator_next-12                37.9ns ± 1%    37.4ns ± 6%     ~     (p=0.135 n=5+5)
ShortIteratorNextArray/init_size_650_shortIterator_next-12              2.35µs ± 3%    2.23µs ± 1%   -5.29%  (p=0.008 n=5+5)
ShortIteratorNextArray/init_size_6500_shortIterator_next-12             22.2µs ± 1%    22.0µs ± 3%     ~     (p=0.690 n=5+5)
ShortIteratorNextArray/init_size_65535_shortIterator_next-12             226µs ± 1%     230µs ± 9%     ~     (p=0.841 n=5+5)
ShortIteratorAdvanceBitmap/init_size_1_shortIterator_advance-12         37.9ns ± 1%    36.2ns ± 1%   -4.49%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_650_shortIterator_advance-12       42.1ns ± 1%    38.9ns ± 0%   -7.74%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_6500_shortIterator_advance-12      44.1ns ± 3%    38.8ns ± 1%  -12.14%  (p=0.008 n=5+5)
ShortIteratorAdvanceBitmap/init_size_65535_shortIterator_advance-12     44.9ns ±17%    38.9ns ± 0%  -13.36%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_1_shortIterator_next-12                582ns ± 2%     313ns ± 1%  -46.25%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_650_shortIterator_next-12             5.14µs ± 1%    4.36µs ± 1%  -15.24%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_6500_shortIterator_next-12            45.8µs ± 2%    40.8µs ± 1%  -10.93%  (p=0.008 n=5+5)
ShortIteratorNextBitmap/init_size_65535_shortIterator_next-12            455µs ± 0%     416µs ± 4%   -8.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_1_shortIterator_advance-12        43.8ns ± 1%    42.2ns ± 2%   -3.47%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_650_shortIterator_advance-12      49.2ns ± 1%    47.5ns ± 1%   -3.45%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_6500_shortIterator_advance-12     49.2ns ± 1%    47.6ns ± 1%   -3.21%  (p=0.008 n=5+5)
ShortIteratorAdvanceRuntime/init_size_65535_shortIterator_advance-12    49.0ns ± 1%    47.9ns ± 1%   -2.16%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_1_shortIterator_next-12              40.4ns ± 0%    39.2ns ± 1%   -2.97%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_650_shortIterator_next-12            3.60µs ± 2%    3.52µs ± 3%     ~     (p=0.095 n=5+5)
ShortIteratorNextRuntime/init_size_6500_shortIterator_next-12           35.1µs ± 1%    34.2µs ± 0%   -2.71%  (p=0.008 n=5+5)
ShortIteratorNextRuntime/init_size_65535_shortIterator_next-12           352µs ± 1%     345µs ± 0%   -2.04%  (p=0.008 n=5+5)

It's a minimal change. I would suggest merging this, and I will continue looking for other places where things can be optimised because the compiler is not smart enough, making this initiative a series of small changes and improvements.

@lemire
Member

lemire commented Jan 25, 2021

I love it.

Please be aware that I wasn't nitpicking to be mean. :-)

Merging.

@lemire lemire merged commit 62f3e85 into RoaringBitmap:master Jan 25, 2021
@seiflotfy
Contributor Author

Hahaha
By all means, I never thought it was mean. I've been doing open source since before GitHub, so this was delightful and educational. I am happier with this result than with my first PR, and I'm already working on the next ones.
Cheers
Seif
