huff0: translate asm implementation into avo program #543

Closed
wants to merge 1 commit into from

WojciechMula
Contributor

@WojciechMula WojciechMula commented Mar 24, 2022

Currently register allocation fails for some reason. Fixes #529

@klauspost
Owner

Some notes:

  • Don't pre-allocate and re-use temp vars. Instead "allocate" them just as you need them. In your code, you keep "temp" allocated across calls since you pass it as a parameter. This gives the allocator more space to work.
  • Dereference the bitreader between calls, so you don't need to keep it alive all the time.
  • Made bmi/non-bmi versions. We already have the detection code and don't require the cleanup.
  • Moved "off" to memory. Used the return index, which is already allocated and gets updated in place.
  • Moved RCX/CL references to a variable to make it more clear what is happening.
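
A minimal sketch of why allocating temporaries only where they are needed gives the allocator more room. This is a toy model, not avo's actual liveness analysis: `liveRange` and `peakPressure` are hypothetical names, and the instruction indices are made up for illustration.

```go
package main

import "fmt"

// liveRange is a hypothetical model of one virtual register: it is live from
// the instruction index where it is defined to the index of its last use.
type liveRange struct{ def, lastUse int }

// peakPressure returns the largest number of ranges live at the same time.
// The maximum overlap of closed intervals always occurs at some start point,
// so checking every def index is enough.
func peakPressure(ranges []liveRange) int {
	peak := 0
	for _, r := range ranges {
		live := 0
		for _, o := range ranges {
			if o.def <= r.def && r.def <= o.lastUse {
				live++
			}
		}
		if live > peak {
			peak = live
		}
	}
	return peak
}

func main() {
	// Three values that are only needed in the middle of the function.
	loopState := []liveRange{{3, 6}, {3, 6}, {3, 6}}

	// One temp allocated up front and passed through every helper call.
	shared := append([]liveRange{{0, 10}}, loopState...)
	// Two temps, each allocated only where it is needed.
	scoped := append([]liveRange{{0, 2}, {7, 10}}, loopState...)

	fmt.Println("shared temp, peak live registers:", peakPressure(shared))
	fmt.Println("scoped temps, peak live registers:", peakPressure(scoped))
}
```

In this model the shared temp stays live across the middle section and raises the peak from 3 to 4; with a fixed number of physical registers, that one extra live value can be the difference between a successful allocation and a failure.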

The 8H is limited to just a few registers (which you probably know), which avo needs to know up-front. Maybe there is a bug in avo. I used manual assignment for those.

"Compiling" version: https://gist.github.com/klauspost/8f8dbbd9745662464dfac37d00cbd5f6

@WojciechMula WojciechMula force-pushed the avo-decode4x branch 2 times, most recently from 418c6e8 to 45efab9 March 25, 2022 12:23
@WojciechMula
Contributor Author

There are significant regressions. The asm code cached more values in GPRs.

benchmark                                                   old ns/op     new ns/op     delta
BenchmarkDecompress4XNoTable/digits/100-16                  424           421           -0.59%
BenchmarkDecompress4XNoTable/digits/10000-16                11797         53166         +350.67%
BenchmarkDecompress4XNoTable/digits/262143-16               316076        1393336       +340.82%
BenchmarkDecompress4XNoTable/gettysburg/100-16              326           322           -1.23%
BenchmarkDecompress4XNoTable/gettysburg/10000-16            11727         54531         +365.00%
BenchmarkDecompress4XNoTable/gettysburg/262143-16           328128        1462261       +345.64%
BenchmarkDecompress4XNoTable/twain/100-16                   416           413           -0.86%
BenchmarkDecompress4XNoTable/twain/10000-16                 11762         56049         +376.53%
BenchmarkDecompress4XNoTable/twain/262143-16                399616        1521951       +280.85%
BenchmarkDecompress4XNoTable/low-ent.10k/100-16             445           437           -1.91%
BenchmarkDecompress4XNoTable/low-ent.10k/10000-16           11643         44926         +285.86%
BenchmarkDecompress4XNoTable/low-ent.10k/262143-16          273885        1175472       +329.18%
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-16     274214        1176740       +329.13%
BenchmarkDecompress4XNoTable/case1/100-16                   410           409           -0.39%
BenchmarkDecompress4XNoTable/case1/10000-16                 11863         47098         +297.02%
BenchmarkDecompress4XNoTable/case1/262143-16                298821        1226254       +310.36%
BenchmarkDecompress4XNoTable/case2/100-16                   421           416           -1.33%
BenchmarkDecompress4XNoTable/case2/10000-16                 11657         48478         +315.87%
BenchmarkDecompress4XNoTable/case2/262143-16                291177        1272254       +336.93%
BenchmarkDecompress4XNoTable/case3/100-16                   408           407           -0.17%
BenchmarkDecompress4XNoTable/case3/10000-16                 11664         49914         +327.93%
BenchmarkDecompress4XNoTable/case3/262143-16                294305        1297013       +340.70%
BenchmarkDecompress4XNoTable/pngdata.001/100-16             439           429           -2.23%
BenchmarkDecompress4XNoTable/pngdata.001/10000-16           11965         45253         +278.21%
BenchmarkDecompress4XNoTable/pngdata.001/262143-16          307526        1215161       +295.14%
BenchmarkDecompress4XNoTable/normcount2/100-16              337           329           -2.17%
BenchmarkDecompress4XNoTable/normcount2/10000-16            11978         50669         +323.02%
BenchmarkDecompress4XNoTable/normcount2/262143-16           303912        1330331       +337.74%
BenchmarkDecompress4XNoTableTableLog8/digits-16             113774        531650        +367.29%
BenchmarkDecompress4XTable/digits-16                        114840        532410        +363.61%
BenchmarkDecompress4XTable/gettysburg-16                    3447          9315          +170.23%
BenchmarkDecompress4XTable/twain-16                         402565        1520643       +277.74%
BenchmarkDecompress4XTable/low-ent.10k-16                   43690         179044        +309.81%
BenchmarkDecompress4XTable/superlow-ent-10k-16              12669         47572         +275.50%
BenchmarkDecompress4XTable/case1-16                         2078          2102          +1.15%
BenchmarkDecompress4XTable/case2-16                         2037          2064          +1.33%
BenchmarkDecompress4XTable/case3-16                         2036          2040          +0.20%
BenchmarkDecompress4XTable/pngdata.001-16                   58369         236666        +305.47%
BenchmarkDecompress4XTable/normcount2-16                    1437          1432          -0.35%

benchmark                                                   old MB/s     new MB/s     speedup
BenchmarkDecompress4XNoTable/digits/100-16                  236.15       237.51       1.01x
BenchmarkDecompress4XNoTable/digits/10000-16                847.70       188.09       0.22x
BenchmarkDecompress4XNoTable/digits/262143-16               829.37       188.14       0.23x
BenchmarkDecompress4XNoTable/gettysburg/100-16              307.23       311.03       1.01x
BenchmarkDecompress4XNoTable/gettysburg/10000-16            852.73       183.38       0.22x
BenchmarkDecompress4XNoTable/gettysburg/262143-16           798.90       179.27       0.22x
BenchmarkDecompress4XNoTable/twain/100-16                   240.16       242.22       1.01x
BenchmarkDecompress4XNoTable/twain/10000-16                 850.20       178.41       0.21x
BenchmarkDecompress4XNoTable/twain/262143-16                655.99       172.24       0.26x
BenchmarkDecompress4XNoTable/low-ent.10k/100-16             224.64       228.97       1.02x
BenchmarkDecompress4XNoTable/low-ent.10k/10000-16           858.90       222.59       0.26x
BenchmarkDecompress4XNoTable/low-ent.10k/262143-16          957.13       223.01       0.23x
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-16     955.98       222.77       0.23x
BenchmarkDecompress4XNoTable/case1/100-16                   243.67       244.60       1.00x
BenchmarkDecompress4XNoTable/case1/10000-16                 842.98       212.32       0.25x
BenchmarkDecompress4XNoTable/case1/262143-16                877.26       213.78       0.24x
BenchmarkDecompress4XNoTable/case2/100-16                   237.31       240.51       1.01x
BenchmarkDecompress4XNoTable/case2/10000-16                 857.84       206.28       0.24x
BenchmarkDecompress4XNoTable/case2/262143-16                900.29       206.05       0.23x
BenchmarkDecompress4XNoTable/case3/100-16                   245.01       245.47       1.00x
BenchmarkDecompress4XNoTable/case3/10000-16                 857.37       200.35       0.23x
BenchmarkDecompress4XNoTable/case3/262143-16                890.72       202.11       0.23x
BenchmarkDecompress4XNoTable/pngdata.001/100-16             227.91       233.09       1.02x
BenchmarkDecompress4XNoTable/pngdata.001/10000-16           835.80       220.98       0.26x
BenchmarkDecompress4XNoTable/pngdata.001/262143-16          852.42       215.73       0.25x
BenchmarkDecompress4XNoTable/normcount2/100-16              297.01       303.54       1.02x
BenchmarkDecompress4XNoTable/normcount2/10000-16            834.86       197.36       0.24x
BenchmarkDecompress4XNoTable/normcount2/262143-16           862.56       197.05       0.23x
BenchmarkDecompress4XNoTableTableLog8/digits-16             878.96       188.10       0.21x
BenchmarkDecompress4XTable/digits-16                        870.80       187.83       0.22x
BenchmarkDecompress4XTable/gettysburg-16                    449.10       166.18       0.37x
BenchmarkDecompress4XTable/twain-16                         651.18       172.39       0.26x
BenchmarkDecompress4XTable/low-ent.10k-16                   915.55       223.41       0.24x
BenchmarkDecompress4XTable/superlow-ent-10k-16              828.81       220.72       0.27x
BenchmarkDecompress4XTable/case1-16                         26.46        26.17        0.99x
BenchmarkDecompress4XTable/case2-16                         22.09        21.80        0.99x
BenchmarkDecompress4XTable/case3-16                         23.57        23.53        1.00x
BenchmarkDecompress4XTable/pngdata.001-16                   877.18       216.34       0.25x
BenchmarkDecompress4XTable/normcount2-16                    60.54        60.75        1.00x

@WojciechMula WojciechMula changed the title from "[skip ci] huff0: translate asm implementation into avo program" to "huff0: translate asm implementation into avo program" Mar 25, 2022
@klauspost
Owner

@WojciechMula There must be something else. There is no way reading/writing L1 cached values will slow down that much, maybe a percent or two.

It will be a couple of days before I can look at this.

@klauspost klauspost self-requested a review March 28, 2022 10:33
@WojciechMula
Contributor Author

I managed to fix obvious mistakes, but still, there are regressions. I will investigate it further, just dumping the current state.

benchmark                                                   old ns/op     new ns/op     delta
BenchmarkDecompress4XNoTable/digits/100-16                  424           418           -1.20%
BenchmarkDecompress4XNoTable/digits/10000-16                11797         19903         +68.71%
BenchmarkDecompress4XNoTable/digits/262143-16               316076        509543        +61.21%
BenchmarkDecompress4XNoTable/gettysburg/100-16              326           321           -1.32%
BenchmarkDecompress4XNoTable/gettysburg/10000-16            11727         15252         +30.06%
BenchmarkDecompress4XNoTable/gettysburg/262143-16           328128        404332        +23.22%
BenchmarkDecompress4XNoTable/twain/100-16                   416           413           -0.91%
BenchmarkDecompress4XNoTable/twain/10000-16                 11762         15229         +29.48%
BenchmarkDecompress4XNoTable/twain/262143-16                399616        454245        +13.67%
BenchmarkDecompress4XNoTable/low-ent.10k/100-16             445           437           -1.84%
BenchmarkDecompress4XNoTable/low-ent.10k/10000-16           11643         20342         +74.71%
BenchmarkDecompress4XNoTable/low-ent.10k/262143-16          273885        510006        +86.21%
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-16     274214        509801        +85.91%
BenchmarkDecompress4XNoTable/case1/100-16                   410           408           -0.63%
BenchmarkDecompress4XNoTable/case1/10000-16                 11863         19904         +67.78%
BenchmarkDecompress4XNoTable/case1/262143-16                298821        511902        +71.31%
BenchmarkDecompress4XNoTable/case2/100-16                   421           415           -1.59%
BenchmarkDecompress4XNoTable/case2/10000-16                 11657         19992         +71.50%
BenchmarkDecompress4XNoTable/case2/262143-16                291177        511883        +75.80%
BenchmarkDecompress4XNoTable/case3/100-16                   408           408           -0.07%
BenchmarkDecompress4XNoTable/case3/10000-16                 11664         19853         +70.21%
BenchmarkDecompress4XNoTable/case3/262143-16                294305        510981        +73.62%
BenchmarkDecompress4XNoTable/pngdata.001/100-16             439           428           -2.53%
BenchmarkDecompress4XNoTable/pngdata.001/10000-16           11965         15898         +32.87%
BenchmarkDecompress4XNoTable/pngdata.001/262143-16          307526        404434        +31.51%
BenchmarkDecompress4XNoTable/normcount2/100-16              337           330           -2.14%
BenchmarkDecompress4XNoTable/normcount2/10000-16            11978         19868         +65.87%
BenchmarkDecompress4XNoTable/normcount2/262143-16           303912        511161        +68.19%
BenchmarkDecompress4XNoTableTableLog8/digits-16             113774        195346        +71.70%
BenchmarkDecompress4XTable/digits-16                        114840        196381        +71.00%
BenchmarkDecompress4XTable/gettysburg-16                    3447          3941          +14.33%
BenchmarkDecompress4XTable/twain-16                         402565        457229        +13.58%
BenchmarkDecompress4XTable/low-ent.10k-16                   43690         79427         +81.80%
BenchmarkDecompress4XTable/superlow-ent-10k-16              12669         21837         +72.37%
BenchmarkDecompress4XTable/case1-16                         2078          2083          +0.24%
BenchmarkDecompress4XTable/case2-16                         2037          2026          -0.54%
BenchmarkDecompress4XTable/case3-16                         2036          2024          -0.59%
BenchmarkDecompress4XTable/pngdata.001-16                   58369         79526         +36.25%
BenchmarkDecompress4XTable/normcount2-16                    1437          1432          -0.35%

@klauspost
Owner

Hint: Disabling BMI2 brings back most of the performance. 8 bit version also appears worse.

@klauspost
Owner

It seems like BMI just isn't a gain here.

It seems we have enough regs to move this back out of the main loop:

	br0 := Dereference(Param("pbr0"))
	br1 := Dereference(Param("pbr1"))
	br2 := Dereference(Param("pbr2"))
	br3 := Dereference(Param("pbr3"))

	Comment("Main loop")

Also 8 bit variant is a tiny bit slower here. Does it improve things on your side?

@klauspost
Owner

If you want a good speedup, decode directly to out and avoid the memcopy on every loop.

Technically you don't even need to return between loops with that.

@WojciechMula
Contributor Author

> It seems like BMI just isn't a gain here.

I need to investigate it, there might be some strange code generated.

> It seems we have enough regs to move this back out of the main loop:
>
> 	br0 := Dereference(Param("pbr0"))
> 	br1 := Dereference(Param("pbr1"))
> 	br2 := Dereference(Param("pbr2"))
> 	br3 := Dereference(Param("pbr3"))
>
> 	Comment("Main loop")

True, but we had to introduce these dereferences because register allocation failed.

> Also 8 bit variant is a tiny bit slower here. Does it improve things on your side?

Will check it.

Contributor

@mmcloughlin mmcloughlin left a comment

Are you still having problems with the register allocator?

huff0/_generate/gen.go (outdated)
Comment on lines +31 to +23
Constraint(buildtags.Not("appengine").ToConstraint())
Constraint(buildtags.Not("noasm").ToConstraint())
Constraint(buildtags.Term("gc").ToConstraint())
Constraint(buildtags.Not("noasm").ToConstraint())
Contributor

You can use ConstraintExpr and just provide a string (in the old +build syntax).
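
To illustrate what such a string in the old +build syntax expresses, here is a small sketch that uses only the standard library's `go/build/constraint` package; it does not call avo. The tag list in `oldSyntax` is an assumption pieced together from the `Constraint` calls quoted above, and `satisfied` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// oldSyntax is an assumed example of a constraint in the old +build syntax:
// commas mean AND, spaces would mean OR.
const oldSyntax = "// +build !appengine,!noasm,gc"

// satisfied reports whether the constraint line holds for a set of tags.
func satisfied(line string, tags map[string]bool) (bool, error) {
	expr, err := constraint.Parse(line)
	if err != nil {
		return false, err
	}
	return expr.Eval(func(tag string) bool { return tags[tag] }), nil
}

func main() {
	expr, err := constraint.Parse(oldSyntax)
	if err != nil {
		panic(err)
	}
	// Print the same constraint as a //go:build style expression.
	fmt.Println(expr)

	withGC, _ := satisfied(oldSyntax, map[string]bool{"gc": true})
	withNoasm, _ := satisfied(oldSyntax, map[string]bool{"gc": true, "noasm": true})
	fmt.Println(withGC, withNoasm)
}
```

A single expression string also makes accidental duplicates, such as the repeated `noasm` term in the quoted lines, easier to spot.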

huff0/_generate/gen.go (outdated)
Comment on lines 239 to 244
peekBits := GP64()
buffer := GP64()
table := GP64()

Comment("Preload values")
{
Load(Param("peekBits"), peekBits)
Load(Param("buf"), buffer)
Load(Param("tbl"), table)
}
Contributor

Load returns the register. You can do it like this:

peekBits := Load(Param("peekBits"), GP64())

Contributor Author

Yes, I saw this in the examples. To me, keeping the allocation separate from the use is a bit cleaner.

@WojciechMula
Contributor Author

> Are you still having problems with the register allocator?

First of all, thank you for such a great tool!

I haven't worked on this PR recently, I'm getting back to it this week.

@mmcloughlin
Contributor

> I haven't worked on this PR recently, I'm getting back to it this week.

Let me know. As @klauspost alluded to, sometimes it's just a matter of writing the code to limit the number of live variables. However, there's a chance there are bugs or inefficiencies in avo's liveness analysis or register allocator.

@WojciechMula
Contributor Author

Today's findings are quite strange. The BMI2 functions are significantly slower, even though I now manage to keep all the values in registers, as in the version on master. I spent some time comparing the generated assembly with the current version and couldn't find any differences (other than the registers used). I'll continue tomorrow.

@mmcloughlin Would you please explain how I can convert an arbitrary pointer stored in a reg into a Component? I was unable to find any example and I'm a newbie in terms of Go types magic. Currently I hardcoded offsets (https://github.com/klauspost/compress/pull/543/commits/0105d90cc7ba160bdd0f6ddeffdf9b7a09645241#diff-ea089d652c358a82c9850f63ac418edf3f1869e9686199a3d35a5c44eb4a430a#L99), that's ugly.

@mmcloughlin
Contributor

> Would you please explain how I can convert an arbitrary pointer stored in a reg into a Component?

Sorry, can you elaborate on what you're trying to do?

@WojciechMula
Contributor Author

> > Would you please explain how I can convert an arbitrary pointer stored in a reg into a Component?
>
> Sorry, can you elaborate on what you're trying to do?

Oh, sorry for not being precise. When we have a pointer to a struct as a parameter, then reading fields is obvious:

s := Dereference(Param("structPtr"))
x := Load(s.Field("something"), GP64())

But suppose we read the pointer directly:

ptr := GP64()
MOVQ(Param("structPtr"), ptr)

And want to interpret that bare pointer as a structure, to use Load:

s := ???(ptr, "likely name of structure")
x := Load(s.Field("something"), GP64())

@klauspost
Owner

@WojciechMula Do you mean as we did in zstd: https://github.com/klauspost/compress/blob/master/zstd/_generate/gen.go#L165-L166

		ctx := Dereference(Param("ctx"))
		Load(ctx.Field("llState"), llState)

@WojciechMula
Contributor Author

@klauspost No, something else. OK, this is an actual snippet from this branch:

const bitReader_in = 0
const bitReader_off = bitReader_in + 3*8 // {ptr, len, cap}
const bitReader_value = bitReader_off + 8
const bitReader_bitsRead = bitReader_value + 8

func (d decompress4x) decodeTwoValues(id int, br, peekBits, table, buffer, off reg.GPVirtual, out, exhausted reg.GPPhysical) {
	brOffset := GP64()
	brBitsRead := GP64()
	brValue := GP64()

	MOVQ(Mem{Base: br, Disp: bitReader_off}, brOffset)
	MOVQ(Mem{Base: br, Disp: bitReader_value}, brValue)
	MOVBQZX(Mem{Base: br, Disp: bitReader_bitsRead}, brBitsRead)
        // snip
}

The br is an untyped pointer. I would like to convert that pointer into a value of type gotypes.Component and then be able to use the Load function and friends. For now, as you can see, I hardcoded the offsets into the struct.
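
As a sanity check on hardcoded displacements like these, the offsets can be compared against what the Go compiler reports. The sketch below is an assumption-laden illustration: `bitReader` here is a hypothetical stand-in whose fields are chosen to match the constants above (the real huff0 type may differ), and `fieldOffsets` and `wordSize` are made-up helper names.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bitReader is a hypothetical stand-in whose layout mirrors the constants
// bitReader_in, bitReader_off, bitReader_value and bitReader_bitsRead.
type bitReader struct {
	in       []byte // slice header: {ptr, len, cap}
	off      uint
	value    uint64
	bitsRead uint8
}

// fieldOffsets returns the byte offsets of off, value and bitsRead.
func fieldOffsets() [3]uintptr {
	var br bitReader
	return [3]uintptr{
		unsafe.Offsetof(br.off),
		unsafe.Offsetof(br.value),
		unsafe.Offsetof(br.bitsRead),
	}
}

// wordSize returns the pointer size in bytes on the current platform.
func wordSize() uintptr {
	return unsafe.Sizeof(uintptr(0))
}

func main() {
	o := fieldOffsets()
	fmt.Println("off:", o[0], "value:", o[1], "bitsRead:", o[2])
}
```

On a 64-bit target the three offsets come out as 24, 32 and 40, matching `3*8`, `+8` and `+8` in the snippet. A check of this kind in a test would at least catch the constants drifting if the struct changes.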

@klauspost
Owner

Ah, ok. Don't know how to do that.

> BMI2 functions are significantly slower

I find the same, so we shouldn't enable them. Let's analyze with numbers from https://github.com/InstLatx64/InstLatx64

	if d.bmi2 {
		SHRXQ(peekBits, brValue, val.As64()) // val = (value >> peek_bits) & mask
	} else {
		MOVQ(brValue, val.As64())
		MOVQ(peekBits, CX.As64())
		SHRQ(CX, val.As64()) // val = (value >> peek_bits) & mask
	}

BMI: SHRX r64, r64, r64: L: 0.26ns = 1.0c, T: 0.11ns = 0.42c (Zen 2); L: 0.78ns = 1.0c, T: 0.39ns = 0.50c (Ice Lake)

X86: the MOVs are just register renames; SHR r64, cl: L: 0.26ns = 1.0c, T: 0.09ns = 0.35c (Zen 2); L: 0.83ns = 1.1c, T: 0.78ns = 1.02c (Ice Lake)

So these can be expected to perform the same.

	MOVBQZX(v.As8(), CX.As64())
	if d.bmi2 {
		SHLXQ(v.As64(), brValue, brValue) // value <<= n
	} else {
		SHLQ(CX, brValue) // value <<= n
	}

Should be the same again, though the BMI version doesn't have to wait for CX if that move can't be executed ahead of time (which I expect it would be).

So I don't really see this being the cause. Either way, I trust the benchmarks in this. Let's keep BMI disabled.
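
For reference, both instruction sequences implement the same two operations, which can be sketched in plain Go. This is a simplified model with hypothetical names (`shiftedReader`, `peek`, `advance`); it leaves out refilling and the table lookup entirely.

```go
package main

import "fmt"

// shiftedReader is a simplified, hypothetical model of the bit reader state:
// the unread bits sit at the top of value.
type shiftedReader struct {
	value    uint64
	bitsRead uint8
}

// peek returns the top n bits without consuming them. This is the
// SHRX, or MOV + SHR by CL, step. It assumes 1 <= n <= 63.
func (r *shiftedReader) peek(n uint8) uint64 {
	return r.value >> (64 - n)
}

// advance consumes n bits. This is the SHLX, or SHL by CL, step.
func (r *shiftedReader) advance(n uint8) {
	r.value <<= n
	r.bitsRead += n
}

func main() {
	r := shiftedReader{value: 0xABC0_0000_0000_0000}
	fmt.Printf("peek 12 bits: %#x\n", r.peek(12))
	r.advance(4) // pretend the decoded symbol was 4 bits long
	fmt.Printf("peek 8 bits after advancing: %#x, bitsRead=%d\n", r.peek(8), r.bitsRead)
}
```

Whichever encoding is used, the dependency chain per symbol is the same: shift right to peek, look up the table entry, shift left by the symbol length.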

@WojciechMula
Contributor Author

WojciechMula commented Apr 29, 2022

I'm comparing the old assembly with the avo-generated one, instruction by instruction. For the non-BMI version the assembly is almost identical, with slightly different encodings in a few cases due to different registers being used. I'm now working on the BMI versions and will share the results soon. Exactly the same outcome: just different registers got allocated by avo.

@klauspost
Owner

@WojciechMula The "old" is not using BMI unless you set the v3 env var for Go 1.18

@WojciechMula
Contributor Author

I didn't figure out the reasons for the regressions. The assembly code generated by avo is practically identical to the handwritten procedure. The differences are IMHO negligible; they cannot cause 20-30% slowdowns.
TBH I'm running out of ideas. I will continue next week with a fresh mind. :)

@klauspost
Owner

klauspost commented Apr 29, 2022

I see at most a 1-3% regression. Nothing to really worry about.

@klauspost
Owner

Instead, if you can eliminate the memcopies, that would make a much bigger difference:

		copy(out, buf[0][:])
		copy(out[dstEvery:], buf[1][:])
		copy(out[dstEvery*2:], buf[2][:])
		copy(out[dstEvery*3:], buf[3][:])

Instead of decoding to buf, you can send 4 slices to write to. If you keep track of the bytes written you can get rid of the looping altogether.
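
A rough sketch of the idea of handing the decoder four destination slices, assuming the output is laid out as four consecutive regions of `dstEvery` bytes as in the `copy` calls above. `splitOutput` is a hypothetical helper, not a function from the package.

```go
package main

import "fmt"

// splitOutput returns four non-overlapping windows into out, one per stream.
// The first three are dstEvery bytes long; the last takes the remainder.
// It assumes len(out) >= 3*dstEvery. Capacities are clamped so a stream
// cannot grow into its neighbour's region.
func splitOutput(out []byte, dstEvery int) [4][]byte {
	return [4][]byte{
		out[:dstEvery:dstEvery],
		out[dstEvery : 2*dstEvery : 2*dstEvery],
		out[2*dstEvery : 3*dstEvery : 3*dstEvery],
		out[3*dstEvery:],
	}
}

func main() {
	out := make([]byte, 10)
	dst := splitOutput(out, 3)

	// A decoder would store each decoded symbol straight into its window
	// and keep a per-stream count of bytes written.
	var written [4]int
	for i := range dst {
		dst[i][written[i]] = byte('a' + i)
		written[i]++
	}
	fmt.Printf("%q\n", out)
}
```

Because the windows alias `out`, nothing has to be copied afterwards, and the per-stream counts are what would let the assembly keep running instead of returning after every batch.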

You could also look into branchless filling similar to #550 - I can't remember if I already looked at this.

It is not that I don't care about 2%, but I think there are bigger fish to catch.

@WojciechMula
Contributor Author

> I see at most a 1-3% regression. Nothing to really worry about.

Hm, on my Ice Lake machines there are 15-20% regressions in a few cases. But if you are happy with the current shape, we may merge this PR. Then I'll eliminate mem copying as you suggested.

@klauspost
Owner

Running this branch, I get these numbers on 4XNoTable:

λ benchcmp before.txt after.txt                                                               
benchmark                                                   old ns/op     new ns/op     delta 
BenchmarkDecompress4XNoTable/digits/100-32                  334           334           +0.21%
BenchmarkDecompress4XNoTable/digits/10000-32                10835         11017         +1.68%
BenchmarkDecompress4XNoTable/digits/262143-32               303585        310422        +2.25%
BenchmarkDecompress4XNoTable/gettysburg/100-32              285           285           +0.04%
BenchmarkDecompress4XNoTable/gettysburg/10000-32            11393         11480         +0.76%
BenchmarkDecompress4XNoTable/gettysburg/262143-32           327973        333221        +1.60%
BenchmarkDecompress4XNoTable/twain/100-32                   331           332           +0.36%
BenchmarkDecompress4XNoTable/twain/10000-32                 11458         11453         -0.04%
BenchmarkDecompress4XNoTable/twain/262143-32                374970        386719        +3.13%
BenchmarkDecompress4XNoTable/low-ent.10k/100-32             367           372           +1.17%
BenchmarkDecompress4XNoTable/low-ent.10k/10000-32           10812         10956         +1.33%
BenchmarkDecompress4XNoTable/low-ent.10k/262143-32          256684        260314        +1.41%
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-32     256839        261779        +1.92%
BenchmarkDecompress4XNoTable/case1/100-32                   318           320           +0.72%
BenchmarkDecompress4XNoTable/case1/10000-32                 10803         11021         +2.02%
BenchmarkDecompress4XNoTable/case1/262143-32                277377        280121        +0.99%
BenchmarkDecompress4XNoTable/case2/100-32                   345           342           -1.10%
BenchmarkDecompress4XNoTable/case2/10000-32                 10659         10870         +1.98%
BenchmarkDecompress4XNoTable/case2/262143-32                268723        274144        +2.02%
BenchmarkDecompress4XNoTable/case3/100-32                   333           333           +0.15%
BenchmarkDecompress4XNoTable/case3/10000-32                 10737         10806         +0.64%
BenchmarkDecompress4XNoTable/case3/262143-32                272268        276636        +1.60%
BenchmarkDecompress4XNoTable/pngdata.001/100-32             361           361           -0.11%
BenchmarkDecompress4XNoTable/pngdata.001/10000-32           11583         11683         +0.86%
BenchmarkDecompress4XNoTable/pngdata.001/262143-32          306257        313848        +2.48%
BenchmarkDecompress4XNoTable/normcount2/100-32              287           288           +0.49%
BenchmarkDecompress4XNoTable/normcount2/10000-32            10832         11073         +2.22%
BenchmarkDecompress4XNoTable/normcount2/262143-32           279908        283876        +1.42%
BenchmarkDecompress4XNoTableTableLog8/digits-32             107990        109990        +1.85%

@klauspost
Owner

Just tried removing the "peekBits" variable shift and creating a version that has 9,10 and 11 bits of fixed peek.

That was 200MB/s worse than the variable shift for relevant benchmarks. THAT is a surprise.

@WojciechMula
Contributor Author

> Just tried removing the "peekBits" variable shift and creating a version that has 9,10 and 11 bits of fixed peek.
>
> That was 200MB/s worse than the variable shift for relevant benchmarks. THAT is a surprise.

It's weird. I'm starting to suspect that there's something odd in the tests.

@klauspost
Owner

klauspost commented May 3, 2022

You are welcome to check, but I am pretty sure it holds up. It just shows you can never trust intuition or what "makes sense", and should always benchmark every small change. (see edit)

Simplifying and comparing these 3:

	if true {
		MOVQ(U32(64-d.nBits), CX.As64())
		MOVQ(brValue, val.As64())
		SHRQ(CX, val.As64()) // val = (value >> peek_bits) & mask
	} else if false {
		mask := GP64()
		MOVQ(U32(64-d.nBits|(d.nBits<<8)), mask)
		BEXTRQ(mask, brValue, val.As64())
	} else {
		MOVQ(brValue, val.As64())
		SHRQ(U8(64-d.nBits), val.As64()) // val = (value >> peek_bits) & mask
	}

The first is by far the fastest:

BenchmarkDecompress4XNoTable/gettysburg/10000-32          106276             11207 ns/op         892.32 MB/s
BenchmarkDecompress4XNoTable/gettysburg/10000-32           86511             13541 ns/op         738.50 MB/s
BenchmarkDecompress4XNoTable/gettysburg/10000-32           87301             13686 ns/op         730.69 MB/s

The existing code gave 857.25 MB/s.
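
The second variant relies on the BEXTR control word: the low byte holds the start bit and the next byte holds the field length. A small Go emulation, with hypothetical helper names `bextr` and `peekControl`, shows that for this control word it extracts the same bits as the plain right shift used in the other two variants.

```go
package main

import "fmt"

// bextr emulates the BMI1 BEXTR instruction on 64-bit operands: bits 7:0 of
// ctrl give the start bit, bits 15:8 give the field length.
func bextr(src, ctrl uint64) uint64 {
	start := ctrl & 0xff
	length := (ctrl >> 8) & 0xff
	if start >= 64 || length == 0 {
		return 0
	}
	v := src >> start
	if length >= 64 {
		return v
	}
	return v & (1<<length - 1)
}

// peekControl builds the control word from the snippet above:
// start = 64-nBits, length = nBits.
func peekControl(nBits uint64) uint64 {
	return (64 - nBits) | (nBits << 8)
}

func main() {
	value := uint64(0xDEADBEEFCAFEF00D)
	for _, nBits := range []uint64{9, 10, 11} {
		viaBextr := bextr(value, peekControl(nBits))
		viaShift := value >> (64 - nBits)
		fmt.Printf("nBits=%d bextr=%#x shift=%#x equal=%v\n",
			nBits, viaBextr, viaShift, viaBextr == viaShift)
	}
}
```

Since the field starts at bit `64-nBits` and runs to the top of the register, the mask in BEXTR removes nothing, so all three variants differ only in how the shift is encoded, not in the value produced.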

EDIT: Actually it seems I should have picked this up. Looking at Zen 2 timings:

236 AMD64           :SHR r64, imm8                         L:   0.29ns=  1.0c  T:   0.14ns=  0.46c
240 AMD64           :SHR r64, cl                           L:   0.29ns=  1.0c  T:   0.11ns=  0.37c

There are more shifting pipelines with variable shifts - see how throughput is a bit higher. It may be able to do 2 fixed shifts/cycle or 3 variable shifts per cycle, indicating different pipelines are used. It could also be the pipelines with fixed shifts are already doing work.

EDIT 2:

Seems like Intel (Tiger Lake here) has the opposite:

236 AMD64               :SHR r64, imm8                         L:   0.41ns=  1.0c  T:   0.21ns=  0.51c
240 AMD64               :SHR r64, cl                           L:   0.45ns=  1.1c  T:   0.41ns=  1.00c

Fixed shifts have 2x the throughput of variable shifts...
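
To make the comparison concrete, the T column can be turned into instructions per cycle. This sketch assumes the T value in cycles is a reciprocal throughput (cycles per instruction), which is how the numbers are read above; `perCycle` is a hypothetical helper and the figures are copied from the four timing lines.

```go
package main

import "fmt"

// perCycle converts a reciprocal throughput in cycles per instruction
// into instructions per cycle.
func perCycle(recipThroughput float64) float64 {
	return 1 / recipThroughput
}

func main() {
	rows := []struct {
		name   string
		cycles float64
	}{
		{"Zen 2      SHR r64, imm8", 0.46},
		{"Zen 2      SHR r64, cl", 0.37},
		{"Tiger Lake SHR r64, imm8", 0.51},
		{"Tiger Lake SHR r64, cl", 1.00},
	}
	for _, r := range rows {
		fmt.Printf("%-26s %.2f per cycle\n", r.name, perCycle(r.cycles))
	}
}
```

Read this way, the Zen 2 figures work out to roughly 2.2 immediate shifts versus 2.7 variable shifts per cycle, while Tiger Lake shows about 2 versus 1.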

@klauspost
Owner

klauspost commented May 3, 2022

@WojciechMula I factored out the bit filling code and made versions for 9,10 and 11 bits peek.

https://gist.github.com/klauspost/617e149f31f8967bc184f5a48c3834f4

This runs at the same speed; the change is mainly groundwork for future extensions. I would like to unconditionally fill 56 bits, so we have enough for 4x11, but I haven't gotten it to work yet.

There should also be less register use.

@WojciechMula
Contributor Author

@klauspost Great, and thank you for the answers. Yeah, I keep forgetting that intuition too often does not match reality. :)

Let me recheck the code on Ice Lake once more -- I'm giving myself 2-3 hours. If I don't find any obvious mistake, I propose to merge this code. And then I will pick #576. It may give a significant boost, as you wrote.

@WojciechMula
Contributor Author

This PR got replaced by #577.

@WojciechMula WojciechMula deleted the avo-decode4x branch May 12, 2022 11:26
Development

Successfully merging this pull request may close these issues.

Translate assembly text templates into avo programs
3 participants