huff0: Add amd64 assembly for low tablelogs #518

Merged
merged 2 commits into from Mar 8, 2022

klauspost
Owner

@klauspost klauspost commented Mar 8, 2022

Solid improvement on large payloads, but a regression on small ones, so we fall back to Go for those.

```
benchmark                                                   old MB/s     new MB/s     speedup
BenchmarkDecompress4XNoTable/digits/100-32                  296.37       296.75       1.00x
BenchmarkDecompress4XNoTable/digits/10000-32                693.64       919.28       1.33x
BenchmarkDecompress4XNoTable/digits/262143-32               631.10       876.14       1.39x
BenchmarkDecompress4XNoTable/gettysburg/100-32              343.16       342.84       1.00x
BenchmarkDecompress4XNoTable/twain/100-32                   297.23       296.70       1.00x
BenchmarkDecompress4XNoTable/low-ent.10k/100-32             267.55       267.49       1.00x
BenchmarkDecompress4XNoTable/low-ent.10k/10000-32           752.34       916.33       1.22x
BenchmarkDecompress4XNoTable/low-ent.10k/262143-32          778.37       1021.03      1.31x
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-32     777.72       1024.48      1.32x
BenchmarkDecompress4XNoTable/case1/100-32                   307.09       302.20       0.98x
BenchmarkDecompress4XNoTable/case1/10000-32                 693.03       912.43       1.32x
BenchmarkDecompress4XNoTable/case1/262143-32                696.06       948.19       1.36x
BenchmarkDecompress4XNoTable/case2/100-32                   292.50       287.26       0.98x
BenchmarkDecompress4XNoTable/case2/10000-32                 713.88       920.04       1.29x
BenchmarkDecompress4XNoTable/case2/262143-32                724.83       979.52       1.35x
BenchmarkDecompress4XNoTable/case3/100-32                   298.72       296.64       0.99x
BenchmarkDecompress4XNoTable/case3/10000-32                 704.81       928.19       1.32x
BenchmarkDecompress4XNoTable/case3/262143-32                708.32       973.26       1.37x
BenchmarkDecompress4XNoTable/pngdata.001/100-32             285.21       272.15       0.95x
BenchmarkDecompress4XNoTable/normcount2/100-32              335.74       335.80       1.00x
BenchmarkDecompress4XNoTable/normcount2/10000-32            677.11       911.46       1.35x
BenchmarkDecompress4XNoTable/normcount2/262143-32           682.78       939.67       1.38x
BenchmarkDecompress4XNoTableTableLog8/digits-32             678.59       931.37       1.37x
```

Without the fallback: solid improvement on large payloads, but a regression on small ones.

```
benchmark                                                   old MB/s     new MB/s     speedup
BenchmarkDecompress4XNoTable/digits/100-32                  296.37       154.65       0.52x
BenchmarkDecompress4XNoTable/digits/10000-32                693.64       916.51       1.32x
BenchmarkDecompress4XNoTable/digits/262143-32               631.10       876.36       1.39x
BenchmarkDecompress4XNoTable/gettysburg/100-32              343.16       204.11       0.59x
BenchmarkDecompress4XNoTable/twain/100-32                   297.23       162.42       0.55x
BenchmarkDecompress4XNoTable/low-ent.10k/100-32             267.55       149.23       0.56x
BenchmarkDecompress4XNoTable/low-ent.10k/10000-32           752.34       912.03       1.21x
BenchmarkDecompress4XNoTable/low-ent.10k/262143-32          778.37       1023.79      1.32x
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-32     777.72       1020.92      1.31x
BenchmarkDecompress4XNoTable/case1/100-32                   307.09       158.09       0.51x
BenchmarkDecompress4XNoTable/case1/10000-32                 693.03       899.10       1.30x
BenchmarkDecompress4XNoTable/case1/262143-32                696.06       954.08       1.37x
BenchmarkDecompress4XNoTable/case2/100-32                   292.50       159.58       0.55x
BenchmarkDecompress4XNoTable/case2/10000-32                 713.88       924.82       1.30x
BenchmarkDecompress4XNoTable/case2/262143-32                724.83       975.00       1.35x
BenchmarkDecompress4XNoTable/case3/100-32                   298.72       158.89       0.53x
BenchmarkDecompress4XNoTable/case3/10000-32                 704.81       926.74       1.31x
BenchmarkDecompress4XNoTable/case3/262143-32                708.32       969.43       1.37x
BenchmarkDecompress4XNoTable/pngdata.001/100-32             285.21       153.79       0.54x
BenchmarkDecompress4XNoTable/normcount2/100-32              335.74       205.54       0.61x
BenchmarkDecompress4XNoTable/normcount2/10000-32            677.11       910.41       1.34x
BenchmarkDecompress4XNoTable/normcount2/262143-32           682.78       942.87       1.38x
BenchmarkDecompress4XNoTableTableLog8/digits-32             678.59       932.54       1.37x
```
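
The fallback described above (Go for small payloads, assembly for larger ones) could be sketched roughly as follows. This is only an illustration, not the code from this PR: `minAsmSize`, `pickDecoder`, `decodeGo` and `decodeAsm` are hypothetical names, the cutoff value is an assumption (the PR does not state the actual threshold), and both decoders are stand-ins that just copy bytes.

```go
package main

import "fmt"

// minAsmSize is a hypothetical cutoff. The PR does not state the real
// threshold; it would have to be picked from benchmarks like the ones above.
const minAsmSize = 1024

// decodeFunc returns the name of the path taken and the bytes written.
type decodeFunc func(dst, src []byte) (string, int)

// decodeGo stands in for the pure Go decoder (assumption: it just copies).
func decodeGo(dst, src []byte) (string, int) { return "go", copy(dst, src) }

// decodeAsm stands in for the amd64 assembly decoder (assumption: it just copies).
func decodeAsm(dst, src []byte) (string, int) { return "asm", copy(dst, src) }

// pickDecoder chooses the assembly path only when it is available and the
// output is large enough to amortize its fixed setup cost.
func pickDecoder(outSize int, haveAsm bool) decodeFunc {
	if haveAsm && outSize >= minAsmSize {
		return decodeAsm
	}
	return decodeGo
}

func main() {
	for _, n := range []int{100, 10000, 262143} {
		src := make([]byte, n)
		dst := make([]byte, n)
		name, written := pickDecoder(n, true)(dst, src)
		fmt.Printf("%d bytes -> %s path (%d written)\n", n, name, written)
	}
}
```

With a cutoff like this, the 100-byte cases stay on the Go path and keep their old throughput, while the 10000-byte and 262143-byte cases take the faster path.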
@klauspost
Owner Author

@WojciechMula Added asm for low tablelogs. Should be the final cases of 4X up to speed.

@WojciechMula
Contributor

> @WojciechMula Added asm for low tablelogs. Should be the final cases of 4X up to speed.

Great! TBH I wasn't sure if this 8-bit specialisation was worth rewriting; it never appeared in the perf report. It's good to be wrong sometimes. :)

@klauspost
Owner Author

You are right, though it is low-hanging fruit, so we might as well take it.

@klauspost klauspost merged commit 275e1fc into master Mar 8, 2022
@klauspost klauspost deleted the huff0-add-8bit-asm branch March 8, 2022 14:14