Bzip input stream simple vectorization #611

konrad-kruczynski
Contributor

Hi guys.
Here's a simple approach to get some speed-up on BZip2 decompression. The rotation loop (yy[j] = yy[j - 1]) is vectorized in a portable way, i.e. by means of Vector<byte>, whose width is picked by the runtime, instead of e.g. Vector128<byte>, which is tied to a specific platform. The API is available from .NET Core/Standard 2.1 onwards, but the real gain starts with .NET Core 3.
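
For readers unfamiliar with the loop in question, here is a minimal sketch of the idea, not the code from this PR. The class and method names are hypothetical, and `yy` is assumed to be a plain `byte[]` whose element at `index` is moved to the front while everything before it shifts one slot to the right. The vectorized variant uses only `Vector<byte>.Count`, the `Vector<byte>(byte[], int)` constructor, and `CopyTo(byte[], int)` from `System.Numerics`.

```csharp
using System;
using System.Numerics;

public static class RotationSketch
{
    // Scalar reference: move yy[index] to the front,
    // shifting yy[0..index-1] one position to the right.
    public static void RotateScalar(byte[] yy, int index)
    {
        byte tmp = yy[index];
        for (int j = index; j > 0; j--)
        {
            yy[j] = yy[j - 1];
        }
        yy[0] = tmp;
    }

    // Same result, but the shift is done in chunks of Vector<byte>.Count bytes.
    public static void RotateVectorized(byte[] yy, int index)
    {
        int width = Vector<byte>.Count; // chosen by the runtime for the current CPU
        byte tmp = yy[index];
        int j = index;

        // Walk from the high end downwards: each chunk is fully loaded
        // before the overlapping store, and later loads only touch
        // elements that have not been overwritten yet.
        while (j >= width)
        {
            var chunk = new Vector<byte>(yy, j - width); // loads yy[j-width .. j-1]
            chunk.CopyTo(yy, j - width + 1);             // stores yy[j-width+1 .. j]
            j -= width;
        }

        // Scalar tail for the remaining (fewer than width) elements.
        for (; j > 0; j--)
        {
            yy[j] = yy[j - 1];
        }
        yy[0] = tmp;
    }
}
```

Both methods should leave the array in the same state for any valid `index`, which makes the scalar version a convenient oracle when testing the vectorized one.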

The first commits add .NET Core 3.1 as a target platform for the performance tests, along with a new performance test for BZip2 decompression. The last commit is the actual vectorization. Here are some results from two Intel machines (one running Windows, the other a rather antique MacBook Air).


Without vectorization:
First machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-USQLOW | .NET Core 2.1 | 4.595 s | 0.0185 s | 0.0164 s |  0.97 |
| DecompressData | Job-NMSPHA | .NET Core 3.1 | 4.730 s | 0.0490 s | 0.0434 s |  1.00 |
| DecompressData | Job-ORXZIN |        net461 | 4.723 s | 0.0314 s | 0.0279 s |  1.00 |

Second machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev |
|--------------- |----------- |-------------- |--------:|---------:|---------:|
| DecompressData | Job-TBYXJE | .NET Core 2.1 | 8.177 s | 0.0607 s | 0.0538 s |
| DecompressData | Job-ZFNFSZ | .NET Core 3.1 | 8.343 s | 0.1529 s | 0.2717 s |

With vectorization:
First machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-JSFVZF | .NET Core 2.1 | 4.534 s | 0.0386 s | 0.0302 s |  0.99 |
| DecompressData | Job-TNNLLN | .NET Core 3.1 | 2.335 s | 0.0252 s | 0.0211 s |  0.51 |
| DecompressData | Job-THQXFG |        net461 | 4.590 s | 0.0217 s | 0.0181 s |  1.00 |

Second machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev |
|--------------- |----------- |-------------- |--------:|---------:|---------:|
| DecompressData | Job-DYBSPD | .NET Core 2.1 | 7.931 s | 0.1080 s | 0.0957 s |
| DecompressData | Job-FKJXRZ | .NET Core 3.1 | 5.131 s | 0.1007 s | 0.1273 s |

Machine details.
First machine

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-JSFVZF : .NET Core 2.1.26 (CoreCLR 4.6.29812.02, CoreFX 4.6.29812.01), X64 RyuJIT
  Job-TNNLLN : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-THQXFG : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT

Second machine

BenchmarkDotNet=v0.12.1, OS=macOS 11.2.3 (20D91) [Darwin 20.3.0]
Intel Core i5-4250U CPU 1.30GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.202
  [Host]     : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT
  Job-DYBSPD : .NET Core 2.1.27 (CoreCLR 4.6.29916.01, CoreFX 4.6.29916.03), X64 RyuJIT
  Job-FKJXRZ : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT

On the test machines, the vectorized version cuts decompression time by about 35-50% compared with the non-vectorized one. Note that the gain is only observable from .NET Core 3 onwards (I also tested on .NET 5; the results are similar).
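
The percentage can be reproduced from the .NET Core 3.1 rows of the tables above. The helper below is only an illustrative sketch (the class and method names are made up for this note); it computes the fraction of running time saved as (before - after) / before, which gives roughly 0.51 for the first machine and roughly 0.38 for the second.

```csharp
using System;

public static class SpeedUpSketch
{
    // Fraction of the running time that was saved: (before - after) / before.
    public static double TimeSaved(double beforeSeconds, double afterSeconds)
    {
        if (beforeSeconds <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(beforeSeconds));
        }
        return (beforeSeconds - afterSeconds) / beforeSeconds;
    }

    // Mean times for .NET Core 3.1, taken from the tables above (seconds).
    public static double FirstMachine() => TimeSaved(4.730, 2.335);
    public static double SecondMachine() => TimeSaved(8.343, 5.131);
}
```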

I certify that I own, and have sufficient rights to contribute, all source code and related material intended to be compiled or integrated with the source code for the SharpZipLib open source product (the "Contribution"). My Contribution is licensed under the MIT License.

@codecov

codecov bot commented Apr 18, 2021

Codecov Report

Merging #611 (d8811c1) into master (7ed87d1) will increase coverage by 2.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #611      +/-   ##
==========================================
+ Coverage   70.96%   73.27%   +2.30%     
==========================================
  Files          68       68              
  Lines       13417     8718    -4699     
==========================================
- Hits         9522     6388    -3134     
+ Misses       3895     2330    -1565     
| Impacted Files | Coverage Δ |
|--------------- |----------- |
| .../ICSharpCode.SharpZipLib/BZip2/BZip2InputStream.cs | 74.70% <100.00%> (+2.62%) ⬆️ |
| ...pZipLib/Core/Exceptions/StreamDecodingException.cs | 60.00% <0.00%> (-6.67%) ⬇️ |
| ...pLib/Core/Exceptions/StreamUnsupportedException.cs | 60.00% <0.00%> (-6.67%) ⬇️ |
| .../Core/Exceptions/UnexpectedEndOfStreamException.cs | 60.00% <0.00%> (-6.67%) ⬇️ |
| src/ICSharpCode.SharpZipLib/Checksum/Adler32.cs | 81.81% <0.00%> (-3.29%) ⬇️ |
| ...pCode.SharpZipLib/Zip/Compression/PendingBuffer.cs | 74.13% <0.00%> (-3.14%) ⬇️ |
| .../ICSharpCode.SharpZipLib/Core/FileSystemScanner.cs | 48.73% <0.00%> (-2.91%) ⬇️ |
| ...e.SharpZipLib/Zip/Compression/InflaterDynHeader.cs | 91.48% <0.00%> (-2.17%) ⬇️ |
| ...harpZipLib/Zip/Compression/Streams/OutputWindow.cs | 67.74% <0.00%> (-2.07%) ⬇️ |
| src/ICSharpCode.SharpZipLib/Core/PathFilter.cs | 10.25% <0.00%> (-1.46%) ⬇️ |

... and 57 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@piksel left a comment

Nothing wrong with the implementation, but it took a while staring at it to understand what was happening. Perhaps some comments around it?

The speed improvements are really nice and badly needed for the bzip2 algo which is painfully slow (among other problems).

@Numpsy
Contributor

Numpsy commented Apr 28, 2021

I once had a (very) brief go at using the SSE intrinsics in the deflate code, but never tried just System.Numerics.Vector on its own (those require a NetCore3+ TFM to build though).
Nice performance gain from an isolated change though :-)

@konrad-kruczynski
Contributor Author

> Nothing wrong with the implementation, but it took a while staring at it to understand what was happening. Perhaps some comments around it?

Sure, I'll add some.

> The speed improvements are really nice and badly needed for the bzip2 algo which is painfully slow (among other problems).

I would also like some of you to confirm those speed improvements, just to be sure that my conclusions are correct.

@konrad-kruczynski
Contributor Author

> I once had a (very) brief go at using the SSE intrinsics in the deflate code, but never tried just System.Numerics.Vector on its own (those require a NetCore3+ TFM to build though).
> Nice performance gain from an isolated change though :-)

Right. The sole purpose of choosing System.Numerics.Vector instead of going directly with SSE class was portability, e.g. it will work on ARM out of the box. Nonetheless the first version was written using SSE directly.
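
To illustrate the portability point: with `System.Numerics`, the same source can ask the runtime at execution time how wide a vector is and whether it is hardware accelerated, rather than naming a specific instruction set. The snippet below is a small sketch under that assumption; the class and method names are hypothetical, and it is not part of this PR.

```csharp
using System;
using System.Numerics;

public static class VectorInfoSketch
{
    // Reports what the runtime offers on the current machine.
    // Vector<byte>.Count is the number of bytes processed per vector operation;
    // Vector.IsHardwareAccelerated tells whether the JIT maps Vector<T>
    // operations to SIMD instructions on this platform.
    public static string Describe()
    {
        int width = Vector<byte>.Count;
        bool accelerated = Vector.IsHardwareAccelerated;
        return $"Vector<byte>.Count = {width}, hardware accelerated = {accelerated}";
    }
}
```

Calling `VectorInfoSketch.Describe()` on different machines (x64, ARM) is a quick way to see which width the same binary ends up using.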

@Numpsy
Contributor

Numpsy commented Apr 30, 2021

> I would also like for some of you to confirm those speed improvements. Just to be sure that my conclusions are correct.

This is what I get using your branch:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1)
Intel Core i7-5820K CPU 3.30GHz (Broadwell), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.300-preview.21180.15
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-OUVGAT : .NET Core 2.1.26 (CoreCLR 4.6.29812.02, CoreFX 4.6.29812.01), X64 RyuJIT
  Job-MRCUFD : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-RAVTER : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT


|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-OUVGAT | .NET Core 2.1 | 4.850 s | 0.0354 s | 0.0332 s |  1.00 |
| DecompressData | Job-MRCUFD | .NET Core 3.1 | 2.457 s | 0.0063 s | 0.0056 s |  0.51 |
| DecompressData | Job-RAVTER |        net461 | 4.862 s | 0.0411 s | 0.0384 s |  1.00 |

@konrad-kruczynski
Contributor Author

konrad-kruczynski commented May 2, 2021

Good, this is very similar to my results. Since I was developing it on an M1, I was only able to run native tests on my daughter's laptop, and I also included some results from a friend's PC.

@piksel
Member

piksel commented May 3, 2021

Yeah, got basically the same results:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 7 3800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.300-preview.21180.15
  [Host]     : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT
  Job-NUXXKY : .NET Core 2.1.27 (CoreCLR 4.6.29916.01, CoreFX 4.6.29916.03), X64 RyuJIT
  Job-BNPYUR : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT
  Job-FDMENM : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT


|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-NUXXKY | .NET Core 2.1 | 3.698 s | 0.0520 s | 0.0461 s |  1.00 |
| DecompressData | Job-BNPYUR | .NET Core 3.1 | 1.900 s | 0.0373 s | 0.0415 s |  0.51 |
| DecompressData | Job-FDMENM |        net461 | 3.702 s | 0.0463 s | 0.0411 s |  1.00 |

@konrad-kruczynski
Contributor Author

Great. Is there any obstacle left to getting this merged soon?

@piksel piksel merged commit 1b9fcfc into icsharpcode:master May 4, 2021
HowToDoThis added a commit to HowToDoThis/SharpZipLib that referenced this pull request Aug 11, 2021