
Threaded rayon #246

Open
HeroicKatora wants to merge 2 commits into master

Conversation

HeroicKatora (Member)
Intends to address the issue of effectively serialized decode (#245), where all
tasks end up being executed on the main thread instead of being
distributed to other workers.

We had neglected that most work is scheduled synchronously (append_row, such
as in decoder.rs:903, instead of append_rows). This meant that most tasks were
executed with an immediate strategy.

The change pushes all items into a bounded task queue that is drained
and actively worked on when it reaches its capacity limit, as well as
whenever any component result is requested. This is in contrast to the
std::multithreading approach, where items are worked on while decoding is
still in progress, but where task queueing itself has more overhead.
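Roughly, the queueing idea looks like the following minimal sketch. The boxed-closure `Task` type, the `CAPACITY` constant, and the `flush` helper are illustrative assumptions, not the actual names used in the PR:

```rust
use rayon::prelude::*;

/// A hypothetical unit of deferred work (e.g. dequantize + IDCT for one row).
type Task = Box<dyn FnOnce() + Send>;

/// Illustrative capacity; the real value would be tuned against scheduling overhead.
const CAPACITY: usize = 64;

#[derive(Default)]
struct TaskQueue {
    pending: Vec<Task>,
}

impl TaskQueue {
    /// Enqueue work; once the bounded queue is full, hand the whole batch to rayon.
    fn push(&mut self, task: Task) {
        self.pending.push(task);
        if self.pending.len() >= CAPACITY {
            self.flush();
        }
    }

    /// Also called whenever a component result is requested, so no work is left behind.
    fn flush(&mut self) {
        let batch = std::mem::take(&mut self.pending);
        // Execute the batch on rayon's worker threads instead of the caller's thread.
        batch.into_par_iter().for_each(|task| task());
    }
}
```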

cc: @Shnatsel

These tests are being emulated, with the instruction count adding significantly to
the runtime. As it stands this is >20 minutes for a full run, which is
the bottleneck of our CI times. It should not be necessary to run these
with debug flags. The main reason to have them at all, the arch/aarch64
inline SIMD code, is not affected by the flags in any case.

decode a 512x512 JPEG   time:   [1.7317 ms 1.7352 ms 1.7388 ms]
                        change: [-22.895% -22.646% -22.351%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

decode a 512x512 progressive JPEG
                        time:   [4.7252 ms 4.7364 ms 4.7491 ms]
                        change: [-15.641% -15.349% -15.052%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

decode a 512x512 grayscale JPEG
                        time:   [873.48 us 877.71 us 882.83 us]
                        change: [-11.470% -10.764% -10.041%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  9 (9.00%) high mild
  2 (2.00%) high severe

extract metadata from an image
                        time:   [1.1033 us 1.1066 us 1.1099 us]
                        change: [-11.608% -9.8026% -8.3965%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild

Benchmarking decode a 3072x2048 RGB Lossless JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 36.6s, or reduce sample count to 10.
decode a 3072x2048 RGB Lossless JPEG
                        time:   [363.07 ms 363.66 ms 364.27 ms]
                        change: [+0.0997% +0.3692% +0.6323%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

     Running unittests (target/release/deps/large_image-0e61f2c2f07410bd)
Gnuplot not found, using plotters backend
decode a 2268x1512 JPEG time:   [28.755 ms 28.879 ms 29.021 ms]
                        change: [-5.7714% -4.9308% -4.0969%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
Shnatsel (Contributor) commented May 14, 2022

According to the profile, this did actually parallelize things, but now it requires Huffman decoding to complete before Rayon kicks in.

The performance results are a toss-up: on some images this approach is 20% faster; on others it is 25% slower. See corrected data below.

According to the profile, Huffman decoding completes before we start performing IDCT - in contrast to the rayonless version. Is there perhaps a way to combine the pipelining and Rayon's low-overhead scheduling? Say, spawn worker threads up front and have them pick tasks off the task queue immediately?
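(For reference, a pipelined worker of that kind could look roughly like the sketch below, using plain std threads and a shared channel; the names and the channel choice are illustrative and not how the crate's existing worker is written.)

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Task = Box<dyn FnOnce() + Send>;

/// Spawn `n` workers up front; each worker starts executing tasks as soon
/// as they appear on the queue, while the producer keeps decoding.
fn spawn_workers(n: usize) -> mpsc::Sender<Task> {
    let (tx, rx) = mpsc::channel::<Task>();
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..n {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            // Pull the next task as soon as one is available.
            let task = match rx.lock().unwrap().recv() {
                Ok(task) => task,
                Err(_) => break, // all senders dropped: shut the worker down
            };
            task();
        });
    }
    tx
}
```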

HeroicKatora (Member, Author)

> Say, spawn worker threads up front and have them pick tasks off the task queue immediately?

We can't, or at the very least I don't see how. rayon doesn't offer anything like a JoinHandle. We can either wait on the task immediately in the method that schedules it (rayon::scope etc.) or schedule it at a later date. It sucks.

We can try adjusting the constants here, or specifically adding some serialization points, though.
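To illustrate the two options, here is a minimal sketch assuming a per-row `idct_row` stand-in: either the scheduling call itself blocks until the work is done (`rayon::scope`), or the work is fired off with `rayon::spawn` and the result has to be retrieved later through a channel, since there is no JoinHandle to wait on.

```rust
use std::sync::mpsc;

fn blocking_variant(row: Vec<u8>) -> Vec<u8> {
    let mut result = Vec::new();
    // Option 1: rayon::scope waits for every spawned task before returning,
    // so the scheduling method itself blocks on the work.
    rayon::scope(|s| {
        s.spawn(|_| {
            result = idct_row(row);
        });
    });
    result
}

fn deferred_variant(row: Vec<u8>) -> mpsc::Receiver<Vec<u8>> {
    // Option 2: rayon::spawn is fire-and-forget; there is no JoinHandle,
    // so the result has to be smuggled out through a channel and picked
    // up at some later point.
    let (tx, rx) = mpsc::channel();
    rayon::spawn(move || {
        let _ = tx.send(idct_row(row));
    });
    rx
}

// Stand-in for the per-row work (dequantization + IDCT in the real decoder).
fn idct_row(row: Vec<u8>) -> Vec<u8> {
    row
}
```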

Shnatsel (Contributor)

Wait, I have actually measured it incorrectly. I need to turn off all the OTHER parallelization that Rayon does to make an apples-to-apples comparison. It might very well be that this approach with IDCT on Rayon is always slower than the DIY parallelization.

HeroicKatora (Member, Author)

Actually, I have a devious and plainly stupid idea as well regarding join handles. I had assumed that scopes can only be ended in code by stepping out of them (side note: that was a long-held error in my thought pattern, in a totally unrelated topic as well), but with async fn that is no longer true. If we could convince rayon that an async_scope is useful, then we could indeed have join handles for (basically) free: hold on to the future that will finalize and terminate the scope when dropped.

Apart from being a very rough idea that I haven't presented well… that's interesting.

Shnatsel (Contributor)

Ah, turning off the parallel color conversion (also gated on the rayon feature) makes this PR 25% slower even on its previously "good" cases, and 30% slower on the large CMYK test image.

I'm afraid this approach is strictly inferior to the manual parallelization.

If parallelization overhead is a problem for small images, could we perhaps simply skip it heuristically based on the expected image size?
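A minimal sketch of such a size cutoff, with a purely illustrative threshold and hypothetical `decode_*` stand-ins for the decoder's single- and multi-threaded paths:

```rust
/// Below this many pixels, the scheduling overhead likely outweighs the
/// benefit of spreading IDCT across worker threads (illustrative value).
const PARALLEL_PIXEL_THRESHOLD: u64 = 512 * 512;

fn decode(width: u16, height: u16) {
    let pixels = u64::from(width) * u64::from(height);
    if pixels < PARALLEL_PIXEL_THRESHOLD {
        decode_single_threaded();
    } else {
        decode_multi_threaded();
    }
}

// Hypothetical stand-ins for the decoder's two code paths.
fn decode_single_threaded() {}
fn decode_multi_threaded() {}
```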
