Avoid 32kB decompression lag + compact less often. #447

anforowicz · 2024-01-05T01:36:19Z

Avoiding 32kB decompression lag

Before this commit, decompressed data would be accumulated in ZlibStream::out_buffer and returned via image_data with 32kB lag corresponding to CHUNCK_BUFFER_SIZE:

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE);
    image_data.extend(self.out_buffer.drain(..safe));
    ...
```

32kB is a typical size of L1 cache, so the lag would mean that the data passed to image_data.extend(...) would be already cold and evicted from the L1 cache.

This commit avoids the lag by always returning into image_data all the data from out_buffer (i.e. data up to out_pos):

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    let transferred = &self.out_buffer[self.read_pos..self.out_pos];
    image_data.extend_from_slice(transferred);
    self.read_pos = self.out_pos;
    ...
```

Compacting less often

The changes above mean that Vec::drain no longer compacts out_buffer. Therefore this commit also refactors how this compaction works.

Before this commit, not-yet-returned data would be shifted to the beginning of out_buffer every time transfer_finished_data is called. This could potentially mean that for 1 returned byte, N bytes have to be copied during compaction.

After this commit, compaction is only done when the compaction cost if offset by many read bytes - for 3 returned bytes 1 byte has to be copied during compaction.

Performance impact

The commit has a positive impact on performance, except for:

decode/Transparency.png - regression between 15% and 20% is reported in 3-out-of-3 measurements.
decode/kodim17.png - a regression of 2.1% has been reported in 1-out-of-3 measurements (an improvement of 0.6% - 1.13% has been reported in the other 2-out-of-3 measurements).
generated-noncompressed-64k-idat/128x128.png - a regression of 25% has been reported in 1-out-of-3 measurements (an improvement of 21% - 29% has been reported in the other 2-out-of-3 measurements).

The results below have been gathered by running the decoder benchmark. First a baseline was saved before this commit, and then a comparison was done after the commit. This (the baseline + the comparison) was repeated a total of 3 times. All results below are for the relative impact on the runtime. All results are with p = 0.00 < 0.05.

decode/kodim23.png:
* [-2.9560% -2.7112% -2.4009%]
* [-3.4876% -3.3406% -3.1928%]
* [-3.0559% -2.9208% -2.7787%]
decode/kodim07.png:
* [-1.2527% -1.0110% -0.6780%]
* [-1.7851% -1.6558% -1.5164%]
* [-1.6576% -1.5216% -1.3856%]
decode/kodim02.png:
* [-0.5108% -0.2806% -0.0112%]
* [-1.0885% -0.9493% -0.8118%]
* [-0.5563% -0.4239% -0.2874%]
decode/kodim17.png:
* [+1.8649% +2.1138% +2.4169%] (regression)
* [-1.2891% -1.1322% -0.9736%]
* [-0.7753% -0.6276% -0.4866%]
decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
* [-1.7165% -1.4968% -1.2650%]
* [-1.7051% -1.4473% -1.2229%]
* [-1.2544% -1.0457% -0.8375%]
decode/Transparency.png:
* [+19.329% +19.789% +20.199%] (regression)
* [+15.337% +15.798% +16.294%] (regression)
* [+18.694% +19.106% +19.518%] (regression)
generated-noncompressed-4k-idat/8x8.png:
* [-2.3295% -1.9940% -1.5912%]
* [-6.1285% -5.8872% -5.6091%]
* [-2.8814% -2.6787% -2.4820%]
generated-noncompressed-4k-idat/128x128.png:
* [-59.793% -59.599% -59.417%]
* [-63.930% -63.846% -63.756%]
* [-62.377% -62.248% -62.104%]
generated-noncompressed-4k-idat/2048x2048.png:
* [-67.678% -67.579% -67.480%]
* [-65.616% -65.519% -65.429%]
* [-65.824% -65.647% -65.413%]
generated-noncompressed-4k-idat/12288x12288.png:
* [-60.932% -60.774% -60.528%]
* [-62.088% -62.016% -61.940%]
* [-61.663% -61.604% -61.546%]
generated-noncompressed-64k-idat/128x128.png:
* [-22.237% -21.975% -21.701%]
* [-29.656% -29.480% -29.311%]
* [+24.812% +25.190% +25.571%] (regression)
generated-noncompressed-64k-idat/2048x2048.png:
* [-21.826% -21.499% -21.087%]
* [-54.279% -54.049% -53.715%]
* [-11.174% -10.828% -10.482%]
generated-noncompressed-64k-idat/12288x12288.png:
* [-40.421% -40.311% -40.180%]
* [-39.496% -39.183% -38.871%]
* [-41.443% -41.367% -41.295%]
generated-noncompressed-2g-idat/2048x2048.png:
* [-40.136% -40.010% -39.865%]
* [-58.507% -58.333% -58.060%]
* [-35.822% -35.457% -35.038%]
generated-noncompressed-2g-idat/12288x12288.png:
* [-37.196% -37.107% -37.014%]
* [-36.125% -36.049% -35.970%]
* [-35.636% -35.477% -35.350%]

anforowicz · 2024-01-05T01:48:20Z

I think this may be worth landing - the performance improvements can be seen everywhere except the Transparency testcase.

I don't really understand what happens in the Transparency testcase. I can recover some of the regression with an additional commit in master...anforowicz:image-png:cap-incremental-decompression-size. Maybe we should also land this other commit (as a separate PR?). OTOH after this other commit the performance gains elsewhere are still there, but slightly smaller. I don't know how to decide between 1) landing just this commit/PR vs 2) landing this commit + the other commit. I think I am leaning toward just landing this commit/PR for now.

FWIW, I've also tested this commit on some small images from the top-500 benchmark described in #416 (comment). I've measured 10% - 15% performance improvements for those small images (Transparency.jpg is also small, but apparently being small is not the deciding factor).

fintelia · 2024-01-05T21:28:22Z

src/decoder/zlib.rs

+            // We can't reuse the rest of `out_buffer` because it hasn't been zeroed-out (as
+            // expected by `fdeflate`.  Because of this, we `truncate` the `out_buffer`.
+            self.out_buffer.truncate(preserved_len);


I eliminated the requirement in fdeflate about zeroing the buffer, so this shouldn't be needed

Done - removed these 3 lines.

This changes how much will be decompressed in subsequent calls to fn decompress though - this probably means that I should redo the performance measurements to double-check the performance impact is still the same. I'll try to do that next week.

fintelia · 2024-01-05T21:31:03Z

The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.

While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens fdeflate will set is_done and stop consuming additional input, and the ZlibStream then no-longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.

anforowicz · 2024-01-06T00:18:16Z

Thanks for looking!

The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.

Ack.

I won't have time for it this week, but I plan to eventually try to understand master...anforowicz:image-png:cap-incremental-decompression-size better:

try to come up with a hypothesis why it improves Transparency (I think the other commit also helps to keep things in the L1 cache, but I don't understand why the current-PR-under-review makes this necessary; and I want to have higher confidence in the other commit because I suspect that it will also help to avoid Read/BufRead regressions and I don't want to advocate blindly for the other commit without understanding why it helps on its own)
try to understand why 16k helps but 8k doesn't (+ check what happens with 32k even though this feels wrong because it would fill the whole L1 cache)
measure performance comparison against 1) baseline and 2) this PR (i.e. current-PR-under-review)
maybe measure stalled backend cycles (and L1 misses) before and after the other commit

While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens fdeflate will set is_done and stop consuming additional input, and the ZlibStream then no-longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.

Ack.

Question: is this something independent from the PR-under-review? Or is this a problem that you think the PR-under-review may be introducing or making worse?

I can help with adding a test for this and checking what happens. I think the new test can mostly mimic test_idat_bigger_than_image_size_from_ihdr (but instead of writing an IDAT chunk with a bigger image, it can just write the right-size-image and append some extraneous bytes).

BTW, I forgot to say that I hope that the PR-under-review doesn't get in the way for your idea from #429 (comment) of "changing ZlibStream to directly write into a caller provided buffer instead of using its own out_buffer". Avoiding the 32kB lag seems orthogonal/independent from this idea. Changing how out_buffer compaction works is somewhat related, but hopefully still doesn't get in the way too much.

FWIW, I mostly got motivated to look here because of how the Read/BufRead PR happens to interfere with L1 cache, hardware prefetches, etc. (#427)

Avoiding 32kB decompression lag =============================== Before this commit, decompressed data would be accumulated in `ZlibStream::out_buffer` and returned via `image_data` with 32kB lag corresponding to `CHUNCK_BUFFER_SIZE`: ``` fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize { let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE); image_data.extend(self.out_buffer.drain(..safe)); ... ``` 32kB is a typical size of L1 cache, so the lag would mean that the data passed to `image_data.extend(...)` would be already cold and evicted from the L1 cache. This commit avoids the lag by always returning into `image_data` all the data from `out_buffer` (i.e. data up to `out_pos`): ``` fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize { let transferred = &self.out_buffer[self.read_pos..self.out_pos]; image_data.extend_from_slice(transferred); self.read_pos = self.out_pos; ... ``` Compacting less often ===================== The changes above mean that `Vec::drain` no longer compacts `out_buffer`. Therefore this commit also refactors how this compaction works. Before this commit, not-yet-returned data would be shifted to the beginning of `out_buffer` every time `transfer_finished_data` is called. This could potentially mean that for 1 returned byte, N bytes have to be copied during compaction. After this commit, compaction is only done when the compaction cost if offset by many read bytes - for 3 returned bytes 1 byte has to be copied during compaction. Performance impact ================== The commit has a positive impact on performance, except for: * `decode/Transparency.png` - regression between 15% and 20% is reported in 3-out-of-3 measurements. * `decode/kodim17.png` - a regression of 2.1% has been reported in 1-out-of-3 measurements (an improvement of 0.6% - 1.13% has been reported in the other 2-out-of-3 measurements). * `generated-noncompressed-64k-idat/128x128.png` - a regression of 25% has been reported in 1-out-of-3 measurements (an improvement of 21% - 29% has been reported in the other 2-out-of-3 measurements). The results below have been gathered by running the `decoder` benchmark. First a baseline was saved before this commit, and then a comparison was done after the commit. This (the baseline + the comparison) was repeated a total of 3 times. All results below are for the relative impact on the runtime. All results are with p = 0.00 < 0.05. * decode/kodim23.png: * [-2.9560% -2.7112% -2.4009%] * [-3.4876% -3.3406% -3.1928%] * [-3.0559% -2.9208% -2.7787%] * decode/kodim07.png: * [-1.2527% -1.0110% -0.6780%] * [-1.7851% -1.6558% -1.5164%] * [-1.6576% -1.5216% -1.3856%] * decode/kodim02.png: * [-0.5108% -0.2806% -0.0112%] * [-1.0885% -0.9493% -0.8118%] * [-0.5563% -0.4239% -0.2874%] * decode/kodim17.png: * [+1.8649% +2.1138% +2.4169%] (**regression**) * [-1.2891% -1.1322% -0.9736%] * [-0.7753% -0.6276% -0.4866%] * decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png: * [-1.7165% -1.4968% -1.2650%] * [-1.7051% -1.4473% -1.2229%] * [-1.2544% -1.0457% -0.8375%] * decode/Transparency.png: * [+19.329% +19.789% +20.199%] (**regression**) * [+15.337% +15.798% +16.294%] (**regression**) * [+18.694% +19.106% +19.518%] (**regression**) * generated-noncompressed-4k-idat/8x8.png: * [-2.3295% -1.9940% -1.5912%] * [-6.1285% -5.8872% -5.6091%] * [-2.8814% -2.6787% -2.4820%] * generated-noncompressed-4k-idat/128x128.png: * [-59.793% -59.599% -59.417%] * [-63.930% -63.846% -63.756%] * [-62.377% -62.248% -62.104%] * generated-noncompressed-4k-idat/2048x2048.png: * [-67.678% -67.579% -67.480%] * [-65.616% -65.519% -65.429%] * [-65.824% -65.647% -65.413%] * generated-noncompressed-4k-idat/12288x12288.png: * [-60.932% -60.774% -60.528%] * [-62.088% -62.016% -61.940%] * [-61.663% -61.604% -61.546%] * generated-noncompressed-64k-idat/128x128.png: * [-22.237% -21.975% -21.701%] * [-29.656% -29.480% -29.311%] * [+24.812% +25.190% +25.571%] (**regression**) * generated-noncompressed-64k-idat/2048x2048.png: * [-21.826% -21.499% -21.087%] * [-54.279% -54.049% -53.715%] * [-11.174% -10.828% -10.482%] * generated-noncompressed-64k-idat/12288x12288.png: * [-40.421% -40.311% -40.180%] * [-39.496% -39.183% -38.871%] * [-41.443% -41.367% -41.295%] * generated-noncompressed-2g-idat/2048x2048.png: * [-40.136% -40.010% -39.865%] * [-58.507% -58.333% -58.060%] * [-35.822% -35.457% -35.038%] * generated-noncompressed-2g-idat/12288x12288.png: * [-37.196% -37.107% -37.014%] * [-36.125% -36.049% -35.970%] * [-35.636% -35.477% -35.350%]

anforowicz · 2024-01-08T23:24:44Z

Just a quick follow-up to #447 (comment) - I re-measured the performance and the results were mostly the same (i.e. as before addressing the comment / as with truncating and re-zero-ing out out_buffer). For the record, the results against the landed commit look as follows:

decode/kodim23.png:
[-3.1886% -3.0384% -2.8277%]
[-3.6101% -3.5022% -3.3806%]
[-3.9165% -3.7417% -3.6003%]

decode/kodim07.png:
[-1.1782% -1.0010% -0.8080%]
[-1.8537% -1.7999% -1.7421%]
[-1.5936% -1.3810% -1.2046%]

decode/kodim02.png:
[-1.9088% -1.7352% -1.5557%]
[-2.0096% -1.9522% -1.8895%]
[-2.2580% -2.0380% -1.8558%]

decode/kodim17.png:
[-0.9286% -0.7875% -0.5969%]
[-0.6255% -0.5532% -0.4817%]
[-1.5831% -1.3538% -1.1580%]

decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
[-1.9583% -1.6850% -1.3973%]
[-1.5138% -1.3103% -1.1106%]
[-1.4271% -1.1675% -0.8888%]

decode/Transparency.png:
[+14.271% +14.950% +15.689%]
[+14.740% +15.057% +15.443%]
[+13.514% +14.260% +15.206%]

generated-noncompressed-4k-idat/8x8.png:
[-1.1170% -0.8610% -0.6001%]
[-3.3542% -3.2026% -3.0460%]
[-6.8021% -6.6249% -6.4542%]

generated-noncompressed-4k-idat/128x128.png:
[-62.153% -62.038% -61.913%]
[-60.042% -59.859% -59.711%]
[-61.204% -61.130% -61.037%]

generated-noncompressed-4k-idat/2048x2048.png:
[-67.161% -67.121% -67.082%]
[-67.566% -67.470% -67.370%]
[-67.706% -67.663% -67.614%]

generated-noncompressed-4k-idat/12288x12288.png:
[-63.990% -63.956% -63.921%]
[-63.580% -63.504% -63.427%]
[-63.302% -63.266% -63.234%]

generated-noncompressed-64k-idat/128x128.png:
[+28.550% +28.972% +29.330%]
[+45.749% +46.366% +46.873%]
[-13.531% -13.292% -13.085%]

generated-noncompressed-64k-idat/2048x2048.png:
[-13.441% -12.212% -11.403%]
[-22.578% -22.234% -21.879%]
[-20.424% -20.222% -19.999%

generated-noncompressed-64k-idat/12288x12288.png:
[-45.895% -45.826% -45.749%]
[-41.836% -41.724% -41.617%]
[-41.325% -41.252% -41.165%]

generated-noncompressed-2g-idat/2048x2048.png:
[-40.284% -39.676% -39.277%]
[-38.249% -37.947% -37.701%]
[-42.957% -42.806% -42.651%]

generated-noncompressed-2g-idat/12288x12288.png:
[-38.550% -38.469% -38.388%]
[-37.827% -37.691% -37.509%]
[-37.335% -37.269% -37.207%]

fintelia reviewed Jan 5, 2024

View reviewed changes

anforowicz force-pushed the no-32kB-delay-2nd-attempt branch from 7228d12 to cb0b595 Compare January 6, 2024 00:24

fintelia merged commit 1636b55 into image-rs:master Jan 6, 2024
19 checks passed

anforowicz deleted the no-32kB-delay-2nd-attempt branch January 8, 2024 20:58

anforowicz mentioned this pull request Jan 22, 2024

Use a single Vec between Decoder and ZlibStream #452

Draft

fintelia mentioned this pull request Jan 27, 2024

Infinite loop on decoding a crafted image #446

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid 32kB decompression lag + compact less often. #447

Avoid 32kB decompression lag + compact less often. #447

anforowicz commented Jan 5, 2024 •

edited

anforowicz commented Jan 5, 2024

fintelia Jan 5, 2024

anforowicz Jan 6, 2024 •

edited

fintelia commented Jan 5, 2024

anforowicz commented Jan 6, 2024

anforowicz commented Jan 8, 2024

Avoid 32kB decompression lag + compact less often. #447

Avoid 32kB decompression lag + compact less often. #447

Conversation

anforowicz commented Jan 5, 2024 • edited

Avoiding 32kB decompression lag

Compacting less often

Performance impact

anforowicz commented Jan 5, 2024

fintelia Jan 5, 2024

Choose a reason for hiding this comment

anforowicz Jan 6, 2024 • edited

Choose a reason for hiding this comment

fintelia commented Jan 5, 2024

anforowicz commented Jan 6, 2024

anforowicz commented Jan 8, 2024

anforowicz commented Jan 5, 2024 •

edited

anforowicz Jan 6, 2024 •

edited