Performance: Require `BufRead` instead of just `Read` for inputs. #427

anforowicz · 2023-11-15T16:41:35Z

PTAL?

fintelia · 2023-11-27T21:32:13Z

This PR changes the decoder from pulling fixed size 32KB chunks of input data to having the caller push arbitrary sized chunks of data into the decoder. The default for BufReader is 8KB, so that'll likely be a common choice for users who don't have the full image data already loaded into memory. That makes me a bit nervous since it adds a performance parameter out of our control and means there will be edge cases around small buffers that are hard to handle

From testing locally on the QOI benchmark suite, I'm seeing a 1-2% performance regression with a (default sized) BufReader<&[u8]> over a plain &[u8]. I don't expect anyone would actually use that type, but BufReader<File> is probably pretty common and may approximate it.
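To make the "performance parameter out of our control" concern concrete, here is a minimal sketch (not code from this PR) that counts how many chunks a `BufRead` source hands out. `count_chunks` is a hypothetical helper name; only `fill_buf` and `consume` from `std::io::BufRead` are real API.

```rust
use std::io::{self, BufRead};

/// Drains `reader` through `fill_buf`/`consume` and returns how many
/// non-empty chunks it handed out. Hypothetical helper for illustration,
/// not part of the `png` crate.
pub fn count_chunks<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut chunks = 0;
    loop {
        // `fill_buf` exposes whatever the source currently has buffered.
        let len = reader.fill_buf()?.len();
        if len == 0 {
            return Ok(chunks); // An empty buffer signals end of input.
        }
        chunks += 1;
        // Tell the source that the whole chunk was used.
        reader.consume(len);
    }
}
```

With a decoder that accepts `R: BufRead`, the chunk size is decided by whatever the caller passes in: a plain `&[u8]` yields the whole input as one chunk, while `BufReader::with_capacity(n, ...)` yields chunks of at most `n` bytes.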

I wonder if the performance gains you've seen on this were caused by avoiding the 32KB allocation+memset on very small images? If so, we could think about changes to control the size of the buffer based on the image size (likely by switching away from internally using a BufReader, and instead having a standalone buffer for reading IDAT chunks)

anforowicz · 2023-11-27T22:25:10Z

> This PR changes the decoder from pulling fixed size 32KB chunks of input data to having the caller push arbitrary sized chunks of data into the decoder. The default for BufReader is 8KB, so that'll likely be a common choice for users who don't have the full image data already loaded into memory. That makes me a bit nervous since it adds a performance parameter out of our control and means there will be edge cases around small buffers that are hard to handle

I wonder if adding a benchmark that uses BufReader would lower the risk a little bit? For example, we could add one additional benchmark testcase: noncompressed-128x128-with-default-bufreader.png. WDYT?

> From testing locally on the QOI benchmark suite, I'm seeing a 1-2% performance regression with a (default sized) BufReader<&[u8]> over a plain &[u8]. I don't expect anyone would actually use that type, but BufReader<File> is probably pretty common and may approximate it.

I think your experiment supports merging this PR:

- Your experiment shows that an extra BufReader<...> leads to a performance regression (you say that when you compare BufReader<&[u8]> vs &[u8] the version with BufReader is slower by 1-2%).
- Your experiment does not approximate what would happen for BufReader<File> (since there is no BufReader<...> in &[u8]). For such approximation, I think that you can try comparing BufReader<u8> with 32kB of buffer (the old default) vs BufReader<u8> with 8kB of buffer (the new default from std).

> I wonder if the performance gains you've seen on this were caused by avoiding the 32KB allocation+memset on very small images?

I think the data from the commit message suggests the opposite - the performance gains were much more pronounced for bigger images. I speculate that this is because the savings come from avoiding copying all of the image data, rather than coming from the initial, fixed-size data or memset. According to the commit message the savings looked as follows:

noncompressed-8x8.png:
- [-2.2881% -1.4801% -0.4110%] (p = 0.00 < 0.05)
- [-7.5687% -7.2013% -6.8838%] (p = 0.00 < 0.05)
noncompressed-128x128.png:
- [-12.495% -12.132% -11.760%] (p = 0.00 < 0.05)
- [-10.597% -10.230% -9.8399%] (p = 0.00 < 0.05)

fintelia · 2023-11-28T00:25:01Z

Sorry, I should have been more clear. The performance regression I found only happens with this PR but not on the main branch. That was what concerned me.

Running a different experiment, I tried making this patch to the benchmarks:

--- a/benches/decoder.rs
+++ b/benches/decoder.rs
@@ -48,7 +48,7 @@ fn bench_file(c: &mut Criterion, data: Vec<u8>, name: String) {
     group.throughput(Throughput::Bytes(info.buffer_size() as u64));
     group.bench_with_input(name, &data, |b, data| {
         b.iter(|| {
-            let decoder = Decoder::new(data.as_slice());
+            let decoder = Decoder::new(std::io::BufReader::new(data.as_slice()));
             let mut decoder = decoder.read_info().unwrap();
             decoder.next_frame(&mut image).unwrap();
         })

I then ran each benchmark against the main branch and again against this PR:

decode/kodim17.png
                        time:   [+1.6160% +1.8703% +2.1665%] (p = 0.00 < 0.05)
                        thrpt:  [-2.1206% -1.8359% -1.5903%]

decode/kodim07.png
                        time:   [+1.3616% +1.6080% +1.9906%] (p = 0.00 < 0.05)
                        thrpt:  [-1.9517% -1.5826% -1.3433%]

decode/kodim02.png
                        time:   [+2.9415% +3.2626% +3.5050%] (p = 0.00 < 0.05)
                        thrpt:  [-3.3863% -3.1595% -2.8574%]

decode/kodim23.png
                        time:   [-0.3615% +0.5706% +1.3460%] (p = 0.21 > 0.05)
                        thrpt:  [-1.3281% -0.5674% +0.3629%]

decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png
                        time:   [+6.5478% +6.8271% +7.1205%] (p = 0.00 < 0.05)
                        thrpt:  [-6.6472% -6.3908% -6.1454%]

decode/Transparency.png
                        time:   [+1.7512% +2.0166% +2.3704%] (p = 0.00 < 0.05)
                        thrpt:  [-2.3156% -1.9768% -1.7211%]

decode/generated-png:noncompressed-8x8.png
                        time:   [-0.0576% +0.1127% +0.2824%] (p = 0.20 > 0.05)
                        thrpt:  [-0.2816% -0.1125% +0.0576%]

decode/generated-png:noncompressed-128x128.png
                        time:   [+21.104% +21.693% +22.380%] (p = 0.00 < 0.05)
                        thrpt:  [-18.287% -17.826% -17.427%]

I'll also note I'm somewhat biased here. This API change would have some ripple effects for parts of the main image crate's API, probably requiring that crate to either unconditionally wrap PNG input streams in a BufReader or require all users of all decoders to adhere to the BufRead trait bound. If possible, I'd rather find a way to avoid that without sacrificing performance
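The two options mentioned for the `image` crate can be sketched roughly as follows. This is only an illustration under stated assumptions: `decode_buffered` and `decode_any` are hypothetical names, and the "decoder" here just counts input bytes instead of decoding anything.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Stand-in for a decoder that requires buffered input (the `BufRead`
/// bound). It only counts the input bytes; hypothetical, for illustration.
pub fn decode_buffered<R: BufRead>(mut input: R) -> io::Result<usize> {
    let mut total = 0;
    loop {
        let len = input.fill_buf()?.len();
        if len == 0 {
            return Ok(total);
        }
        total += len;
        input.consume(len);
    }
}

/// Option 1: keep accepting any `Read` and unconditionally wrap it in a
/// `BufReader`, which adds a copy even when the input is already in memory.
pub fn decode_any<R: Read>(input: R) -> io::Result<usize> {
    decode_buffered(BufReader::new(input))
}
```

Option 2 is to expose `decode_buffered` directly, which pushes the `BufRead` bound onto every caller but lets in-memory inputs such as `&[u8]` skip the extra buffer.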

anforowicz · 2023-11-28T00:47:29Z

@fintelia - could you please upload your benchmarking setup to a github repo + point out the hashes of the before/after commits? This would help me understand what is being compared in your experiment.

Is it maybe like this?:

Before: origin/master (i.e. f80dfe9) + benches/decoder.rs changes from your comment above (resulting in 2 BufReaders: first 8kB buffering and then 32kB buffering)
After: this PR (i.e. f24d4ff) + benches/decoder.rs changes from your comment above (only 1 BufReader with 8kB buffer)

> This API change would have some ripple effects for parts of the main image crate's API, probably requiring that crate to either unconditionally wrap PNG input streams in a BufReader or require all users of all decoders to adhere to the BufRead trait bound.

Yes, this is true. I think I'd prefer to require all users of image::codecs::png::PngDecoder to adhere to the BufRead trait bound (but not necessarily "all decoders").

I am also biased - the speed of decoding of in-memory PNGs is most important to me, because this is the form of PNG input in Chromium's //ui/gfx and in Chromium's Blink (in fact, if you squint your eyes, then blink::SegmentReader::GetSomeData is a little bit like BufReader::fill_buf, although there is no equivalent of BufReader::consume and blink::PngImageReader has to track "consumption" by managing the position argument / read_offset_ field).
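As a loose illustration of that analogy, the sketch below contrasts a source where the caller tracks its own read offset with one that uses `fill_buf` plus `consume`. `SegmentedSource` and its `get_some_data` method are hypothetical stand-ins written for this example; they are not the Blink API, and the real signatures may differ.

```rust
use std::io::{self, BufRead};

/// Hypothetical segmented source with no equivalent of `consume`:
/// the caller passes a position and has to track it itself.
pub struct SegmentedSource<'a> {
    pub segments: Vec<&'a [u8]>,
}

impl<'a> SegmentedSource<'a> {
    /// Returns the rest of the segment that contains `position`,
    /// or an empty slice once `position` is past the end.
    pub fn get_some_data(&self, position: usize) -> &'a [u8] {
        let mut start = 0;
        for &seg in &self.segments {
            if position < start + seg.len() {
                return &seg[position - start..];
            }
            start += seg.len();
        }
        &[]
    }
}

/// Caller-side offset tracking (the role played by `read_offset_`).
pub fn sum_with_offset(src: &SegmentedSource<'_>) -> u64 {
    let mut read_offset = 0;
    let mut sum = 0u64;
    loop {
        let chunk = src.get_some_data(read_offset);
        if chunk.is_empty() {
            return sum;
        }
        sum += chunk.iter().map(|&b| u64::from(b)).sum::<u64>();
        read_offset += chunk.len(); // Manual "consumption".
    }
}

/// The same loop against `BufRead`, where `consume` tracks the position.
pub fn sum_with_bufread<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut sum = 0u64;
    loop {
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            return Ok(sum);
        }
        sum += chunk.iter().map(|&b| u64::from(b)).sum::<u64>();
        let len = chunk.len();
        reader.consume(len);
    }
}
```

Both loops visit the same bytes; the difference is only in who remembers how far reading has progressed.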

I think that understanding the performance impact of this PR is important, so let's continue the discussion here. At the same time, maybe the 2 PRs at #429 and #428 are less controversial and maybe this is where we can focus the review efforts for now?

Landing those other 2 PRs first may be desirable to:

- Magnify the impact of the BufReader changes, positive or negative, so that it becomes easier to measure and detect.
- Bundle the breaking changes together, as you've suggested elsewhere.

anforowicz · 2023-11-28T23:32:43Z

BTW, one additional argument for not merging this PR yet is that so far Chromium has not landed any code that uses the png crate. Hopefully this will happen soon, but I can understand the desire to prioritize the experience of existing consumers of the png crate (over the hypothetical, future experience of aspirational consumers - like Chromium).

Still, I am curious about the benchmarking results above, and would like to debug and understand them better. (Although as I said before, this seems lower priority than discussing the other 2, less controversial PRs.)

anforowicz · 2024-01-04T16:23:37Z

For full transparency, let me share some recent observations. I have rebased this PR on top of the latest changes, but when rerunning the benchmarks I've realized that for some testcases (e.g. noncompressed, 2048x2048 image, split across 64kB IDAT chunks) I observe a significant regression. The binary size and the number of instructions go down (6550545993 instructions => 6358944886 instructions - 3% improvement), but the stalled backend cycles increase from 5.37% backend cycles idle to 40.61% backend cycles idle (which translates into a regression of the overall runtime). I think that getting rid of the intermediate BufReader means that decompression can consume much bigger chunks of input at a time (whole IDAT - 64kB, instead of the size of the intermediate BufReader - 32kB) and I am guessing that this change is somehow negatively impacting hardware-driven memory prefetching. I don't yet understand what is happening exactly, what actions we can take, and what it means for this PR.

fintelia · 2024-01-04T17:25:47Z

Any chance your CPU has a 64KB L1 cache? The current approach does three copies with size=32KB: input -> BufReader -> out_buffer -> data_stream, which should stay inside the L1 cache.

This PR changes it to two copies, input -> out_buffer -> data_stream, but in the process makes the copy size depend on the underlying image's IDAT sizes. With 64KB IDATs, that means the working set for each copy becomes 128KB.
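The arithmetic behind that estimate can be written down as a small sketch. It assumes a deliberately simple model in which one copy touches its source and its destination once each, and it ignores cache associativity, prefetching, and other data competing for the cache. The function names are made up for this example.

```rust
/// Bytes touched by a single `src -> dst` copy of `chunk_size` bytes,
/// counting both the source and the destination (simplified model).
pub const fn copy_working_set(chunk_size: usize) -> usize {
    2 * chunk_size
}

/// Whether that working set fits in an L1 data cache of `l1_size` bytes.
pub const fn fits_in_l1(chunk_size: usize, l1_size: usize) -> bool {
    copy_working_set(chunk_size) <= l1_size
}
```

Under this model a 32KB copy touches 64KB and just fits a 64KB L1 cache, while a 64KB copy touches 128KB and does not.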

anforowicz · 2024-01-09T17:02:32Z

Let me convert this PR to a "draft", so that it won't get accidentally merged before we understand the performance impact better. There are multiple factors at play, so I think that (instead of continuing the discussion here) I'll try to post something to #416.

This commit makes a breaking API change - it changes the `R: Read` constraint (in `Decoder` and `Reader` structs) to the `R: BufRead` constraint. This helps performance by avoiding copying the input data into an additional, intermediate `BufReader` that used to be stored in the (internal) `ReadDecoder::reader` field (after these changes that field is `R` rather than `BufReader`). In particular, some input types (e.g. when decoding from a `&[u8]`) already implement `BufRead` and for such types it is wasteful to introduce additional buffering via `BufReader`.

The impact of the change is significant, but relatively small - this means that it mostly shows up in `noncompressed...` benchmarks which magnify the cost of code paths that are not related to `fdeflate` nor `unfilter`.

Impact on benchmark runtime looks as follows (run once, and then rerun after compiling before+after with a fresh nightly `rustc`):

* kodim02.png:
  - No change in performance detected (p = 0.08 > 0.05)
  - [+1.3713% +1.7241% +2.0960%] (p = 0.00 < 0.05)
* kodim07.png:
  - [-1.1466% -0.6693% -0.2705%] (p = 0.00 < 0.05)
  - No change in performance detected. (p = 0.35 > 0.05)
* kodim17.png:
  - [-2.3062% -1.2878% +0.1746%] (p = 0.05 < 0.05)
  - [-2.7355% -1.9939% -0.7986%] (p = 0.00 < 0.05)
* kodim23.png:
  - No change in performance detected. (p = 0.51 > 0.05)
  - [-1.4834% -1.0648% -0.6692%] (p = 0.00 < 0.05)
* Lohengrin...png:
  - [-2.0606% -1.7935% -1.4756%] (p = 0.00 < 0.05)
  - [-4.2412% -3.6723% -3.0327%] (p = 0.00 < 0.05)
* Transparency.png:
  - [+1.4991% +1.8812% +2.3429%] (p = 0.00 < 0.05)
  - [-0.7939% -0.5746% -0.3590%] (p = 0.00 < 0.05)
* noncompressed-8x8.png:
  - [-2.2881% -1.4801% -0.4110%] (p = 0.00 < 0.05)
  - [-7.5687% -7.2013% -6.8838%] (p = 0.00 < 0.05)
* noncompressed-128x128.png:
  - [-12.495% -12.132% -11.760%] (p = 0.00 < 0.05)
  - [-10.597% -10.230% -9.8399%] (p = 0.00 < 0.05)
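The claim that `&[u8]` already implements `BufRead` without extra buffering can be checked with a short sketch using only the standard library. The two helper names are hypothetical; the comparison relies on `fill_buf` returning a slice whose pointer can be compared with the caller's data.

```rust
use std::io::{BufRead, BufReader};

/// Returns true if `fill_buf` on a plain `&[u8]` hands back the caller's
/// own bytes, i.e. no copy into an intermediate buffer took place.
pub fn slice_fill_buf_is_zero_copy(data: &[u8]) -> bool {
    let mut reader: &[u8] = data;
    match reader.fill_buf() {
        Ok(buf) => buf.as_ptr() == data.as_ptr() && buf.len() == data.len(),
        Err(_) => false,
    }
}

/// Returns true if wrapping the same bytes in a `BufReader` makes
/// `fill_buf` return memory other than the caller's data (a copy).
pub fn bufreader_fill_buf_copies(data: &[u8]) -> bool {
    let mut reader = BufReader::new(data);
    match reader.fill_buf() {
        Ok(buf) => buf.as_ptr() != data.as_ptr(),
        Err(_) => false,
    }
}
```

For a non-empty input, the first function is expected to report that the slice is handed out in place, and the second that `BufReader` serves the bytes from its own internal buffer.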

Similarly to image-rs@1636b55 this commit tries to ensure that the working set fits into the L1 cache. Before this commit, the whole `ZlibStream::out_buffer` could be filled out and this buffer is potentially bigger than the typical 32kB of the L1 cache. After this commit, `MAX_INCREMENTAL_DECOMPRESSION_SIZE` limits how many bytes can be written to `ZlibStream::out_buffer` in a single call to `fdeflate::Decompressor::read`.
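The general idea of capping how much output a single decompression call may produce can be sketched as below. This is not the crate's implementation: the constant's value here is an assumption chosen for the example, and `fake_decompress` is a stand-in that simply copies bytes rather than calling `fdeflate`.

```rust
/// Assumed value for illustration only; the real constant's value is not
/// stated in this thread.
pub const MAX_INCREMENTAL_DECOMPRESSION_SIZE: usize = 16 * 1024;

/// Stand-in for a decompressor call: copies as much input as fits into
/// `output` and returns the number of bytes written.
fn fake_decompress(input: &[u8], output: &mut [u8]) -> usize {
    let n = input.len().min(output.len());
    output[..n].copy_from_slice(&input[..n]);
    n
}

/// Appends all of `input` to `out_buffer`, exposing at most
/// `MAX_INCREMENTAL_DECOMPRESSION_SIZE` writable bytes per call.
/// Returns the size of each individual write.
pub fn decompress_incrementally(mut input: &[u8], out_buffer: &mut Vec<u8>) -> Vec<usize> {
    let mut writes = Vec::new();
    while !input.is_empty() {
        let start = out_buffer.len();
        // Only hand the decompressor a capped window of the output buffer.
        out_buffer.resize(start + MAX_INCREMENTAL_DECOMPRESSION_SIZE, 0);
        let written = fake_decompress(input, &mut out_buffer[start..]);
        // Drop the part of the window that was not filled.
        out_buffer.truncate(start + written);
        input = &input[written..];
        writes.push(written);
    }
    writes
}
```

Because each call sees only a bounded window, the bytes touched per step stay bounded regardless of how large the overall output buffer is.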

After this commit the size of the initial allocation of `ZlibStream::out_buffer` should be big enough to avoid having to grow the capacity of the buffer backing this vector of bytes. Removing this overhead is expected to have a positive impact on runtime performance.
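The effect of a sufficiently large initial allocation can be illustrated with a plain `Vec<u8>`. The helper below is hypothetical and only observes whether the vector's capacity changes while it is being filled; it does not model `ZlibStream` itself.

```rust
/// Appends `total` bytes in `chunk`-sized steps to a vector created with
/// `initial_capacity`, and reports whether its capacity ever had to grow.
pub fn grows_while_filling(initial_capacity: usize, total: usize, chunk: usize) -> bool {
    assert!(chunk > 0, "chunk size must be positive");
    let mut buf: Vec<u8> = Vec::with_capacity(initial_capacity);
    let cap = buf.capacity();
    let mut grew = false;
    let mut written = 0;
    while written < total {
        let n = chunk.min(total - written);
        buf.extend(std::iter::repeat(0u8).take(n));
        written += n;
        // A capacity change means the backing storage was reallocated.
        grew |= buf.capacity() != cap;
    }
    grew
}
```

When the initial capacity covers the total, no reallocation happens during filling; starting from a much smaller capacity forces the vector to grow along the way.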

This was referenced Nov 15, 2023

Remove Reader::scrach_buffer field + resulting breaking API changes. #421

Open

Cap buffer sizes via ZlibStream::set_max_total_output. #429

Merged

anforowicz force-pushed the bufread-for-performance branch from f24d4ff to e2eb3f8 Compare December 17, 2023 17:36

anforowicz force-pushed the bufread-for-performance branch from e2eb3f8 to c67d90e Compare January 3, 2024 23:15

anforowicz mentioned this pull request Jan 6, 2024

Avoid 32kB decompression lag + compact less often. #447

Merged

anforowicz marked this pull request as draft January 9, 2024 17:00

anforowicz added 3 commits January 10, 2024 20:23

anforowicz force-pushed the bufread-for-performance branch from c67d90e to b2495a3 Compare January 10, 2024 20:23

fintelia mentioned this pull request Jan 20, 2024

Bypass state machine and require BufRead #458

Draft

fintelia changed the base branch from master to next January 20, 2024 22:42

fintelia mentioned this pull request Feb 19, 2024

Add BufRead + Seek bound on many decoders image-rs/image#2149

Merged

fintelia mentioned this pull request Feb 26, 2024

Decoding out of image::io::Reader directly into user-controlled buffer image-rs/image#2162

Closed
