Rewrite deflate memory allocation. #1713

Dead2 · 2024-04-23T10:12:53Z

This PR currently lacks s390x support entirely. @iii-i is working on updating that support.

This PR moves all allocations to come from a single buffer allocated during init, this has many positive sides such as being faster overall, and in the case of inflate; ensuring that as much as possible of the startup cost is done during init, leading to more predictable behavior and performance of inflate itself.
I have also added debug logging to the allocation functions, as it is now simple and self-contained, although it could likely use some improvements to be more portable.
This should also fix #1708 and similar errors, although that is not the motivation for these changes.

Deflate

Deflate used to call allocate 5 times during init.

5 calls to external alloc function now becomes 1
Handling alignment of allocated buffers is simplified
- Efforts to align the allocated buffer now needs to happen only once.
- Individual buffers are ordered so that they have natural sequential alignment.
Due to reduced losses to alignment, we allocate less memory in total.
While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running init function.
Removed need for extra padding after window, chunked reads can now go beyond the window
buffer without causing a segfault.

Inflate

Inflate used to allocate state during init, but window would be allocated
when/if needed and could be resized and that required a new free/alloc round.

Now, we allocate state and a 32K window during init, allowing the latency cost
of allocs to be done during init instead of at one or more times later.
Total memory allocation is about the same when requesting a 32K window, but
if now window or a smaller window was requested, then it is an increase.
While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running init function.
After init has succeeded, inflate will no longer possibly fail due to a failing malloc.

Performance

Preliminary benchmarks show that compress() of 1 byte is ~9% faster.
Compress() of 32KiB is 3.6% faster.
Bigger files show less improvements due to most of the savings happening during deflate init().
Benchmark used is in PR #1721

deflate.c

+/* ===========================================================================
+ * Free all allocated deflate buffers
+ */
+static inline void free_deflate(PREFIX3(stream) *strm) {


zbuild.h

Dead2 · 2024-04-23T17:35:08Z

@iii-i Any chance you have the time for commenting on the feasibility for porting the DFLTCC code to using this allocator and removing most of the custom allocation code? Any obvious problems that need to be caught early?

I have looked at the code, and I think it is doable, but I don't really know that part of the code at all, and it'd probably require trial-and-error to implement this I think. If you want to contribute a patch, that'd be awesome too. Worst case (I really want to avoid doing that though) is doing a point release without s390x DFLTCC support while it gets brought up to speed.

inflate.c

deflate.c

nmoinvaz · 2024-04-23T19:46:26Z

test/benchmarks/benchmark_compress.cc

+    } \
+    BENCHMARK_REGISTER_F(compress, name)->Arg(1)->Arg(8)->Arg(16)->Arg(32)->Arg(64)->Arg(512)->Arg(4<<10)->Arg(32<<10);
+
+BENCHMARK_COMPRESS(compress);


Usually benchmark would test multiple versions of similar functions, but here there is only one. What is the purpose of this file?

Before this PR:

compress/compress/1 6157 ns 6147 ns 105599 compress/compress/8 6666 ns 6654 ns 100625 compress/compress/16 6873 ns 6862 ns 99180 compress/compress/32 7663 ns 7650 ns 89458 compress/compress/64 8265 ns 8251 ns 83929 compress/compress/512 8208 ns 8194 ns 82918 compress/compress/4096 9570 ns 9553 ns 72661 compress/compress/32768 19308 ns 19272 ns 36328

With this PR:

compress/compress/1 5618 ns 5610 ns 113507 compress/compress/8 5984 ns 5974 ns 109259 compress/compress/16 6097 ns 6087 ns 111210 compress/compress/32 6828 ns 6817 ns 100269 compress/compress/64 7435 ns 7424 ns 91821 compress/compress/512 7417 ns 7405 ns 91981 compress/compress/4096 8785 ns 8771 ns 79245 compress/compress/32768 18541 ns 18510 ns 37689

PS: Neither of these are up to date, but they do illustrate the point of using them for comparisons across different commits.

Benchmarking this PR with deflatebench is pretty useless, it has too much overhead to reliably distinguish speed differences for compressing/decompressing such small files. It can be seen on the total runtime over hundreds of runs on small files.
This is a much better benchmark for that kind of thing since it runs the benchmark a lot closer to the library itself and without all the forking and external monitoring and such.

iii-i · 2024-04-30T00:58:10Z

Hi, I think there are no huge compatibility problems for DFLTCC in this PR. I currently have a draft commit that adapts the code and survives the regtest: https://github.com/iii-i/zlib-ng/releases/tag/single-malloc-dfltcc-20240430.

Dead2 · 2024-04-30T11:20:52Z

Hi, I think there are no huge compatibility problems for DFLTCC in this PR. I currently have a draft commit that adapts the code and survives the regtest: iii-i/zlib-ng@single-malloc-dfltcc-20240430 (release).

That is great. Maybe you should do a PR for that based on top of this.
When looking through it, I notice it looks like a pretty substantial cleanup 👍.
I only noticed one thing I'd prefer changed; in deflate.h, move arch_deflate_state arch; ~10 lines up, to right below deflate_allocs *alloc_bufs;. That group has a natural 16-byte alignment, it should not have any negative effect whether it is there or not, and it avoids having it dangling at the end after the padding bytes 😄.
There might also be a potential for getting rid of dfltcc_common.c, as ZCOPY_WINDOW is only used from inflate.

iii-i · 2024-05-02T16:53:33Z

Will do. Unfortunately I found some issues which I don't fully understand yet; but now I have at least a couple bite-sized commits that make sense on their own, which I will post as separate PRs (here is the first one: #1717).

nmoinvaz · 2024-05-06T15:06:43Z

Interesting removal of ZCOPY_DEFLATE_STATE and the one for inflate also. Is it no longer necessary? I wonder what kind of optimization DFLTCC was doing.

Configure script changes LGTM.

Dead2 · 2024-05-06T17:24:50Z

Interesting removal of ZCOPY_DEFLATE_STATE and the one for inflate also. Is it no longer necessary? I wonder what kind of optimization DFLTCC was doing.

I was initially uncertain whether to keep those around or not, since I was unsure what DFLTCC needed to do. But when @iii-i posted his initial patch that also removed those, I figured I might as well do so in this patch where it fits with the other changes in the same areas, and then the s390x changes will be more self-contained and easier to review as well.

Dead2 · 2024-05-06T21:09:19Z

Rebased

iii-i · 2024-05-06T22:37:59Z

ZCOPY_DEFLATE_STATE is not an optimization - DFLTCC needs to maintain state of its own, it is allocated alongside deflate/inflate state, and needs to be copied on deflateCopy() / inflateCopy(). In order to make things as unintrusive as possible, in my zlib patch I chose not to modify the structs, but rather to introduce hooks and do pointer arithmetics there. This rewrite is an opportunity to simplify things (#1718) at the cost of making DFLTCC support a little bit more intrusive, which is hopefully okay.

iii-i · 2024-05-06T22:40:06Z

P.S. I wonder in which order this and #1718 should go in? I'd prefer to land #1718 first, since it's more local and as far as I'm concerned is ready.

phprus · 2024-05-18T09:32:53Z

Fix for arch/s390/dfltcc_detail.h:222:12: error: call to undeclared function 'ZALLOC': #1728
Fix for CI: #1726, #1725

KungFuJesus · 2024-05-18T14:32:04Z

So, putting on some exploit mitigation goggles, I do worry about this one a tiny bit:

While doing alloc(), we now store pointer to corresponding free(), avoiding crashes with applications that incorrectly set alloc/free pointers after running init function.

Doesn't this mean we effectively are giving a user controlled pointer for every allocation? Granted we were already doing that at library initialization time, I'm just wondering if giving this to something possibly controllable by the user, not just the developer, becomes a bit dangerous. I get that this makes us resilient to a pretty brutal crash from mono but...it just seems like it might be easy to craft a gzip or png file that could readily exploit this. Yes, I realize that we are the allocators here and not the user, I'm just wondering if a well crafted input file could lie about remaining lengths such that we end up writing a malicious location behind a buffer. This is now several pointers that are vulnerable, rather than just one in the init structure that is outside of the purview of the user. Am I wrong or overly paranoid here?

Dead2 · 2024-05-18T17:49:16Z

@KungFuJesus I'm not sure this PR really changes that.

The free pointer is set during init() both now and before, but now we copy it to somewhere where the application using zlib can't easily change it after init() (at least without accessing the structs and buffers that have no exported interfaces to make sense of them).

Doesn't this mean we effectively are giving a user controlled pointer for every allocation? Granted we were already doing that at library initialization time, I'm just wondering if giving this to something possibly controllable by the user, not just the developer, becomes a bit dangerous.

How does this become controllable by the user? The developer of the application using zlib needs to set this through the z_stream struct during init(). Who is the user here?

This is now several pointers that are vulnerable, rather than just one in the init structure that is outside of the purview of the user.

I am not sure what you mean by the init structure is outside the purview of the user. The "init structure" or z_stream is passed to zlib for every function call from the application, and contains the pointer both to free, malloc and to the internal state struct. (And the internal state struct also contains a pointer back to z_stream).

We also have other function pointers in structs in various places.
I'm not sure what makes this pointer to free any worse than those.

I fear we'd have to do a lot more invasive changes if we wanted to harden all our pointers from attacks. Possibly we could use mproctect() / VirtualProtect(), but that'd mean moving all function pointers to a separate page-sized+aligned buffer. Also it'd interfere with things like changing deflate levels.

KungFuJesus · 2024-05-18T19:18:26Z

Sorry I realize some of this was poorly articulated. Looking at the code now, I see it's set in the strm struct, not per allocation as I had anticipated based on that summary. We're no worse off for that. Had it been a pointer at the beginning of the allocation it might have been another story (though, I think the inflate code always writes forward in the buffer, never backward).

So as it stands now (after this PR), if the user tries to switch out the zalloc/zfree functions after init (the situation that is causing the grief, of course), we ignore the values and use the matching free that it should be?

Dead2 · 2024-05-18T20:22:17Z

So as it stands now (after this PR), if the user tries to switch out the zalloc/zfree functions after init (the situation that is causing the grief, of course), we ignore the values and use the matching free that it should be?

Yes, assuming of course that the application doesn't set a mismatching free in the first place, but that would be a very clear bug in the application.

Dead2 · 2024-05-19T10:41:59Z

@iii-i I am getting ready to merge this, did you get anywhere with updating the DFLTCC support?

iii-i · 2024-05-21T11:19:03Z

I'm currently rebasing my DFLTCC change and I believe I found a bug with my fuzzer (see #1731 for a reproducer): inflateCopy() reuses the old window pointer. I propose the following fixup to this PR:

--- a/inflate.c
+++ b/inflate.c
@@ -1405,7 +1405,6 @@ int32_t Z_EXPORT PREFIX(inflateCopy)(PREFIX3(stream) *dest, PREFIX3(stream) *sou
     if (alloc_bufs == NULL)
         return Z_MEM_ERROR;
     copy = alloc_bufs->state;
-    copy->window = alloc_bufs->window;
 
     /* copy state */
     memcpy(copy, state, sizeof(struct inflate_state));
@@ -1416,6 +1415,7 @@ int32_t Z_EXPORT PREFIX(inflateCopy)(PREFIX3(stream) *dest, PREFIX3(stream) *sou
     }
     copy->next = copy->codes + (state->next - state->codes);
     copy->alloc_bufs = alloc_bufs;
+    copy->window = alloc_bufs->window;
 
     /* window */
     memcpy(copy->window, state->window, INFLATE_ADJUST_WINDOW_SIZE((size_t)state->wsize));

Edit: the current version is here: https://github.com/iii-i/zlib-ng/releases/tag/single-malloc-dfltcc-20240521. It's based on this PR + fixup + cleanup from @phprus + the reproducer. I think it would be better if you could just squash the DFLTCC commit into this PR, since otherwise bisect will be broken on s390x in the future.

Dead2 · 2024-05-21T14:30:56Z

@iii-i Merged in your changes, and squashed one of my commits into the others to try to make things always be in a working state for each commit. Please have a look in case I made any merging errors.

iii-i

I tested the PR on s390x and everything look good. Going over code, I noticed a couple nits, but otherwise LGTM.

CMakeLists.txt

deflate.c

Deflate used to call allocate 5 times during init. - 5 calls to external alloc function now becomes 1 - Handling alignment of allocated buffers is simplified - Efforts to align the allocated buffer now needs to happen only once. - Individual buffers are ordered so that they have natural sequential alignment. - Due to reduced losses to alignment, we allocate less memory in total. - While doing alloc(), we now store pointer to corresponding free(), avoiding crashes with applications that incorrectly set alloc/free pointers after running init function. - Removed need for extra padding after window, chunked reads can now go beyond the window buffer without causing a segfault. Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>

Inflate used to allocate state during init, but window would be allocated when/if needed and could be resized and that required a new free/alloc round. - Now, we allocate state and a 32K window during init, allowing the latency cost of allocs to be done during init instead of at one or more times later. - Total memory allocation is about the same when requesting a 32K window, but if now window or a smaller window was requested, then it is an increase. - While doing alloc(), we now store pointer to corresponding free(), avoiding crashes with applications that incorrectly set alloc/free pointers after running init function. - After init has succeeded, inflate will no longer possibly fail due to a failing malloc. Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>

… tests. Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>

Dead2 added optimization cleanup Improving maintainability or removing code. labels Apr 23, 2024

github-advanced-security bot found potential problems Apr 23, 2024

View reviewed changes

Dead2 force-pushed the single-malloc branch from e9c7f08 to 719ca7a Compare April 23, 2024 10:20

mtl1979 reviewed Apr 23, 2024

View reviewed changes

zbuild.h Outdated Show resolved Hide resolved

Dead2 force-pushed the single-malloc branch from 719ca7a to ad59c1a Compare April 23, 2024 17:26

nmoinvaz reviewed Apr 23, 2024

View reviewed changes

fontivan mentioned this pull request Apr 30, 2024

Linux x64: segfault by Unity/Mono #1708

Open

Dead2 force-pushed the single-malloc branch 2 times, most recently from e31a7e8 to 6d6120c Compare May 6, 2024 08:42

Dead2 marked this pull request as ready for review May 6, 2024 08:51

Dead2 force-pushed the single-malloc branch from 6d6120c to 026ae43 Compare May 6, 2024 08:58

Dead2 force-pushed the single-malloc branch from 026ae43 to 0719760 Compare May 6, 2024 20:55

Nazfib mentioned this pull request May 11, 2024

Game crashes when parts exploding KSP-RO/RP-1#2359

Open

Dead2 force-pushed the single-malloc branch 3 times, most recently from 95a7cfe to 4f1b133 Compare May 17, 2024 13:34

Prepare DFLTCC changes for new malloc system

549941e

Dead2 force-pushed the single-malloc branch from 4f1b133 to f4372f1 Compare May 21, 2024 14:29

iii-i approved these changes May 21, 2024

View reviewed changes

CMakeLists.txt Outdated Show resolved Hide resolved

deflate.c Show resolved Hide resolved

Dead2 and others added 3 commits May 21, 2024 21:11

Clean up memory allocation functions that are no longer used, and its…

a4b111f

… tests. Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>

Dead2 force-pushed the single-malloc branch from f4372f1 to a4b111f Compare May 21, 2024 19:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite deflate memory allocation. #1713

Rewrite deflate memory allocation. #1713

Dead2 commented Apr 23, 2024 •

edited

Dead2 commented Apr 23, 2024

nmoinvaz Apr 23, 2024

Dead2 Apr 23, 2024

Dead2 Apr 23, 2024

iii-i commented Apr 30, 2024 •

edited

Dead2 commented Apr 30, 2024

iii-i commented May 2, 2024

nmoinvaz commented May 6, 2024

Dead2 commented May 6, 2024

Dead2 commented May 6, 2024

iii-i commented May 6, 2024

iii-i commented May 6, 2024

phprus commented May 18, 2024

KungFuJesus commented May 18, 2024 •

edited

Dead2 commented May 18, 2024

KungFuJesus commented May 18, 2024

Dead2 commented May 18, 2024

Dead2 commented May 19, 2024

iii-i commented May 21, 2024 •

edited

Dead2 commented May 21, 2024

iii-i left a comment

Rewrite deflate memory allocation. #1713

Are you sure you want to change the base?

Rewrite deflate memory allocation. #1713

Conversation

Dead2 commented Apr 23, 2024 • edited

Deflate

Inflate

Performance

Dead2 commented Apr 23, 2024

nmoinvaz Apr 23, 2024

Choose a reason for hiding this comment

Dead2 Apr 23, 2024

Choose a reason for hiding this comment

Dead2 Apr 23, 2024

Choose a reason for hiding this comment

iii-i commented Apr 30, 2024 • edited

Dead2 commented Apr 30, 2024

iii-i commented May 2, 2024

nmoinvaz commented May 6, 2024

Dead2 commented May 6, 2024

Dead2 commented May 6, 2024

iii-i commented May 6, 2024

iii-i commented May 6, 2024

phprus commented May 18, 2024

KungFuJesus commented May 18, 2024 • edited

Dead2 commented May 18, 2024

KungFuJesus commented May 18, 2024

Dead2 commented May 18, 2024

Dead2 commented May 19, 2024

iii-i commented May 21, 2024 • edited

Dead2 commented May 21, 2024

iii-i left a comment

Choose a reason for hiding this comment

Dead2 commented Apr 23, 2024 •

edited

iii-i commented Apr 30, 2024 •

edited

KungFuJesus commented May 18, 2024 •

edited

iii-i commented May 21, 2024 •

edited