Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite deflate memory allocation. #1713

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from
Open

Rewrite deflate memory allocation. #1713

wants to merge 4 commits into from

Conversation

Dead2
Copy link
Member

@Dead2 Dead2 commented Apr 23, 2024

This PR currently lacks s390x support entirely. @iii-i is working on updating that support.

This PR moves all allocations to come from a single buffer allocated during init, this has many positive sides such as being faster overall, and in the case of inflate; ensuring that as much as possible of the startup cost is done during init, leading to more predictable behavior and performance of inflate itself.
I have also added debug logging to the allocation functions, as it is now simple and self-contained, although it could likely use some improvements to be more portable.
This should also fix #1708 and similar errors, although that is not the motivation for these changes.

Deflate

Deflate used to call allocate 5 times during init.

  • 5 calls to external alloc function now becomes 1
  • Handling alignment of allocated buffers is simplified
    • Efforts to align the allocated buffer now needs to happen only once.
    • Individual buffers are ordered so that they have natural sequential alignment.
  • Due to reduced losses to alignment, we allocate less memory in total.
  • While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
    with applications that incorrectly set alloc/free pointers after running init function.
  • Removed need for extra padding after window, chunked reads can now go beyond the window
    buffer without causing a segfault.

Inflate

Inflate used to allocate state during init, but window would be allocated
when/if needed and could be resized and that required a new free/alloc round.

  • Now, we allocate state and a 32K window during init, allowing the latency cost
    of allocs to be done during init instead of at one or more times later.
  • Total memory allocation is about the same when requesting a 32K window, but
    if now window or a smaller window was requested, then it is an increase.
  • While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
    with applications that incorrectly set alloc/free pointers after running init function.
  • After init has succeeded, inflate will no longer possibly fail due to a failing malloc.

Performance

Preliminary benchmarks show that compress() of 1 byte is ~9% faster.
Compress() of 32KiB is 3.6% faster.
Bigger files show less improvements due to most of the savings happening during deflate init().
Benchmark used is in PR #1721

@Dead2 Dead2 added optimization cleanup Improving maintainability or removing code. labels Apr 23, 2024
/* ===========================================================================
* Free all allocated deflate buffers
*/
static inline void free_deflate(PREFIX3(stream) *strm) {

Check notice

Code scanning / CodeQL

Unused static function Note

Static function free_deflate is unreachable
zbuild.h Outdated Show resolved Hide resolved
@Dead2
Copy link
Member Author

Dead2 commented Apr 23, 2024

@iii-i Any chance you have the time for commenting on the feasibility for porting the DFLTCC code to using this allocator and removing most of the custom allocation code? Any obvious problems that need to be caught early?

I have looked at the code, and I think it is doable, but I don't really know that part of the code at all, and it'd probably require trial-and-error to implement this I think. If you want to contribute a patch, that'd be awesome too. Worst case (I really want to avoid doing that though) is doing a point release without s390x DFLTCC support while it gets brought up to speed.

inflate.c Outdated Show resolved Hide resolved
inflate.c Outdated Show resolved Hide resolved
deflate.c Show resolved Hide resolved
} \
BENCHMARK_REGISTER_F(compress, name)->Arg(1)->Arg(8)->Arg(16)->Arg(32)->Arg(64)->Arg(512)->Arg(4<<10)->Arg(32<<10);

BENCHMARK_COMPRESS(compress);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually benchmark would test multiple versions of similar functions, but here there is only one. What is the purpose of this file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR:

compress/compress/1           6157 ns         6147 ns       105599
compress/compress/8           6666 ns         6654 ns       100625
compress/compress/16          6873 ns         6862 ns        99180
compress/compress/32          7663 ns         7650 ns        89458
compress/compress/64          8265 ns         8251 ns        83929
compress/compress/512         8208 ns         8194 ns        82918
compress/compress/4096        9570 ns         9553 ns        72661
compress/compress/32768      19308 ns        19272 ns        36328

With this PR:

compress/compress/1           5618 ns         5610 ns       113507
compress/compress/8           5984 ns         5974 ns       109259
compress/compress/16          6097 ns         6087 ns       111210
compress/compress/32          6828 ns         6817 ns       100269
compress/compress/64          7435 ns         7424 ns        91821
compress/compress/512         7417 ns         7405 ns        91981
compress/compress/4096        8785 ns         8771 ns        79245
compress/compress/32768      18541 ns        18510 ns        37689

PS: Neither of these are up to date, but they do illustrate the point of using them for comparisons across different commits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmarking this PR with deflatebench is pretty useless, it has too much overhead to reliably distinguish speed differences for compressing/decompressing such small files. It can be seen on the total runtime over hundreds of runs on small files.
This is a much better benchmark for that kind of thing since it runs the benchmark a lot closer to the library itself and without all the forking and external monitoring and such.

@iii-i
Copy link
Member

iii-i commented Apr 30, 2024

Hi, I think there are no huge compatibility problems for DFLTCC in this PR. I currently have a draft commit that adapts the code and survives the regtest: https://github.com/iii-i/zlib-ng/releases/tag/single-malloc-dfltcc-20240430.

@Dead2
Copy link
Member Author

Dead2 commented Apr 30, 2024

Hi, I think there are no huge compatibility problems for DFLTCC in this PR. I currently have a draft commit that adapts the code and survives the regtest: iii-i/zlib-ng@single-malloc-dfltcc-20240430 (release).

That is great. Maybe you should do a PR for that based on top of this.
When looking through it, I notice it looks like a pretty substantial cleanup 👍.
I only noticed one thing I'd prefer changed; in deflate.h, move arch_deflate_state arch; ~10 lines up, to right below deflate_allocs *alloc_bufs;. That group has a natural 16-byte alignment, it should not have any negative effect whether it is there or not, and it avoids having it dangling at the end after the padding bytes 😄.
There might also be a potential for getting rid of dfltcc_common.c, as ZCOPY_WINDOW is only used from inflate.

@iii-i
Copy link
Member

iii-i commented May 2, 2024

Will do. Unfortunately I found some issues which I don't fully understand yet; but now I have at least a couple bite-sized commits that make sense on their own, which I will post as separate PRs (here is the first one: #1717).

@Dead2 Dead2 force-pushed the single-malloc branch 2 times, most recently from e31a7e8 to 6d6120c Compare May 6, 2024 08:42
@Dead2 Dead2 marked this pull request as ready for review May 6, 2024 08:51
@nmoinvaz
Copy link
Member

nmoinvaz commented May 6, 2024

Interesting removal of ZCOPY_DEFLATE_STATE and the one for inflate also. Is it no longer necessary? I wonder what kind of optimization DFLTCC was doing.

Configure script changes LGTM.

@Dead2
Copy link
Member Author

Dead2 commented May 6, 2024

Interesting removal of ZCOPY_DEFLATE_STATE and the one for inflate also. Is it no longer necessary? I wonder what kind of optimization DFLTCC was doing.

I was initially uncertain whether to keep those around or not, since I was unsure what DFLTCC needed to do. But when @iii-i posted his initial patch that also removed those, I figured I might as well do so in this patch where it fits with the other changes in the same areas, and then the s390x changes will be more self-contained and easier to review as well.

@Dead2
Copy link
Member Author

Dead2 commented May 6, 2024

Rebased

@iii-i
Copy link
Member

iii-i commented May 6, 2024

ZCOPY_DEFLATE_STATE is not an optimization - DFLTCC needs to maintain state of its own, it is allocated alongside deflate/inflate state, and needs to be copied on deflateCopy() / inflateCopy(). In order to make things as unintrusive as possible, in my zlib patch I chose not to modify the structs, but rather to introduce hooks and do pointer arithmetics there. This rewrite is an opportunity to simplify things (#1718) at the cost of making DFLTCC support a little bit more intrusive, which is hopefully okay.

@iii-i
Copy link
Member

iii-i commented May 6, 2024

P.S. I wonder in which order this and #1718 should go in? I'd prefer to land #1718 first, since it's more local and as far as I'm concerned is ready.

@phprus
Copy link
Contributor

phprus commented May 18, 2024

Fix for arch/s390/dfltcc_detail.h:222:12: error: call to undeclared function 'ZALLOC': #1728
Fix for CI: #1726, #1725

@KungFuJesus
Copy link
Contributor

KungFuJesus commented May 18, 2024

So, putting on some exploit mitigation goggles, I do worry about this one a tiny bit:

While doing alloc(), we now store pointer to corresponding free(), avoiding crashes with applications that incorrectly set alloc/free pointers after running init function.

Doesn't this mean we effectively are giving a user controlled pointer for every allocation? Granted we were already doing that at library initialization time, I'm just wondering if giving this to something possibly controllable by the user, not just the developer, becomes a bit dangerous. I get that this makes us resilient to a pretty brutal crash from mono but...it just seems like it might be easy to craft a gzip or png file that could readily exploit this. Yes, I realize that we are the allocators here and not the user, I'm just wondering if a well crafted input file could lie about remaining lengths such that we end up writing a malicious location behind a buffer. This is now several pointers that are vulnerable, rather than just one in the init structure that is outside of the purview of the user. Am I wrong or overly paranoid here?

@Dead2
Copy link
Member Author

Dead2 commented May 18, 2024

@KungFuJesus I'm not sure this PR really changes that.

The free pointer is set during init() both now and before, but now we copy it to somewhere where the application using zlib can't easily change it after init() (at least without accessing the structs and buffers that have no exported interfaces to make sense of them).

Doesn't this mean we effectively are giving a user controlled pointer for every allocation? Granted we were already doing that at library initialization time, I'm just wondering if giving this to something possibly controllable by the user, not just the developer, becomes a bit dangerous.

How does this become controllable by the user? The developer of the application using zlib needs to set this through the z_stream struct during init(). Who is the user here?

This is now several pointers that are vulnerable, rather than just one in the init structure that is outside of the purview of the user.

I am not sure what you mean by the init structure is outside the purview of the user. The "init structure" or z_stream is passed to zlib for every function call from the application, and contains the pointer both to free, malloc and to the internal state struct. (And the internal state struct also contains a pointer back to z_stream).

We also have other function pointers in structs in various places.
I'm not sure what makes this pointer to free any worse than those.

I fear we'd have to do a lot more invasive changes if we wanted to harden all our pointers from attacks. Possibly we could use mproctect() / VirtualProtect(), but that'd mean moving all function pointers to a separate page-sized+aligned buffer. Also it'd interfere with things like changing deflate levels.

@KungFuJesus
Copy link
Contributor

Sorry I realize some of this was poorly articulated. Looking at the code now, I see it's set in the strm struct, not per allocation as I had anticipated based on that summary. We're no worse off for that. Had it been a pointer at the beginning of the allocation it might have been another story (though, I think the inflate code always writes forward in the buffer, never backward).

So as it stands now (after this PR), if the user tries to switch out the zalloc/zfree functions after init (the situation that is causing the grief, of course), we ignore the values and use the matching free that it should be?

@Dead2
Copy link
Member Author

Dead2 commented May 18, 2024

So as it stands now (after this PR), if the user tries to switch out the zalloc/zfree functions after init (the situation that is causing the grief, of course), we ignore the values and use the matching free that it should be?

Yes, assuming of course that the application doesn't set a mismatching free in the first place, but that would be a very clear bug in the application.

@Dead2
Copy link
Member Author

Dead2 commented May 19, 2024

@iii-i I am getting ready to merge this, did you get anywhere with updating the DFLTCC support?

@iii-i
Copy link
Member

iii-i commented May 21, 2024

I'm currently rebasing my DFLTCC change and I believe I found a bug with my fuzzer (see #1731 for a reproducer): inflateCopy() reuses the old window pointer. I propose the following fixup to this PR:

--- a/inflate.c
+++ b/inflate.c
@@ -1405,7 +1405,6 @@ int32_t Z_EXPORT PREFIX(inflateCopy)(PREFIX3(stream) *dest, PREFIX3(stream) *sou
     if (alloc_bufs == NULL)
         return Z_MEM_ERROR;
     copy = alloc_bufs->state;
-    copy->window = alloc_bufs->window;
 
     /* copy state */
     memcpy(copy, state, sizeof(struct inflate_state));
@@ -1416,6 +1415,7 @@ int32_t Z_EXPORT PREFIX(inflateCopy)(PREFIX3(stream) *dest, PREFIX3(stream) *sou
     }
     copy->next = copy->codes + (state->next - state->codes);
     copy->alloc_bufs = alloc_bufs;
+    copy->window = alloc_bufs->window;
 
     /* window */
     memcpy(copy->window, state->window, INFLATE_ADJUST_WINDOW_SIZE((size_t)state->wsize));

Edit: the current version is here: https://github.com/iii-i/zlib-ng/releases/tag/single-malloc-dfltcc-20240521. It's based on this PR + fixup + cleanup from @phprus + the reproducer. I think it would be better if you could just squash the DFLTCC commit into this PR, since otherwise bisect will be broken on s390x in the future.

@Dead2
Copy link
Member Author

Dead2 commented May 21, 2024

@iii-i Merged in your changes, and squashed one of my commits into the others to try to make things always be in a working state for each commit. Please have a look in case I made any merging errors.

Copy link
Member

@iii-i iii-i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the PR on s390x and everything look good. Going over code, I noticed a couple nits, but otherwise LGTM.

CMakeLists.txt Outdated Show resolved Hide resolved
deflate.c Show resolved Hide resolved
Dead2 and others added 3 commits May 21, 2024 21:11
Deflate used to call allocate 5 times during init.

- 5 calls to external alloc function now becomes 1
- Handling alignment of allocated buffers is simplified
  - Efforts to align the allocated buffer now needs to happen only once.
  - Individual buffers are ordered so that they have natural sequential alignment.
- Due to reduced losses to alignment, we allocate less memory in total.
- While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
  with applications that incorrectly set alloc/free pointers after running init function.
- Removed need for extra padding after window, chunked reads can now go beyond the window
  buffer without causing a segfault.

Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
Inflate used to allocate state during init, but window would be allocated
when/if needed and could be resized and that required a new free/alloc round.

- Now, we allocate state and a 32K window during init, allowing the latency cost
  of allocs to be done during init instead of at one or more times later.
- Total memory allocation is about the same when requesting a 32K window, but
  if now window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
  with applications that incorrectly set alloc/free pointers after running init function.
- After init has succeeded, inflate will no longer possibly fail due to a failing malloc.

Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
… tests.

Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup Improving maintainability or removing code. optimization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Linux x64: segfault by Unity/Mono
6 participants