New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite deflate memory allocation. #1713
base: develop
Are you sure you want to change the base?
Conversation
@iii-i Any chance you have the time for commenting on the feasibility for porting the DFLTCC code to using this allocator and removing most of the custom allocation code? Any obvious problems that need to be caught early? I have looked at the code, and I think it is doable, but I don't really know that part of the code at all, and it'd probably require trial-and-error to implement this I think. If you want to contribute a patch, that'd be awesome too. Worst case (I really want to avoid doing that though) is doing a point release without s390x DFLTCC support while it gets brought up to speed. |
} \ | ||
BENCHMARK_REGISTER_F(compress, name)->Arg(1)->Arg(8)->Arg(16)->Arg(32)->Arg(64)->Arg(512)->Arg(4<<10)->Arg(32<<10); | ||
|
||
BENCHMARK_COMPRESS(compress); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually benchmark would test multiple versions of similar functions, but here there is only one. What is the purpose of this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR:
compress/compress/1 6157 ns 6147 ns 105599
compress/compress/8 6666 ns 6654 ns 100625
compress/compress/16 6873 ns 6862 ns 99180
compress/compress/32 7663 ns 7650 ns 89458
compress/compress/64 8265 ns 8251 ns 83929
compress/compress/512 8208 ns 8194 ns 82918
compress/compress/4096 9570 ns 9553 ns 72661
compress/compress/32768 19308 ns 19272 ns 36328
With this PR:
compress/compress/1 5618 ns 5610 ns 113507
compress/compress/8 5984 ns 5974 ns 109259
compress/compress/16 6097 ns 6087 ns 111210
compress/compress/32 6828 ns 6817 ns 100269
compress/compress/64 7435 ns 7424 ns 91821
compress/compress/512 7417 ns 7405 ns 91981
compress/compress/4096 8785 ns 8771 ns 79245
compress/compress/32768 18541 ns 18510 ns 37689
PS: Neither of these are up to date, but they do illustrate the point of using them for comparisons across different commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmarking this PR with deflatebench is pretty useless, it has too much overhead to reliably distinguish speed differences for compressing/decompressing such small files. It can be seen on the total runtime over hundreds of runs on small files.
This is a much better benchmark for that kind of thing since it runs the benchmark a lot closer to the library itself and without all the forking and external monitoring and such.
Hi, I think there are no huge compatibility problems for DFLTCC in this PR. I currently have a draft commit that adapts the code and survives the regtest: https://github.com/iii-i/zlib-ng/releases/tag/single-malloc-dfltcc-20240430. |
That is great. Maybe you should do a PR for that based on top of this. |
Will do. Unfortunately I found some issues which I don't fully understand yet; but now I have at least a couple bite-sized commits that make sense on their own, which I will post as separate PRs (here is the first one: #1717). |
e31a7e8
to
6d6120c
Compare
Interesting removal of Configure script changes LGTM. |
I was initially uncertain whether to keep those around or not, since I was unsure what DFLTCC needed to do. But when @iii-i posted his initial patch that also removed those, I figured I might as well do so in this patch where it fits with the other changes in the same areas, and then the s390x changes will be more self-contained and easier to review as well. |
Rebased |
|
95a7cfe
to
4f1b133
Compare
So, putting on some exploit mitigation goggles, I do worry about this one a tiny bit:
Doesn't this mean we effectively are giving a user controlled pointer for every allocation? Granted we were already doing that at library initialization time, I'm just wondering if giving this to something possibly controllable by the user, not just the developer, becomes a bit dangerous. I get that this makes us resilient to a pretty brutal crash from mono but...it just seems like it might be easy to craft a gzip or png file that could readily exploit this. Yes, I realize that we are the allocators here and not the user, I'm just wondering if a well crafted input file could lie about remaining lengths such that we end up writing a malicious location behind a buffer. This is now several pointers that are vulnerable, rather than just one in the init structure that is outside of the purview of the user. Am I wrong or overly paranoid here? |
@KungFuJesus I'm not sure this PR really changes that. The free pointer is set during init() both now and before, but now we copy it to somewhere where the application using zlib can't easily change it after init() (at least without accessing the structs and buffers that have no exported interfaces to make sense of them).
How does this become controllable by the user? The developer of the application using zlib needs to set this through the
I am not sure what you mean by the init structure is outside the purview of the user. The "init structure" or We also have other function pointers in structs in various places. I fear we'd have to do a lot more invasive changes if we wanted to harden all our pointers from attacks. Possibly we could use |
Sorry I realize some of this was poorly articulated. Looking at the code now, I see it's set in the strm struct, not per allocation as I had anticipated based on that summary. We're no worse off for that. Had it been a pointer at the beginning of the allocation it might have been another story (though, I think the inflate code always writes forward in the buffer, never backward). So as it stands now (after this PR), if the user tries to switch out the zalloc/zfree functions after init (the situation that is causing the grief, of course), we ignore the values and use the matching free that it should be? |
Yes, assuming of course that the application doesn't set a mismatching |
@iii-i I am getting ready to merge this, did you get anywhere with updating the DFLTCC support? |
I'm currently rebasing my DFLTCC change and I believe I found a bug with my fuzzer (see #1731 for a reproducer):
Edit: the current version is here: https://github.com/iii-i/zlib-ng/releases/tag/single-malloc-dfltcc-20240521. It's based on this PR + fixup + cleanup from @phprus + the reproducer. I think it would be better if you could just squash the DFLTCC commit into this PR, since otherwise bisect will be broken on s390x in the future. |
@iii-i Merged in your changes, and squashed one of my commits into the others to try to make things always be in a working state for each commit. Please have a look in case I made any merging errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the PR on s390x and everything look good. Going over code, I noticed a couple nits, but otherwise LGTM.
Deflate used to call allocate 5 times during init. - 5 calls to external alloc function now becomes 1 - Handling alignment of allocated buffers is simplified - Efforts to align the allocated buffer now needs to happen only once. - Individual buffers are ordered so that they have natural sequential alignment. - Due to reduced losses to alignment, we allocate less memory in total. - While doing alloc(), we now store pointer to corresponding free(), avoiding crashes with applications that incorrectly set alloc/free pointers after running init function. - Removed need for extra padding after window, chunked reads can now go beyond the window buffer without causing a segfault. Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
Inflate used to allocate state during init, but window would be allocated when/if needed and could be resized and that required a new free/alloc round. - Now, we allocate state and a 32K window during init, allowing the latency cost of allocs to be done during init instead of at one or more times later. - Total memory allocation is about the same when requesting a 32K window, but if now window or a smaller window was requested, then it is an increase. - While doing alloc(), we now store pointer to corresponding free(), avoiding crashes with applications that incorrectly set alloc/free pointers after running init function. - After init has succeeded, inflate will no longer possibly fail due to a failing malloc. Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
… tests. Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
This PR currently lacks s390x support entirely. @iii-i is working on updating that support.
This PR moves all allocations to come from a single buffer allocated during init, this has many positive sides such as being faster overall, and in the case of inflate; ensuring that as much as possible of the startup cost is done during init, leading to more predictable behavior and performance of inflate itself.
I have also added debug logging to the allocation functions, as it is now simple and self-contained, although it could likely use some improvements to be more portable.
This should also fix #1708 and similar errors, although that is not the motivation for these changes.
Deflate
Deflate used to call allocate 5 times during init.
with applications that incorrectly set alloc/free pointers after running init function.
buffer without causing a segfault.
Inflate
Inflate used to allocate state during init, but window would be allocated
when/if needed and could be resized and that required a new free/alloc round.
of allocs to be done during init instead of at one or more times later.
if now window or a smaller window was requested, then it is an increase.
with applications that incorrectly set alloc/free pointers after running init function.
Performance
Preliminary benchmarks show that compress() of 1 byte is ~9% faster.
Compress() of 32KiB is 3.6% faster.
Bigger files show less improvements due to most of the savings happening during deflate init().
Benchmark used is in PR #1721