Constructing Zip::File instances with many entries is very expensive #506
Comments
Hi @mttkay, many thanks for this detailed report and your analysis. It's fair to say that rubyzip is pretty naïve when it comes to this sort of thing - and while I think that's partly down to the fact that the zip format itself is pretty naïve - we could definitely do better.

Part of the issue is that there's no index in a zip file - the Central Directory *is* the index - so at some point, if you want to do anything like find a file within the archive, you have to have ingested the whole thing in some form, and even then in some cases you also have to read the local header as well to ensure you have all the data you need. That said, I know we're not as efficient as we could be, and that may have been OK in the Zip32 days...

I've been wondering about how to make these sorts of things quicker - like only loading stuff into ruby objects when we really need them - and this issue has prompted me to think about it further. To answer your questions:
I've been thinking about your million-entry zip... The Central Directory header is 46B and the Local Header is 30B, so that leaves, give or take, 21B per entry for the filename and payload. The filename is repeated in both headers (!) so not a large payload per file...
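Spelling that arithmetic out (assuming the 93MB reported is roughly 93MiB):

```ruby
# Back-of-the-envelope: fixed zip overhead per entry vs. what's left over.
# Note the filename is stored twice: once in the Central Directory file
# header (46B fixed part) and again in the Local File Header (30B fixed part).
total_bytes = 93 * 1024 * 1024        # ~93MiB archive
entries     = 1_000_000
per_entry   = total_bytes / entries   # => 97 bytes per entry
puts per_entry - 46 - 30              # => 21 bytes left for name + payload
```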
Unfortunately I don't have permission to see that script; please can you drop it somewhere else or attach it to this issue? I'd love to be able to test against it as well.
Hah, yep -- the script I used to create that file took several hours to run, too... I made it public, thanks for pointing that out.

As for the solution I suggested, we found a very strange issue during a code review, where the reviewer pointed out that reading the entry count from the EOCD was yielding different values for them on macOS than for me on Linux. It appears to be a platform issue, since I had another co-worker confirm the bogus value read, and they were also on macOS. While it's not directly related to what's discussed here, it could be worth investigating in parallel. I am trying to create a simple, reproducible test case, but I need to rely on my co-workers' help for this since I don't have a Mac myself, and on Linux it returns the correct values. As far as our debugging revealed, it only happens on macOS.

Meanwhile, if you have any ideas as to where there are platform-specific switches in rubyzip that could be causing this, let me know!
Yes, it's adding files one by one, rerunning `zip` each time.

I have the beginnings of a fix for this issue here. It'll probably get merged in later today or tomorrow, but I'm not sure when I'll be able to get a version 3.0 gem cut - there are a lot of changes in 3.0, some of them breaking, so I need to make sure everything is right (and this isn't my day job 😄). I could backport this change to a quick version 2.4 release if that would be helpful to you. Do let me know.

As for the macOS behaviour - I'm not sure what is going on. We have the tests running on macOS in the CI, but I don't have a Mac to test on myself. If you do manage to find a reproducible test case that would be very useful.
This fix is now merged to HEAD. Please give it a go and let me know if a backport would be useful.
Excellent @hainesr! This looks pretty great. If I'm reading this correctly, this also addresses the case where the CD is read entirely during `Zip::File.new`?

As far as a backport goes, we're on a fairly old version of the gem (2.0.0), so I'm not sure how much effort that would be. Plus, we won't be able to use the new functionality in all cases, since sometimes a feature is only interested in the number of files. I'd also want GitLab to update to the latest release of the library anyway, since it's not healthy to linger on older releases. I will create an issue in our tracker for that and watch this space for a new release.

Thanks for all the help and the fast turnaround time!
Hi @mttkay, Thanks!
Yes, we'd need to read the entire CD in to disambiguate between files and directories, unfortunately - which puts us back to square one. I'm also wondering about how to speed up reading the CD in general. Lighter weight objects would help, I'm sure, but also maybe we don't need to process the whole thing at that point for larger entry counts.
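(Concretely: directories are conventionally stored as entries whose names end in `/`, so telling them apart from files means reading every name out of the CD. A sketch using rubyzip's existing API, with a placeholder path:)

```ruby
require 'zip'

# Counting files (as opposed to directories) requires looking at every
# entry name from the Central Directory; there is no cheap summary field.
Zip::File.open('archive.zip') do |zip|
  files = zip.entries.count { |entry| !entry.directory? } # trailing-'/' check
  puts "#{files} files out of #{zip.size} entries"
end
```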
Thanks for clarifying; yes, I think what you suggest is the right way to think about this problem. My background with compression formats and algorithms is spotty, but my understanding of the status quo with zip is:
Several things come to mind:
Just some thoughts -- again, I'm not too familiar with this space, but that's how I would personally go about it.
Thanks @mttkay, I'm finding it useful to "rubber duck" some of this out 😄
I think it's point 3 that makes rubyzip slow for big files, because we try and collect all this data up front. This means a lot of jumping around in the file, which I am sure is horribly slow for huge archives such as your 1,000,000-entry one. I reckon that there are ways to speed this up and load more stuff "just in time" (alongside lighter weight objects), so I will look at this. The only real index available for a Zip archive is the entry name, so I think that's a good start for minimal metadata extraction... I'll think about whether there is anything more useful we could do there.

I have to keep reminding myself that Zip is really a 16bit, floppy disk era file format that was useful enough to survive into the 64bit, TB SSD era 🤣
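To make the "just in time" idea concrete, here is a hypothetical sketch (not rubyzip's API): walk the Central Directory record by record and yield a lightweight struct per entry, so a caller can stop early or filter by name without materializing a full `EntrySet`:

```ruby
# Hypothetical sketch: stream Central Directory records lazily.
# Each CD file header has a fixed 46-byte part (starting "PK\x01\x02")
# followed by variable-length name, extra field and comment.
CDRecord = Struct.new(:name, :compressed_size)

def each_cd_record(io, cd_offset)
  io.seek(cd_offset)
  loop do
    break unless io.read(4) == "PK\x01\x02".b

    fixed = io.read(42)                            # rest of the fixed header
    compressed_size = fixed[16, 4].unpack1('V')    # uint32, little-endian
    name_len, extra_len, comment_len = fixed[24, 6].unpack('v3')
    name = io.read(name_len)
    io.seek(extra_len + comment_len, IO::SEEK_CUR) # skip what we don't need
    yield CDRecord.new(name, compressed_size)
  end
end
```

A count-only caller could then bail out after N records, and a lookup by name would never need to build full `Entry` objects at all.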
Again, thanks for your detailed response -- I'm really enjoying this because I'm learning a lot about ZIP :-)
This is a really interesting point actually -- it reminds me that this isn't even always possible, because in the case I investigated our app wasn't even reading the archive from disk; it was streaming it from an object storage provider (GCP). Unless the entire response is buffered in memory or on disk, there is no random access anyway, since you cannot perform random seeks in a TCP stream as far as I'm aware. Even if you could, it's not something you'd want to do.

One thing I want to say is: often it is simply not possible or reasonable to optimize for both performance and ergonomics. I know Ruby tends to optimize for developer ergonomics, which typically comes with some kind of performance drag, and it makes sense for a library like rubyzip to follow that spirit. Plus, these issues often only rear their heads in the long tail of the distribution, i.e. in 99% of cases it is not actually an issue -- though unfortunately the 1% is often where the important customers and power users reside, who feel the brunt of the issue when it occurs (this makes my job very difficult sometimes!). I think a good compromise is therefore often not to optimize away interface ergonomics for the sake of performance, but rather to put sensible constraints on how or when the interface is used. In our case I think a reasonable solution is to simply not count files if the EOCD entry count suggests that we'd have to traverse a lot of entries to determine the exact number. (This is also what I have suggested my co-worker do for now.)

I guess what I'm trying to say is: you're the best person to judge the effort involved in providing a solution that would be both fast and precise, but I think saying it's not worth the effort can be a good answer too. The fix you provided is already useful, because it allows us to at least get an idea of the entry count and make follow-up decisions.
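(On the streaming point: a forward-only read is exactly what `Zip::InputStream` offers -- it never seeks back to the EOCD, which is also why it can't know the entry count up front. A sketch, with a `StringIO` standing in for a streamed response body:)

```ruby
require 'zip'
require 'stringio'

# Sequential, forward-only read over any IO-like object -- here a StringIO
# stands in for a streamed HTTP response body from object storage.
io = StringIO.new(File.binread('archive.zip'))
Zip::InputStream.open(io) do |zis|
  while (entry = zis.get_next_entry)
    puts "#{entry.name} (#{entry.size} bytes)"
  end
end
```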
This also means that we no longer need to keep a copy of the original set of `Entry`s or the central directory comment to test for changes. For situations where a zip file has a lot of entries (e.g. rubyzip#506) this means we save a lot of memory, and a lot of time constructing the zip file in memory.
Apologies for not replying for so long @mttkay. I have made a little progress on some of the performance/memory issues: for some reason we were copying the entire set of `Entry`s just to test for changes (see the commit referenced above), which we now avoid.

Also, I want to thank you for your counsel about the balance between performance and developer ergonomics. I agree it's a very important consideration and I will strive to preserve the ergonomics - it's why ruby is such a great language to work with, after all. I think there's loads of stuff I can still do to improve performance without impacting ergonomics - just need a bit of time to work through it all.
This sounds great, thanks! Appreciate your work 🙇
At GitLab, we found a performance regression where engineers were counting zip file entries via `Zip::File` by iterating entries. We found that the performance issue is not in the `Enumerable` itself, but rather in `Zip::File.new`: this will read the entire Central Directory into memory. We tested this against a zip file that contained a million entries, which compressed to 93MB on disk, and found that on a reasonably modern setup, merely instantiating that class took 18 seconds and consumed 1.2 GB of memory.

CPU time:

```
$ time ./zipbench.rb 1000000
./zipbench.rb  17.15s user 0.81s system 100% cpu 17.960 total
```
Memory use (RSS):
The zip file was created with this shell script: https://gitlab.com/gitlab-org/gitlab/-/snippets/2206686
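(The gist of it in Ruby, purely for illustration -- the linked shell script is what was actually used, and note that creating an archive this way runs into the very `EntrySet` overhead described below:)

```ruby
require 'zip'

# Illustrative only: build an archive with a huge number of tiny entries.
Zip::File.open('huge.zip', Zip::File::CREATE) do |zip|
  1_000_000.times do |i|
    zip.get_output_stream("file-#{i}.txt") { |os| os.write('x') }
  end
end
```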
The benchmark was a simple script that called `File.new(path, create = false, buffer = true)`: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/73391#note_731984916

We use rubyzip 2.0.x, but I found the issue to affect the latest release as well, 2.3.2 at the time of this writing.
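(In essence, the benchmark does no more than this -- a sketch with simplified argument handling; the real script is behind the link above:)

```ruby
#!/usr/bin/env ruby
# Sketch of the benchmark: construct a Zip::File and nothing else.
require 'zip'

path = ARGV.fetch(0, 'huge.zip')
Zip::File.new(path) # create = false; reads the whole Central Directory
```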
The important take-away here is that it was not the iteration that consumed so much memory; it is the fact that a very large `EntrySet` is assembled on the Ruby heap. This means that even methods such as `Zip::File#size` are vulnerable to this, since it is an instance method that depends on the set size in memory. Its CPU and memory complexity is O(N) based on the number of zip entries.

We also found that `Zip::InputStream` is not a good way to determine entry count: while it is memory efficient, it iterates every local entry, which means it has linear complexity with regards to the number of entries.

I think the easiest way to improve this is to provide an API that reads the zip entry count directly from the EOCD instead of loading the CD entirely or iterating entries directly. Something like:
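(A sketch of the idea -- the method name and shape are illustrative, and a real implementation would need to handle Zip64 archives, where the 16-bit count field is saturated:)

```ruby
require 'zip'

# Illustrative only: read the total entry count straight out of the End of
# Central Directory (EOCD) record, without parsing the Central Directory.
def count_zip_entries(path)
  File.open(path, 'rb') do |io|
    # The EOCD is 22 bytes plus an optional comment of up to 64KiB - 1,
    # so scan the file tail backwards for its signature "PK\x05\x06".
    tail_size = [io.size, 22 + 65_535].min
    io.seek(-tail_size, IO::SEEK_END)
    tail = io.read(tail_size)
    eocd = tail.rindex("PK\x05\x06".b)
    raise Zip::Error, 'EOCD record not found' unless eocd

    # Offset 10 within the EOCD holds the total entry count (uint16, LE).
    # 0xFFFF is a sentinel meaning "see the Zip64 EOCD record", which holds
    # the real 64-bit count; handling that is omitted from this sketch.
    tail[eocd + 10, 2].unpack1('v')
  end
end
```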
My questions are: