
Constructing Zip::File instances with many entries is very expensive #506

Open
mttkay opened this issue Nov 16, 2021 · 11 comments

@mttkay

mttkay commented Nov 16, 2021

At GitLab, we found a performance regression where engineers were counting zip file entries via Zip::File by iterating its entries. We found that the performance issue is not in the Enumerable itself, but rather in Zip::File.new: it reads the entire Central Directory into memory. We tested this against a zip file containing a million entries, which compressed to about 92 MB on disk, and found that on a reasonably modern setup, merely instantiating Zip::File took 18 seconds and consumed 1.2 GB of memory:

$ ll all.zip 
92M -rw-rw-r-- 1 mk 92M Nov 12 18:51 all.zip
$ unzip -lq all.zip | grep -c '.dat'        
1000000

CPU time:

$ time ./zipbench.rb
1000000
./zipbench.rb  17.15s user 0.81s system 100% cpu 17.960 total

Memory use (RSS):

pidstat -r -p 2325497 1
Linux 5.13.0-7620-generic (carbon) 	11/12/2021 	_x86_64_	(16 CPU)

06:55:34 PM   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
06:55:35 PM  1000   2325497   8199.00      0.00  463216  432008   1.33  ruby
06:55:36 PM  1000   2325497  22481.00      0.00  546112  493284   1.52  ruby
06:55:37 PM  1000   2325497  14279.00      0.00  599812  550308   1.69  ruby
06:55:38 PM  1000   2325497  18462.00      0.00  669708  624228   1.92  ruby
06:55:39 PM  1000   2325497  20619.00      0.00  747660  706596   2.18  ruby
06:55:40 PM  1000   2325497  14403.00      0.00  801976  764412   2.35  ruby
06:55:41 PM  1000   2325497   6707.00      0.00  825860  791076   2.44  ruby
06:55:42 PM  1000   2325497  20615.00      0.00  912896  873708   2.69  ruby
06:55:43 PM  1000   2325497  26821.00      0.00 1017272  980892   3.02  ruby
06:55:44 PM  1000   2325497  17979.00      0.00 1087152 1052700   3.24  ruby
06:55:45 PM  1000   2325497  23313.00      0.00 1180868 1146156   3.53  ruby
06:55:46 PM  1000   2325497   8419.00      0.00 1246412 1179684   3.63  ruby

The zip file was created with this shell script: https://gitlab.com/gitlab-org/gitlab/-/snippets/2206686

The benchmark was a simple script that called File.new(path, create = false, buffer = true): https://gitlab.com/gitlab-org/gitlab/-/merge_requests/73391#note_731984916
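A rough reconstruction of that benchmark (the original script is behind the GitLab link; the file name and argument handling here are illustrative):

#!/usr/bin/env ruby
require 'zip'

# Instantiating Zip::File is where all the time and memory go: it reads the
# whole Central Directory into an EntrySet on the Ruby heap.
zip = Zip::File.new(ARGV.fetch(0, 'all.zip'), false, true) # create = false, buffer = true
puts zip.size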

We use rubyzip 2.0.x but I found the issue to affect the latest release as well, 2.3.2 at the time of this writing.

The important take-away here is that it was not the iteration that consumed so much memory; it is the fact that a very large EntrySet is assembled on the Ruby heap. This means that even methods such as Zip::File#size are affected, since they depend on the entry set held in memory, so both CPU and memory cost are O(N) in the number of zip entries.

We also found that Zip::InputStream is not a good way to determine the entry count: while it is memory efficient, it iterates every local entry, so its runtime is linear in the number of entries.
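For reference, entry counting via Zip::InputStream looks roughly like this; memory stays flat because entries are yielded one at a time, but every local entry is still visited:

require 'zip'

count = 0
Zip::InputStream.open('all.zip') do |io|
  # get_next_entry walks the local entries one by one and returns nil at the
  # end, so memory use is constant but runtime is O(N) in the entry count.
  count += 1 while io.get_next_entry
end
puts count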

I think the easiest way to improve this is to provide an API that reads the zip entry count directly from EOCD instead of loading the CD entirely or iterating entries directly. Something like:

class Zip::File
  def self.entry_count(io)
    cdir = Zip::CentralDirectory.new
    cdir.read_32_or_64_e_o_c_d(io)
    cdir.entry_count
  end
end

class Zip::CentralDirectory
  def entry_count
    @size
  end

  private

  def read_32_or_64_e_o_c_d(io)
    buf = start_buf(io)
    if zip64_file?(buf)
      read_64_e_o_c_d(buf)
    else
      read_e_o_c_d(buf)
    end
  end
end
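With something along those lines, counting entries would be a constant-time operation regardless of archive size, e.g.:

File.open('all.zip', 'rb') do |io|
  puts Zip::File.entry_count(io) # reads only the EOCD, not the whole Central Directory
end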

My questions are:

  • Are there existing APIs that I missed which accomplish the same? I am not familiar with rubyzip.
  • If not, does the suggested approach make sense to you? What are the downsides? I did notice that ZIP32 only uses a 2-byte int to store the entry count, so this would never report more than ~65k entries (I did not test whether it's possible to add more than that to a ZIP32 file); see the EOCD sketch after this list.
  • What is a user friendly API to count entries, and how can we provide safer defaults? I also noticed that the CentralDirectory initializer can take an Enumerable from which it will populate an EntrySet. This is not a safe default, because it leads to the same runaway memory issue described above.
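To illustrate that second point, here is the Zip32 EOCD read directly (assuming the archive has no trailing comment, so the record is the file's last 22 bytes); the total-entry count is a single 16-bit field, hence the 65,535 ceiling:

eocd = File.open('all.zip', 'rb') { |f| f.seek(-22, IO::SEEK_END); f.read }
_sig, _disk, _cd_disk, _entries_this_disk, total_entries, _cd_size, _cd_offset, _comment_len =
  eocd.unpack('VvvvvVVv') # 'v' = 16-bit little-endian, 'V' = 32-bit little-endian
puts total_entries        # Zip64 moves the real count into a separate 64-bit record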
@hainesr
Member

hainesr commented Nov 16, 2021

Hi @mttkay, many thanks for this detailed report, and your analysis.

It's fair to say that rubyzip is pretty naïve when it comes to this sort of thing - and while I think that's partly down to the fact that the zip format itself is pretty naïve - we could definitely do better. Part of the issue is that there's no index in a zip file - the Central Directory is the index - so at some point if you want to do anything, like find a file within the archive, you have to have ingested the whole thing in some form, and even then in some cases you also have to read the local header as well to ensure you have all the data you need. That said, I know we're not as efficient as we could be, and that may have been OK in the Zip32 days... I've been wondering about how to make these sorts of things quicker - like only loading stuff into ruby objects when we really need them - and this issue has prompted me to think about it further.

To answer your questions:

  • I don't think you're missing anything!
  • I think your suggested approach is good, and is certainly worth putting in while I think about the above. The Zip64 extensions - which we do support - use an 8B int to store the entry count, which should be enough for anybody, although if you did manage to get that many files into your zip, presumably you'd be worried about the heat-death of the universe before rubyzip had read the Central Directory 🤣
  • I'll think about the safer defaults too. The CentralDirectory initializer isn't really part of the public API. It takes an enumerable from another part of the API when using OutputStream to build a zip, but even then I can think of other ways to provide that functionality. TBH there's loads of things in rubyzip I'd like to do differently, and I'm ticking them off slowly...

I've been thinking about your million-entry zip... The Central Directory header is 46B and the Local Header is 30B, so that leaves, give or take, 21B per entry for the filename and payload. The filename is repeated in both headers (!) so not a large payload per file...
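Spelling out that arithmetic (the header sizes are the fixed parts from the ZIP spec; the archive size is approximate):

archive_bytes = 92 * 1024 * 1024    # ~92 MiB archive, per the 'll' output above
entries       = 1_000_000
cd_header     = 46                  # fixed portion of each Central Directory entry
local_header  = 30                  # fixed portion of each Local File Header
puts archive_bytes / entries - cd_header - local_header
# => ~20 bytes per entry for two copies of the filename plus the payload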

The zip file was created with this shell script: https://gitlab.com/gitlab-org/gitlab/-/snippets/2206686

Unfortunately I don't have permission to see that script; please can you drop it somewhere else or attach it to this issue? I'd love to be able to test against it as well.

@hainesr hainesr self-assigned this Nov 16, 2021
@mttkay
Author

mttkay commented Nov 18, 2021

* if you did manage to get that many files into your zip, presumably you'd be worried about the heat-death of the universe before rubyzip had read the Central Directory

Hah, yep -- the script I used to create that file took several hours to run, too... I made it public, thanks for pointing that out.

As for the solution I suggested, we found a very strange issue during code review, where the reviewer pointed out that reading the entry count from the EOCD was yielding different values for them on macOS than for me on Linux. At least it appeared to be a platform issue, since I had another co-worker confirm the bogus value read, and they were also on macOS.
So it might not be a universal replacement for iterating CD entries and counting them.

While it's not directly related to what's discussed here, it could be worth investigating in parallel. I am trying to create a simple, reproducible test case, but I need to rely on my co-worker's help for this since I don't have a Mac myself, and on Linux it returns the correct values. As far as our debugging revealed, CentralDirectory ended up reading a highly inflated value into @total_number_of_entries_in_cdir_on_this_disk (21k+), when the true entry count for that archive was 6.

Meanwhile, if you have any ideas as to where there are platform specific switches in rubyzip that could be causing this, let me know!

@hainesr
Member

hainesr commented Nov 20, 2021

the script I used to create that file took several hours to run, too...

Yes, it's adding files one by one, rerunning zip each time: the new file has to be inserted into the archive and the central directory needs to be added to, which means re-writing a significant portion of the archive each time. Creating all the files in a directory and adding them in one invocation of zip should be a lot quicker.

I have the beginnings of a fix for this issue here. I've added Zip::File::count_entries which, as you suggest above, just reads the number of entries from the Central Directory.
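Roughly, usage looks like this (taking a file path here; see the linked change for the full signature):

require 'zip'

puts Zip::File.count_entries('all.zip') # reads just the EOCD, so it stays fast even at 1,000,000 entries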

It'll probably get merged in later today or tomorrow, but I'm not sure when I'll be able to get a version 3.0 gem cut - there are a lot of changes in 3.0, some of them breaking, so I need to make sure everything is right (and this isn't my day job 😄). I could backport this change to a quick version 2.4 release if that would be helpful to you. Do let me know.

As for the Mac OS behaviour - I'm not sure what is going on. We have the tests running on Mac OS in the CI, but I don't have a Mac to test on myself. If you do manage to find a reproducible test case that would be very useful.

@hainesr
Member

hainesr commented Nov 20, 2021

This fix is now merged to HEAD. Please give it a go and let me know if a backport would be useful.

@mttkay
Author

mttkay commented Nov 22, 2021

Excellent, @hainesr!

This looks pretty great. If I'm reading this correctly, this also addresses the case where the CD is read entirely during open; I'm looking at 6a516fb.
So open and new should now also be safe to use with files that have many entries, correct?

As far as a backport goes, we're on a fairly old version of the gem (2.0.0), so I'm not sure how much effort that would be. Plus, we won't be able to use the new functionality in all cases since sometimes a feature is only interested in the number of files, and entry_count includes directories, which is a bummer.

I'd also want GitLab to update to the latest release of the library anyway, since it's not healthy to linger on older releases. I will create an issue in our tracker for that and watch this space for a new release.

Thanks for all the help and the fast turnaround time!

@hainesr
Member

hainesr commented Nov 22, 2021

Hi @mttkay, Thanks!

open still goes right on and reads all the entry data from the CD, so it is still slow for large numbers of entries. Sorting that out will take a bit more thought, so I'm mulling that over in the background for a bit. One thing I have thought so far is that the Entry objects are rather heavyweight: they are very much in the classic OO style and have all the code in them to read/write/manipulate them. Moving to lighter Entrys and readers/writers would surely help the high memory use you observed. I'd be interested in your thoughts.

sometimes a feature is only interested in the number of files, and entry_count includes directories

Yes, we'd need to read the entire CD in to disambiguate between files and directories unfortunately - which puts us back to square one. I'm also wondering about how to generally speed up reading the CD. Lighter weight objects would help, I'm sure, but also maybe we don't need to process the whole thing at that point for larger entry counts.

@mttkay
Author

mttkay commented Nov 22, 2021

Thanks for clarifying; yes, I think what you suggest is the right approach to think about this problem. My background with compression formats and algorithms is spotty, but my understanding of the status quo with zip is:

  1. Metadata in EOCD: constant in time and memory use, but maybe "not the data you're looking for", so it is not a universal solution
  2. Metadata in CD: linear time and memory use, but can answer more questions. Slimming down each record may provide a great benefit in memory use.
  3. Local entry: actual entry data + header, linear in time but probably the worst memory use unless only the header is read

Several things that come to mind:

  • What information is in local entry headers vs CD entries that could be useful to extract? (e.g. the example of is it a file or directory)
  • How can we obtain this efficiently? Iteration is the only way to do it, it seems, so the approach that Zip::InputStream takes is good: it frees memory after yielding each entry. This is still slow, but at least memory use is constant. But it currently only iterates local entries. Would it be useful to have a similar iterator that efficiently traverses CD records instead of local entries? (Think: Zip::CentralDirectory::InputStream?) A rough sketch follows this list.
  • Would it be useful to provide an iterator that only yields local entry headers, not the entire entry incl. data?
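To make that CD-iterator idea concrete, here is a rough sketch that walks Central Directory records one at a time and yields only the entry names, without ever building an EntrySet (Zip32 only, archive comment assumed empty, parsing the structures directly rather than via rubyzip internals):

CD_SIG   = 0x02014b50
EOCD_SIG = 0x06054b50

def each_cd_entry_name(path)
  File.open(path, 'rb') do |io|
    io.seek(-22, IO::SEEK_END) # the EOCD is 22 bytes when there is no archive comment
    sig, _disk, _cd_disk, _entries_this_disk, total, _cd_size, cd_offset = io.read(22).unpack('VvvvvVVv')
    raise 'EOCD not found' unless sig == EOCD_SIG

    io.seek(cd_offset)
    total.times do
      fields = io.read(46).unpack('VvvvvvvVVVvvvvvVV') # fixed 46-byte CD header
      raise 'bad Central Directory record' unless fields[0] == CD_SIG

      name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
      yield io.read(name_len)
      io.seek(extra_len + comment_len, IO::SEEK_CUR) # skip the variable-length tail
    end
  end
end

# e.g. count only files, with constant memory:
# files = 0
# each_cd_entry_name('all.zip') { |name| files += 1 unless name.end_with?('/') }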

Just some thoughts -- again, I'm not too familiar with this space but that's how I would personally go about it.

@hainesr hainesr added this to the 3.0 milestone Nov 22, 2021
@hainesr hainesr added the bug label Nov 22, 2021
@hainesr
Member

hainesr commented Nov 22, 2021

Thanks @mttkay, I'm finding it useful to "rubber duck" some of this out 😄

  1. Yes, reading the EOCD is quick and cheap, but it only really tells you: how many entries are in the CD, and where the start of the CD is on disk. If Zip64 is being used you also need to read the Zip64 EOCD - and you do need to read both because the spec allows for only the data that doesn't fit in the EOCD to go in the Zip64 EOCD.
  2. Yes, the only way to navigate the CD is to start at the start and iterate through it. Every entry in the CD will be a different size, thanks to different entry name lengths and different amounts of data in the extra fields, so you have to read every entry in to find the next one. In practice the extra fields are often the same for each file, but you cannot assume this. What the CD entries do give you though is the entry name, the "filetype", the sizes before and after compression, crc and where the entry itself is on disk. If the entry needs Zip64 data then that is stored in the extra fields.
  3. The Local entry header has less metadata in it than the CD entry, but a lot of the extra fields in the CD are just markers with incomplete data that needs to be merged with the extra fields in the local header - this drives me mad, personally. So to be really sure you have absolutely all the metadata for an entry you have to read both the CD metadata and the Local header metadata. This is all made harder if Zip64 metadata is involved as well.

I think it's 3 that makes rubyzip slow for big files because we try and collect all this data up front. This means a lot of jumping around in the file, which I am sure is horribly slow for huge archives, such as your 1,000,000 entry one. I reckon that there are ways to speed this up, and load more stuff "just in time" (alongside lighter weight objects) so I will look at this. The only real index available for a Zip archive is the entry name, so I think that's a good start for minimal metadata extraction...

I'll think about whether there is anything we could do more usefully with an InputStream-type approach to the CD. I guess it would be more of an "inspection" interface for the archive, as opposed to an "extraction" interface?

I have to keep reminding myself that Zip is really a 16bit, floppy disk era file format that was useful enough to survive into the 64bit, TB SSD era 🤣

@mttkay
Author

mttkay commented Nov 23, 2021

Again, thanks for your detailed response -- I'm really enjoying this because I'm learning a lot about ZIP :-)

This means a lot of jumping around in the file, which I am sure is horribly slow for huge archives, such as your 1,000,000 entry one.

This is a really interesting point actually -- it reminds me that this isn't even always possible, because in the case I investigated our app wasn't even reading the archive from disk, it was streaming it from an object storage provider (GCP). Unless the entire response is buffered in memory or on disk, there would be no random access anyway since you cannot perform random seeks in a TCP stream as far as I'm aware. Even if you could, it's not something you'd want to do.

One thing I want to say is: often it is simply not possible or reasonable to optimize for both performance and ergonomics. I know Ruby tends to optimize for developer ergonomics, which typically comes with some kind of performance drag, and it makes sense for a library like rubyzip to follow that spirit. Plus, these issues often only rear their heads in the long tail of the distribution, i.e. in 99% of cases it is not actually an issue, though unfortunately the 1% is often where the important customers and power-users reside who feel the brunt of the issue when it occurs (this makes my job very difficult sometimes!)

I think a good compromise is therefore often not to optimize away interface ergonomics for the sake of performance, but rather to put sensible constraints on how or when the interface is used. In our case I think a reasonable solution is to simply not count files if the EOCD entry count suggests we would have to traverse a lot of entries to determine the exact number. (This is also what I have suggested my co-worker do for now.)
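A sketch of that guard, using the new count_entries as a cheap upper bound (the threshold and helper name are made up for illustration):

MAX_COUNTABLE_ENTRIES = 10_000

def file_count(path)
  # The EOCD count includes directories, but it is cheap, so use it to decide
  # whether a full (O(N)) iteration over the entries is acceptable at all.
  return nil if Zip::File.count_entries(path) > MAX_COUNTABLE_ENTRIES

  Zip::File.open(path) { |zip| zip.entries.count { |entry| entry.file? } }
end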

I guess what I'm trying to say is: you're the best person to judge the effort involved in providing a solution that would be both fast and precise, but I think saying it's not worth the effort can be a good answer too. The fix you provided is already useful because it allows us to at least get an idea of entry quantity and make follow-up decisions.

hainesr added a commit to hainesr/rubyzip that referenced this issue Feb 8, 2022
This also means that we no longer need to keep a copy of the original
set of `Entry`s or the central directory comment to test for changes.

For situations where a zip file has a lot of entries (e.g. rubyzip#506) this
means we save a lot of memory, and a lot of time constructing the zip
file in memory.
@hainesr
Member

hainesr commented Feb 12, 2022

Apologies for not replying for so long @mttkay.

I have made a little progress on some of the performance/memory issues: for some reason we were copying the entire set of Entrys within a File so we could compare and track changes. This might just about make sense when we're talking about archives with low tens of entries, but in the Zip64 world it's a nonsense. So I've implemented a system to mark changed entries as 'dirty' so we don't need to do all that copying anymore. It's not merged yet but should be soon (hainesr@bfd528d). I'll keep this issue open while I keep plugging away at these things.

Also, I want to thank you for your counsel about the balance between performance and developer ergonomics. I agree it's a very important consideration and I will strive to preserve the ergonomics - it's why ruby is such a great language to work with, after all. I think there's loads of stuff I can still do to improve performance without impacting ergonomics - just need a bit of time to work through it all.

@mttkay
Author

mttkay commented Feb 14, 2022

This sounds great, thanks! Appreciate your work 🙇

hainesr added a commit to hainesr/rubyzip that referenced this issue Apr 9, 2022
hainesr added a commit to hainesr/rubyzip that referenced this issue Apr 23, 2022
hainesr added a commit to hainesr/rubyzip that referenced this issue Jun 17, 2022
hainesr added a commit to hainesr/rubyzip that referenced this issue Jun 20, 2022
@hainesr hainesr modified the milestones: 3.0, Future Mar 16, 2024