Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive with 6 entries only yields 1 #510

Closed
mttkay opened this issue Nov 22, 2021 · 10 comments
Closed

Archive with 6 entries only yields 1 #510

mttkay opened this issue Nov 22, 2021 · 10 comments

Comments

@mttkay
Copy link

mttkay commented Nov 22, 2021

I'm puzzled by this behavior and can't make sense of it: we have an archive with 6 files that was created with the Archive tool on macOS:

unzip -l Archive.zip 
Archive:  Archive.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
  4496468  2021-11-17 11:56   screencapture-about-gitlab-company-2021-11-17-11_56_36.png
      212  2021-11-17 11:56   __MACOSX/._screencapture-about-gitlab-company-2021-11-17-11_56_36.png
  4784368  2021-11-17 11:56   screencapture-about-gitlab-handbook-2021-11-17-11_56_06.pdf
      359  2021-11-17 11:56   __MACOSX/._screencapture-about-gitlab-handbook-2021-11-17-11_56_06.pdf
  2955106  2021-11-17 11:56   screencapture-about-gitlab-handbook-2021-11-17-11_56_06.png
      212  2021-11-17 11:56   __MACOSX/._screencapture-about-gitlab-handbook-2021-11-17-11_56_06.png
---------                     -------
 12236725                     6 files

It was just put together for testing. While working on #506, we found that the EOCD entry count for this archive is completely off, by several orders of magnitude (the field total_number_of_entries_in_cdir_on_this_disk), but only when running rubyzip over it on macOS (the count is correct on Linux).

Moreover, when iterating this archive via Zip::InputStream, it only ever yields a single entry, and this happens on Linux, too (maybe these problems are unrelated):

#!/usr/bin/env ruby

require 'rubygems'
require 'zip'

file = $ARGV[0]

file_count = 0

File.open(file) do |f|
  Zip::InputStream.open(f) do |zio|
    while zio.get_next_entry
      file_count += 1
    end
  end
end

pp file_count

This prints 1.

When inspecting which entry it picked up:

#<Zip::Entry:0x0000557b571f2b88
 @comment="",
 @compressed_size=0,
 @compression_method=8,
 @crc=0,
 @dirty=false,
 @external_file_attributes=0,
 @extra=
  {"UniversalTime"=>
    #<Zip::ExtraField::UniversalTime:0x0000557b571f1260
     @atime=2021-11-17 11:56:46 +0100,
     @ctime=2021-11-17 11:56:45 +0100,
     @flag=7,
     @mtime=2021-11-17 11:56:44 +0100>,
   "Unknown"=>"ux\v\x00\x01\x04\xF5\x01\x00\x00\x04\x14\x00\x00\x00"},
 @extra_length=32,
 @filepath=nil,
 @follow_symlinks=false,
 @fstype=0,
 @ftype=:file,
 @gp_flags=8,
 @header_signature=67324752,
 @internal_file_attributes=1,
 @last_mod_date=21361,
 @last_mod_time=24342,
 @local_header_offset=0,
 @local_header_size=120,
 @name="screencapture-about-gitlab-company-2021-11-17-11_56_36.png",
 @name_length=58,
 @restore_ownership=false,
 @restore_permissions=false,
 @restore_times=false,
 @size=4496468,
 @time=2021-11-17 11:56:44 +0100,
 @unix_gid=nil,
 @unix_perms=nil,
 @unix_uid=nil,
 @version=20,
 @version_needed_to_extract=20,
 @zipfile=#<File:Archive.zip>>

Why is only a single entry being yielded and all other files are ignored? Does anything stand out about this archive? Here is what zipinfo -v has to say about it: https://gist.github.com/mttkay/9732ad4bb2d204716b15e04ecf880378

@mttkay
Copy link
Author

mttkay commented Nov 22, 2021

I noticed that zipinfo lists the single entry that is produced first. It looks as if InputStream isn't looking past the first entry and thinks the archive ends here?

@hainesr
Copy link
Member

hainesr commented Nov 22, 2021

Hi @mttkay, thanks for this.

I have no idea what is going on with the total_number_of_entries_in_cdir_on_this_disk field. I'll look into it and see if I can reproduce it. Were all other counts OK as far as you could see?

For the InputStream issue though:

TL;DR: I think you're seeing #493, which is fixed but not released yet. But even with that fix, you can't read files created by Archive with Zip::InputStream - you'll have to use Zip::File I'm afraid.

Archive doesn't construct Zip archives like most tools out there do. It streams files through its compressor and directly to disk in place, and then doesn't go back and fill in local header information for each file. This is fine in its own way, and is allowed for in the spec, but it's really for when you can't seek around in a file (i.e. when you really are streaming). The result is that there's no information in the local header about how big the compressed data is, so you can't decompress it using the local header alone - you need the Central Directory to fill in the gaps, because you do know how big the compressed data is for each entry by the time to get to writing out the CD at the end.

So you can't use Zip::InputStream to read these types of zip file because it (purposely) doesn't read the CD. Usually we complain loud and clear that this is the case, but Archive makes it more complex for us, by not quite following the spec properly again. When streaming you are supposed to leave the compressed size, uncompressed size and crc as zeros in the local header, and put them after the payload in the data descriptor, or "streaming header". Archive writes out the data descriptor, but doesn't set the uncompressed size to zero in the local headers, so it was confusing our original heuristic - hence the fix for #493.

Aside: Archive gives us headaches all the time, and we keep seeing new ones. It's amazing how much effort is wasted as a knock-on effect of one (unfortunately wildly popular) tool taking a lax attitude towards implementing standards, but we are where we are. Think Different I guess? Rant over 😆

Sorry that was all a bit rambling, but hopefully makes some sense.

@mttkay
Copy link
Author

mttkay commented Nov 23, 2021

Were all other counts OK as far as you could see?

I remember that size ended up being incorrect as well; I am not sure about any other fields specifically but nothing stood out at least.

Thanks for the detailed explanation -- that does sound difficult to deal with. 🤔 Happy to close this as a duplicate, though I'm still not sure whether the two issues merely coincided or are separate issues altogether.

If it helps at all, here is the MR on GitLab where the issue was first detected by the reviewer: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/74525

Here is a link to the archive they used for testing: https://gitlab.com/gitlab-org/gitlab/uploads/167723f76729d7a2e6342f6bbae2a786/Archive.zip

It could be interesting to run the (unreleased) entry_count implementation added in #506 against that file as well to see whether it yields the correct values. I had tested it back then with the refinement I added. Though I'm beginning to think there are two things at play here:

  • Problems with files created by the Archive Utility on macOS
  • Problems with reading files on macOS

I'm saying this because with that same file created by Archive, on my Linux laptop, it produced the correct results. It was just on both my co-worker's MacBooks that they were getting the wrong result, which is why I'm thinking there could be another issue with reading EOCD that only occurs when running on macOS. But it is speculation on my end at this point (I don't have a MacBook personally so I cannot verify myself.)

@hainesr
Copy link
Member

hainesr commented Jan 22, 2022

Apologies for not coming back to you on this for so long. Thank you again for such a detailed write-up.

I also can't reproduce this locally on Linux. I'll drop the zip file you link to above into a branch and push it through CI and see what happens.

In the meantime, the interesting bit of that zipinfo output you link to is this:

There are an extra 8 bytes preceding this file.

Which is printed for each entry, except the first one. I have no idea what is going on there yet...

@mttkay
Copy link
Author

mttkay commented Jan 23, 2022

No worries at all, it's not urgent. We have mitigated the problem by taking a product related trade-off and applying stricter limits. We are also looking to potentially move some of the heavier tasks of this kind into a Golang binary that we can shell out to, but that is just a proposal at this point.

hainesr added a commit to hainesr/rubyzip that referenced this issue Jan 23, 2022
I don't have a Mac so CI will have to do...
@hainesr
Copy link
Member

hainesr commented Jan 23, 2022

I can't seem to reproduce this in CI. The below are links into the MacOS runs where I've printed out the Zip::File objects. It shows both size and total_number_of_entries_in_cdir_on_this_disk as correct (6):

Unless I've mis-interpreted the issue you describe above - please let me know if so.

Thank you for your help in getting to the bottom of this!

@mttkay
Copy link
Author

mttkay commented Jan 24, 2022

Yes that looks correct, very odd 🤔 It was tough for me to investigate because I do not have access to a mac myself. Thank you for adding this test; this is great.

I just realized I did not specify in the issue description that GitLab is still on RubyZip 2.0.0, so not the latest version. When first investigating the issue where this popped up I remember running against the latest version on my machine to rule out any differences due to version drift, but the person who stumbled on this on their Mac during a code review did not IIRC.

I'm happy to close this out if this isn't reproducible. We can always investigate on our end again should it occur again, and re-file with a reproducible test case if necessary. It's always tough to tackle issues that have no reproducible test case attached.

hainesr added a commit to hainesr/rubyzip that referenced this issue Jan 30, 2022
I don't have a Mac so CI will have to do...
hainesr added a commit to hainesr/rubyzip that referenced this issue Jan 30, 2022
I don't have a Mac so CI will have to do...
@hainesr
Copy link
Member

hainesr commented Jan 30, 2022

Ah, OK, I've now moved that commit onto the 2.3.2 branch. There are a few failing tests due to Coveralls not being set up properly back that far, but the tests we're interested in do pass:

Those lines show that total_number_of_entries_in_cdir_on_this_disk is correct (6), and size is a lot higher up in the output and is also correct (6).

So it looks like I can't reproduce this at the moment, sorry.

@mttkay
Copy link
Author

mttkay commented Jan 31, 2022

If we have passing tests and no test case to reproduce this issue, I am fine with just closing this out. We can always re-open or re-file should we find out how to actually reproduce this reliably.

@hainesr
Copy link
Member

hainesr commented Jan 31, 2022

Thanks @mttkay, I agree. I'll close this for now and we can come back to it if we get a reproducible example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants