Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading entries with data descriptors broken on many levels #295

Closed
julik opened this issue Jul 27, 2016 · 5 comments
Closed

Reading entries with data descriptors broken on many levels #295

julik opened this issue Jul 27, 2016 · 5 comments

Comments

@julik
Copy link

julik commented Jul 27, 2016

I am currently working out the last details on our own ZIP writer implementation. When doing so, I wrote an implementation of data descriptors + bit3 of the GP flags. Unfortunately, I simply cannot manage to make Rubyzip to read the entries that have data descriptors.

So far I saw the following spots where things go really wrong:

data_descriptor_size is inaccurate.

The data descriptor size is assumed to be 16 bytes (the data descriptor signature, the CRC and two sizes). However, for Zip64 entries the sizes will be 8 bytes long. Additionally, the signature is optional according to the APPNOTE. It is safe to assume it will be present, but computing offsets based off of it blindly is an invitation to a party.

##get_input_stream is supposed to raise an error, but doesnt

When reading a file using Zip::File.open() and obtaining an input stream for the entry body, the GP flag apparently does not get interpreted correctly. In the code I see that if bit3 is set, it is supposed to raise an error and tell me to use Zip::File. However, when I try to use Zip::File I get the same Entry returned that cannot be read

get_input_stream relies on reading local headers (which is THE issue here)

In general, relying on local headers for reading the entry body when you already have the central directory is madness. The local header for an entry with a data descriptor is forward-declared, and is very likely not to contain Zip64 extras - because when the entry was being written the writing code probably did not know the sizes of the file yet, and did not know the sizes of the compressed file data either. The fact that get_input_stream wants to read the local entry header, and potentially even clobbers the entry read from the central directory, makes it pretty impossible to do anything with data descriptors reliably - since if you use the forward-declared "dummy" local header, the only way to figure out where the file ends is to scan all the way ahead until you stumble upon the signature for the data descriptor. Which will never happen to be within a stored entry that you are reading, because people never put ZIPs within other ZIPs, right?

@local_header_size computed incorrectly (well, not computed rather)

When I try to open a file with data descriptors and raise-inspect the entries that Zip::File.foreach() gives me, the entries have their local header size set to zero. No wonder that trying to read them yields odd results, because probably the code thinks that the file data starts right at the offset where the local header is.

How to solve it

The only practical way to solve this, IMO, is to get rid of reading local headers altogether, because this way lies madness. Additionally, we have many situations where the local header can be written without the Zip64 extra fields, but since the file is located beyound the Zip64 span of the archive body you will have to write out the Zip64 central directory and the extra field anyway - but it will only be present in the central directory. So the only way I can see this being solvable is reading the entry metadata once, from the central directory, and using that as authoritative data. It also provides the exact offsets where the actual entry can be found.

Consequently, this is the approach taken by https://github.com/thejoshwolfe/yauzl - the author himself says that combining the local headers and the central directory information is insane and is basically unsolvable.

However, obviously, if you fix it that way it will not be backwards-compatible 100%, because some APIs will have to be removed. So I can tackle this but we'll have to break a few eggs in the process.

Curious about your opinions.

An example file that we write with our library can be found here: https://we.tl/IChJczPVYr
Opens perfectly fine with all the tools I tested it with, except... Rubyzip.

@idoru
Copy link

idoru commented Apr 6, 2018

I've recently opened PR #358 which falls back to central directory size if present when the local header general purpose bit 3 is set; not the cleanest fix but the simplest without re-architecting rubyzip. It worked for what I needed it for.

@hainesr
Copy link
Member

hainesr commented Jun 25, 2021

Thanks for this @julik and sorry it's taken so long to get round to looking at things around data descriptors.

I think you are absolutely right about not trying to merge the central directory and local header data. I tried it once and it did not go well. Unfortunately, it's not quite that simple thanks to extra fields 😢 In a few cases the extra field stored in the central directory is a marker, or cut-down version of that in the local header, so we do have to read the local header version. I fixed this recently (#459) by ignoring everything in the local header, except the extra fields.

But the rule of thumb is: use Zip::File if at all possible, and certainly if you want to treat the archive like a folder; use Zip::InputStream if you really must treat the archive like a stream.

I have tried your data-descriptors.zip file linked above - thanks! - and the current HEAD of rubyzip can read it with no issues using Zip::File.get_input_stream, so hopefully that means some things have been fixed over the years. And I've recently tightened up some checking to ensure that Zip::InputStream complains when it is used as the entry point with this type of file.

I will look at making sure we do something more sensible with data_descriptor_size, which I agree is completely wrong.

May I use data-descriptors.zip as a fixture to test against in future versions of rubyzip?

@hainesr
Copy link
Member

hainesr commented Jun 28, 2021

Update: I've removed Entry.data_descriptor_size and we now raise an error if something tries to use InputStream when a data descriptor is present (commit).

@julik
Copy link
Author

julik commented Oct 31, 2021

Lovely! Feel free to close the issue if appropriate, and thanks for getting back on this.

@hainesr
Copy link
Member

hainesr commented Oct 31, 2021

Thanks! I will close this one, since it is referenced in #460.

@hainesr hainesr closed this as completed Oct 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants