Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix updating zips whose entries have descriptors. #475

Merged
merged 5 commits into from Nov 21, 2020

Conversation

Numpsy
Copy link
Contributor

@Numpsy Numpsy commented Jun 22, 2020

refs #147, and my comments in there - I'm not sure what the -4's in the descriptor size calculation are for, and removing them does appear to fix some issues that occur when updating archives whose entries contain descriptors.

I added one test for adding an entry, and one test for deleting an entry, which could perhaps be done in one go.

The addition test goes through the code path containing the TODO at

// TODO: Find out why this calculation comes up 4 bytes short on some entries in ODT (Office Document Text) archives.
which sounds like it fits the change, but I haven't yet tested it with an actual ODT file.

I certify that I own, and have sufficient rights to contribute, all source code and related material intended to be compiled or integrated with the source code for the SharpZipLib open source product (the "Contribution"). My Contribution is licensed under the MIT License.

@piksel piksel self-assigned this Jun 23, 2020
@piksel piksel self-requested a review June 23, 2020 18:35
Copy link
Member

@piksel piksel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This not entirely correct as the "real" DataDescriptorSize SHOULD be 12 bytes, and not 16.

The reason for setting it to 16 is due to the unofficial signature 0x08074b50 that MAY be prepended to the descriptors.

I think the real solution for this is to change the constant to 12 bytes and update all references to it and making sure they handle both descriptors with and without the signature.

I think that it currently supports reading both, but the updating code just presumes that the signature is there.

@Numpsy
Copy link
Contributor Author

Numpsy commented Aug 7, 2020

I'm not sure if some cases (the ones where entries are deleted and it moves data about during the update) need/want to be copying the descriptor - given that ZipFile knows what the size etc values are at that point, it could possibly drop the descriptors instead? (section 4.3.9.1 of the appnote suggests not including them when the output is seekable, though maybe thats about creation rather than updating?)
(I think there might be some cases now where it fills in the fields in the header, and then keeps the descriptor as well, which might not be right/desirable? the screen shot in #147 suggests that).

Otherwise - check if there is a descriptor signature directly after the file data, and then decide how much to copy based on it being present or not? (or maybe add the signature if it isn't there already - section 4.3.9.4 of the appnote says it should be included)

I think that it currently supports reading both

think ZipInputStream might require the signature?

but the updating code just presumes that the signature is there.

presumes it's there or not there? (if the subtract 4 calculation is so that it copies the amount of data from a descriptor with no signature i mean)

@piksel
Copy link
Member

piksel commented Aug 10, 2020

I created a report from your test:
https://archivediagstore.z16.web.core.windows.net/reports/4aczwu4oakmYzl7IA1Zibz.html

Notice that there is only one local header, this is because the signature for the second file gets written to the second part of the DataDescriptor Uncompressed Size for the previous entry.
This seems to happen because the CopyEntryDirect method calculates the destinationPosition using GetDescriptorSize() which removes the 4 signature bytes from the size. This means that the calculation will be correct if no signature was present in the archive entry before updating.
So, instead of just removing the 4 byte offset, we should add a second argument to GetDescriptorSize that is either a bool withSignature or something that can be used to deduce if the signature is present or not.

Regarding #147, if you look at the descriptor uncompressed size in hex it's 0x504B0304 which is...

"PK\03\04" = LocalHeaderSignature

One odd thing I noticed was that the zip64 extra data for the local header is 0xffffffffffffffff and not 0x0000000000000000 which doesn't seem right. I guess the extra data shouldn't be written at all when using descriptors? I couldn't find any information about this in the APPNOTE though...

@piksel
Copy link
Member

piksel commented Aug 10, 2020

Some more discoveries regarding descriptors:

  • ZipOutputStream always writes the signature (if the stream is non-seekable that is)
  • ZipFile ignores the descriptor and always uses the sizes from the central directory, but compares them with the local header sizes if Descriptor bit is not set (comments claiming that there is no way to compare the sizes if set in descriptor (?!)).
  • ZipInputStream requires the signature to be present when reading the sizes from the descriptor (after reading the data or closing the entry). It will throw on attempting to get the length of the entry and will instead rely on the deflate-stream to indicate the end of the entry data.

@piksel
Copy link
Member

piksel commented Aug 11, 2020

I added my own solution to your branch, it is a bit heavier, but it should also fix those random zip files that skips the descriptor signature. Feel free to revert the changes if you don't think it is necessary, the rest of the library seems to fully disregard that possibility:

// In theory this may not be a descriptor according to PKZIP appnote.
// In practise its always there.
if (intValue != ZipConstants.DataDescriptorSignature)
{
	throw new ZipException("Data descriptor signature not found");
}

@Numpsy
Copy link
Contributor Author

Numpsy commented Aug 11, 2020

One odd thing I noticed was that the zip64 extra data for the local header is 0xffffffffffffffff and not 0x0000000000000000 which doesn't seem right. I guess the extra data shouldn't be written at all when using descriptors? I couldn't find any information about this in the APPNOTE though...

I'm not sure.
Section 4.4.9 of the appnote says

If bit 3 of the general purpose bit flag is set, 
       these fields are set to zero in the local header and the 
       correct values are put in the data descriptor and
       in the central directory.  If an archive is in ZIP64 format
       and the value in this field is 0xFFFFFFFF, the size will be
       in the corresponding 8 byte ZIP64 extended information 
       extra field.

which you could take to mean that in descriptor mode, the regular values get set to 0, and then because they aren't 0xFFFFFFFF the real values won't be in the Zip64 extra data (which might not serve a purpose in this situation)?

My comment about 147 was referring to the local header size fields being non-zero:

image

So having the descriptor flag set, and having the local header values being non-zero, and then putting the sizes in the descriptor as well doesn't seem right from that appnote bit?

@piksel
Copy link
Member

piksel commented Aug 12, 2020

Oh, right, yeah it should either clear the local header sizes (0 or -1) or clear the descriptor bit.

My interpretation is that when Zip64 is used, the field should still be set to -1 to indicate that the sizes are in 64-bit mode, but that the zero values should be put in the Zip64 Extra Data fields. If you would set the local header sizes to 0 there would be no way to know the lengths of the descriptor fields, right?

In the report generated above we can see that it does work for Zip64:

Compressed Size: 4294967295 (Zip64 Indicator)
Uncompressed Size: 4294967295 (Zip64 Indicator)

But that is expected since the values are the same for descriptor-aided and regular entries.

@Numpsy
Copy link
Contributor Author

Numpsy commented Aug 16, 2020

ah, yes, it needs the extra data field to infer zip64 format, which is then used to determine what size the descriptor field would be.

if it's possible to clear the descriptor bit and not have to copy the descriptor then that removes the need for working out how big it is etc, which should simplify things? I'll have a look at what that change might require.

@piksel
Copy link
Member

piksel commented Aug 16, 2020

It does simplify it, but it also makes it a bit more complicated.

The code that actually caused the corruption was calculating how many bytes to skip to seek to the next entry, so that calculation would still need to be performed, unless we rewrite all entries with non-descriptor versions when doing an update, but then it's less of an update to the archive and more of a total recreation of the archive.

Consider the I/O and CPU required to add a 3 byte text file to a 4 GB archive. Just seeking to the end of the last entry, appending it and then writing the new central directory is pretty fast, but moving all the data could be really expensive.

@Numpsy
Copy link
Contributor Author

Numpsy commented Aug 16, 2020

My thought was that it could perhaps be changed in the !skipOver case in CopyEntryDirect at

, which is the case that already rewrites the local entry header and copies all the file content around (in the non-skip path of the code at
if (update.Entry.CompressedSize > 0)
)

The other arm @

which just leaves the data as it and moves over it would still need to work out how many bytes the descriptor is.

I think that would cover removing the descriptors in cases where the data has to be moved about due to entries being deleted, but in your example of just adding an extra file to the end of an archive which is otherwise unchanged it would leave everything as it does now?

There's also the separate case of CopyEntry that is used when doing updates in safe mode, but as that is already building a new temporary file, just dropping the descriptors altogether in that case might be ok?

@piksel
Copy link
Member

piksel commented Aug 16, 2020

Yeah, that should be fine. It seems like a larger fix, but if you're up to it... 😄

@piksel piksel added this to the v1.3 milestone Aug 16, 2020
@Numpsy
Copy link
Contributor Author

Numpsy commented Aug 17, 2020

I'll have a look.
What's the timescale for the 1.3 version, if this would be useful to go into that?

@piksel
Copy link
Member

piksel commented Aug 18, 2020

Well, the timescale is roughly "when this is done", I think. I figured this would be a good candidate, but we can also skip it and release it with 1.3.1 later

@Numpsy
Copy link
Contributor Author

Numpsy commented Aug 18, 2020

Ok, i'll try to have another look at it tommorow.

@Numpsy
Copy link
Contributor Author

Numpsy commented Oct 8, 2020

perhaps best to stick with this one to start with, and think about the descriptor-removal change (#512) seperately?

(512 does have a couple of extra unit tests that might be useful though, will have to remind myself which changes were where)

@piksel
Copy link
Member

piksel commented Oct 8, 2020

My initial instinct was that it was safer to do this first. It seems more like fixing a bug than altering the behaviour. But then again, the changes #512 makes more sense than sticking to the descriptors as they are inferior (and we obviously can seek if we are updating a ZipFile (or creating an altered copy)). That's why I figured it was better to just release 1.3 and pick this up in the next version, which hopefully won't take too long to get to.

HowToDoThis added a commit to HowToDoThis/SharpZipLib that referenced this pull request Nov 21, 2020
piksel added a commit that referenced this pull request Nov 21, 2020
- Unit tests for using ZipFile to update file entries that have descriptors
- Fix size calculation in GetDescriptorSize
- Handle optional descriptor signature when updating
@piksel piksel merged commit c5c187d into icsharpcode:master Nov 21, 2020
@Numpsy Numpsy deleted the rw/zipfile/test_descriptor_update branch December 6, 2020 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants