Fix updating zips whose entries have descriptors. #475

Numpsy · 2020-06-22T22:04:21Z

refs #147, and my comments in there - I'm not sure what the -4's in the descriptor size calculation are for, and removing them does appear to fix some issues that occur when updating archives whose entries contain descriptors.

I added one test for adding an entry, and one test for deleting an entry, which could perhaps be done in one go.

The addition test goes through the code path containing the TODO at

SharpZipLib/src/ICSharpCode.SharpZipLib/Zip/ZipFile.cs

Line 2753 in 32920f9

    
           // TODO: Find out why this calculation comes up 4 bytes short on some entries in ODT (Office Document Text) archives.

which sounds like it fits the change, but I haven't yet tested it with an actual ODT file.

I certify that I own, and have sufficient rights to contribute, all source code and related material intended to be compiled or integrated with the source code for the SharpZipLib open source product (the "Contribution"). My Contribution is licensed under the MIT License.

…tors

piksel

This not entirely correct as the "real" DataDescriptorSize SHOULD be 12 bytes, and not 16.

The reason for setting it to 16 is due to the unofficial signature 0x08074b50 that MAY be prepended to the descriptors.

I think the real solution for this is to change the constant to 12 bytes and update all references to it and making sure they handle both descriptors with and without the signature.

I think that it currently supports reading both, but the updating code just presumes that the signature is there.

Numpsy · 2020-08-07T22:51:00Z

I'm not sure if some cases (the ones where entries are deleted and it moves data about during the update) need/want to be copying the descriptor - given that ZipFile knows what the size etc values are at that point, it could possibly drop the descriptors instead? (section 4.3.9.1 of the appnote suggests not including them when the output is seekable, though maybe thats about creation rather than updating?)
(I think there might be some cases now where it fills in the fields in the header, and then keeps the descriptor as well, which might not be right/desirable? the screen shot in #147 suggests that).

Otherwise - check if there is a descriptor signature directly after the file data, and then decide how much to copy based on it being present or not? (or maybe add the signature if it isn't there already - section 4.3.9.4 of the appnote says it should be included)

I think that it currently supports reading both

think ZipInputStream might require the signature?

but the updating code just presumes that the signature is there.

presumes it's there or not there? (if the subtract 4 calculation is so that it copies the amount of data from a descriptor with no signature i mean)

piksel · 2020-08-10T18:06:23Z

I created a report from your test:
https://archivediagstore.z16.web.core.windows.net/reports/4aczwu4oakmYzl7IA1Zibz.html

Notice that there is only one local header, this is because the signature for the second file gets written to the second part of the DataDescriptor Uncompressed Size for the previous entry.
This seems to happen because the CopyEntryDirect method calculates the destinationPosition using GetDescriptorSize() which removes the 4 signature bytes from the size. This means that the calculation will be correct if no signature was present in the archive entry before updating.
So, instead of just removing the 4 byte offset, we should add a second argument to GetDescriptorSize that is either a bool withSignature or something that can be used to deduce if the signature is present or not.

Regarding #147, if you look at the descriptor uncompressed size in hex it's 0x504B0304 which is...

"PK\03\04" = LocalHeaderSignature

One odd thing I noticed was that the zip64 extra data for the local header is 0xffffffffffffffff and not 0x0000000000000000 which doesn't seem right. I guess the extra data shouldn't be written at all when using descriptors? I couldn't find any information about this in the APPNOTE though...

piksel · 2020-08-10T21:20:50Z

Some more discoveries regarding descriptors:

ZipOutputStream always writes the signature (if the stream is non-seekable that is)
ZipFile ignores the descriptor and always uses the sizes from the central directory, but compares them with the local header sizes if Descriptor bit is not set (comments claiming that there is no way to compare the sizes if set in descriptor (?!)).
ZipInputStream requires the signature to be present when reading the sizes from the descriptor (after reading the data or closing the entry). It will throw on attempting to get the length of the entry and will instead rely on the deflate-stream to indicate the end of the entry data.

Without reverting unrelated code this time hopefully

piksel · 2020-08-11T00:28:12Z

I added my own solution to your branch, it is a bit heavier, but it should also fix those random zip files that skips the descriptor signature. Feel free to revert the changes if you don't think it is necessary, the rest of the library seems to fully disregard that possibility:

// In theory this may not be a descriptor according to PKZIP appnote.
// In practise its always there.
if (intValue != ZipConstants.DataDescriptorSignature)
{
	throw new ZipException("Data descriptor signature not found");
}

Numpsy · 2020-08-11T22:48:42Z

One odd thing I noticed was that the zip64 extra data for the local header is 0xffffffffffffffff and not 0x0000000000000000 which doesn't seem right. I guess the extra data shouldn't be written at all when using descriptors? I couldn't find any information about this in the APPNOTE though...

I'm not sure.
Section 4.4.9 of the appnote says

If bit 3 of the general purpose bit flag is set, 
       these fields are set to zero in the local header and the 
       correct values are put in the data descriptor and
       in the central directory.  If an archive is in ZIP64 format
       and the value in this field is 0xFFFFFFFF, the size will be
       in the corresponding 8 byte ZIP64 extended information 
       extra field.

which you could take to mean that in descriptor mode, the regular values get set to 0, and then because they aren't 0xFFFFFFFF the real values won't be in the Zip64 extra data (which might not serve a purpose in this situation)?

My comment about 147 was referring to the local header size fields being non-zero:

So having the descriptor flag set, and having the local header values being non-zero, and then putting the sizes in the descriptor as well doesn't seem right from that appnote bit?

piksel · 2020-08-12T09:24:54Z

Oh, right, yeah it should either clear the local header sizes (0 or -1) or clear the descriptor bit.

My interpretation is that when Zip64 is used, the field should still be set to -1 to indicate that the sizes are in 64-bit mode, but that the zero values should be put in the Zip64 Extra Data fields. If you would set the local header sizes to 0 there would be no way to know the lengths of the descriptor fields, right?

In the report generated above we can see that it does work for Zip64:

Compressed Size: 4294967295 (Zip64 Indicator)
Uncompressed Size: 4294967295 (Zip64 Indicator)

But that is expected since the values are the same for descriptor-aided and regular entries.

Numpsy · 2020-08-16T11:08:17Z

ah, yes, it needs the extra data field to infer zip64 format, which is then used to determine what size the descriptor field would be.

if it's possible to clear the descriptor bit and not have to copy the descriptor then that removes the need for working out how big it is etc, which should simplify things? I'll have a look at what that change might require.

piksel · 2020-08-16T11:41:55Z

It does simplify it, but it also makes it a bit more complicated.

The code that actually caused the corruption was calculating how many bytes to skip to seek to the next entry, so that calculation would still need to be performed, unless we rewrite all entries with non-descriptor versions when doing an update, but then it's less of an update to the archive and more of a total recreation of the archive.

Consider the I/O and CPU required to add a 3 byte text file to a 4 GB archive. Just seeking to the end of the last entry, appending it and then writing the new central directory is pretty fast, but moving all the data could be really expensive.

Numpsy · 2020-08-16T14:13:22Z

My thought was that it could perhaps be changed in the !skipOver case in CopyEntryDirect at

SharpZipLib/src/ICSharpCode.SharpZipLib/Zip/ZipFile.cs

Line 2728 in 61d3a21

if (!skipOver)

, which is the case that already rewrites the local entry header and copies all the file content around (in the non-skip path of the code at

SharpZipLib/src/ICSharpCode.SharpZipLib/Zip/ZipFile.cs

Line 2765 in 61d3a21

if (update.Entry.CompressedSize > 0)

)

The other arm @

SharpZipLib/src/ICSharpCode.SharpZipLib/Zip/ZipFile.cs

Line 2759 in 61d3a21

destinationPosition +=

which just leaves the data as it and moves over it would still need to work out how many bytes the descriptor is.

I think that would cover removing the descriptors in cases where the data has to be moved about due to entries being deleted, but in your example of just adding an extra file to the end of an archive which is otherwise unchanged it would leave everything as it does now?

There's also the separate case of CopyEntry that is used when doing updates in safe mode, but as that is already building a new temporary file, just dropping the descriptors altogether in that case might be ok?

piksel · 2020-08-16T16:55:45Z

Yeah, that should be fine. It seems like a larger fix, but if you're up to it... 😄

Numpsy · 2020-08-17T20:49:27Z

I'll have a look.
What's the timescale for the 1.3 version, if this would be useful to go into that?

piksel · 2020-08-18T08:18:34Z

Well, the timescale is roughly "when this is done", I think. I figured this would be a good candidate, but we can also skip it and release it with 1.3.1 later

Numpsy · 2020-08-18T22:05:43Z

Ok, i'll try to have another look at it tommorow.

Numpsy · 2020-10-08T09:54:01Z

perhaps best to stick with this one to start with, and think about the descriptor-removal change (#512) seperately?

(512 does have a couple of extra unit tests that might be useful though, will have to remind myself which changes were where)

piksel · 2020-10-08T14:36:05Z

My initial instinct was that it was safer to do this first. It seems more like fixing a bug than altering the behaviour. But then again, the changes #512 makes more sense than sticking to the descriptors as they are inferior (and we obviously can seek if we are updating a ZipFile (or creating an altered copy)). That's why I figured it was better to just release 1.3 and pick this up in the next version, which hopefully won't take too long to get to.

- Unit tests for using ZipFile to update file entries that have descriptors - Fix size calculation in GetDescriptorSize - Handle optional descriptor signature when updating

Numpsy added 2 commits June 22, 2020 22:47

Unit tests for using ZipFile to update file entries that have descrip…

68c0a58

…tors

fix size calculation in GetDescriptorSize

6411e6d

piksel self-assigned this Jun 23, 2020

piksel self-requested a review June 23, 2020 18:35

piksel requested changes Aug 7, 2020

View reviewed changes

piksel added 2 commits August 11, 2020 02:09

Handle optional descriptor signature when updating

01e8135

Handle optional descriptor signature when updating

0fd6396

Without reverting unrelated code this time hopefully

piksel added this to the v1.3 milestone Aug 16, 2020

Numpsy mentioned this pull request Aug 19, 2020

experiment with removing descriptors from zip file entries during updates #512

Draft

HowToDoThis added a commit to HowToDoThis/SharpZipLib that referenced this pull request Nov 21, 2020

Cleanup and Implement icsharpcode#475

5a15b93

piksel approved these changes Nov 21, 2020

View reviewed changes

Merge branch 'master' into rw/zipfile/test_descriptor_update

2eedb2f

piksel merged commit c5c187d into icsharpcode:master Nov 21, 2020

Numpsy deleted the rw/zipfile/test_descriptor_update branch December 6, 2020 22:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix updating zips whose entries have descriptors. #475

Fix updating zips whose entries have descriptors. #475

Numpsy commented Jun 22, 2020

piksel left a comment

Numpsy commented Aug 7, 2020

piksel commented Aug 10, 2020 •

edited

piksel commented Aug 10, 2020

piksel commented Aug 11, 2020

Numpsy commented Aug 11, 2020

piksel commented Aug 12, 2020

Numpsy commented Aug 16, 2020

piksel commented Aug 16, 2020

Numpsy commented Aug 16, 2020

piksel commented Aug 16, 2020

Numpsy commented Aug 17, 2020

piksel commented Aug 18, 2020

Numpsy commented Aug 18, 2020

Numpsy commented Oct 8, 2020

piksel commented Oct 8, 2020

Fix updating zips whose entries have descriptors. #475

Fix updating zips whose entries have descriptors. #475

Conversation

Numpsy commented Jun 22, 2020

piksel left a comment

Choose a reason for hiding this comment

Numpsy commented Aug 7, 2020

piksel commented Aug 10, 2020 • edited

piksel commented Aug 10, 2020

piksel commented Aug 11, 2020

Numpsy commented Aug 11, 2020

piksel commented Aug 12, 2020

Numpsy commented Aug 16, 2020

piksel commented Aug 16, 2020

Numpsy commented Aug 16, 2020

piksel commented Aug 16, 2020

Numpsy commented Aug 17, 2020

piksel commented Aug 18, 2020

Numpsy commented Aug 18, 2020

Numpsy commented Oct 8, 2020

piksel commented Oct 8, 2020

piksel commented Aug 10, 2020 •

edited