Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new APIs that allow copying zip file entries between zip files #98

Merged
merged 1 commit into from Nov 15, 2020
Merged

Add new APIs that allow copying zip file entries between zip files #98

merged 1 commit into from Nov 15, 2020

Conversation

robmv
Copy link
Contributor

@robmv robmv commented Feb 1, 2019

The copy is done directly using the raw compressed data, avoiding decompression and recompression.

In summary, the changes:

  • Add a private API for ZipFile to get the raw stream that doesn't do any CRC check or decompression.
  • Add the copy_file and copy_file_rename to ZipWriter that use that raw stream to copy the file.

I get nearly 95% faster Zip file modifications when I copy nearly all files, and just ignore the files I am removing from the Zip.

closes #95

@robmv
Copy link
Contributor Author

robmv commented Mar 14, 2019

Just pinging this pull request. Please tell me any changes required to accept this feature.

@J0sh0nat0r
Copy link

Any idea when this will be merged? Currently depending on this branch, no issues so far, great work! :)

@rylev
Copy link
Contributor

rylev commented Jun 15, 2020

@robmv unfortunately this branch needs a bunch of conflicts to be resolved. Would you be able to do that? I can then give this a review

@Plecra
Copy link
Member

Plecra commented Jun 17, 2020

Quickly skimming through, it doesn't look like this method validates the original zip file, which could result in a malformed archive. Ideally, the library shouldn't be able to do that by default.

I don't think using CRC should impose much extra overhead. If it causes a regression, it'd be worth making a new issue to track it

@robmv
Copy link
Contributor Author

robmv commented Jun 17, 2020

@robmv unfortunately this branch needs a bunch of conflicts to be resolved. Would you be able to do that? I can then give this a review

I will try to update it in the following days and see about the @Plecra suggestion about the CRC. The CRC will need the file to be decompressed If I am correct and that will add some overhead, It is probably negligible because the compression is the more CPU intensive operation and that will not be needed. But if this gives me a little overhead I will try to add an option to do the CRC or not.

In my use case I really don't need the CRC check because the files are digitally signed and the signature stored on an external file, and it is checked when then ZIP file is retrieved so the CRC check is redundant and I need speed on removing things from the file, but I understand some people will want the CRC to be validated.

@Plecra
Copy link
Member

Plecra commented Jun 23, 2020

Ah, I had been under the impression that the CRC was performed on the compressed data. I can completely understand why you wouldn't care for that, I'd just like to make the difference clear to users. Is there a convention like _unchecked for correctness? copy_unvalidated? copy_quick? copy_iffy?

@robmv
Copy link
Contributor Author

robmv commented Jul 23, 2020

Rebased. The only mayor difference with the previous patch is that the implementation of make_reader() was split in half, adding make_crypto_reader(). This was done in order to be able to return ZipResult for an unsupported compression method or bad password at the same time the current code does it. If I didn't split that, the Read implementation should have to hide these errors as io::Error, breaking the current behaviour.

I still doesn't do decompression and CRC checks when copying files, if anyone has suggestions for naming an API that does that and one that doesn't, I would gladly update it.

@robmv
Copy link
Contributor Author

robmv commented Aug 14, 2020

@rylev Still interested on doing the review?

@rylev
Copy link
Contributor

rylev commented Aug 19, 2020

@Plecra can you review this?

@rylev rylev requested a review from Plecra August 19, 2020 11:44
Copy link
Member

@Plecra Plecra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @robmv, just one little nitpick

src/write.rs Outdated Show resolved Hide resolved
@Plecra
Copy link
Member

Plecra commented Aug 19, 2020

In time, I'd also like to expand on the documentation of this feature. A doctest would be great.

I'd hope we could avoid using two methods here too, but I'm not sure there's a good solution with the current API

@robmv
Copy link
Contributor Author

robmv commented Aug 31, 2020

In time, I'd also like to expand on the documentation of this feature. A doctest would be great.

@Plecra: done!, doctest added to the raw_copy* methods.

@robmv robmv requested a review from Plecra August 31, 2020 21:03
@Plecra
Copy link
Member

Plecra commented Sep 3, 2020

@rylev I'm happy with this. Could you give the changes a once over?

@eulerdisk
Copy link

eulerdisk commented Sep 3, 2020

Does this new API work only for copying from an existing ZipFile? What if you have no original ZipFile to start from?
I'm doing an application where I need to spread the same data to different zip files, it would be great if I could compress it once before and then reuse the compressed data.

@robmv
Copy link
Contributor Author

robmv commented Sep 3, 2020

@eulerdisk true, only from an existing file. Your use case sounds interesting, but maybe an API for that will be more complex. what if the files are larger? leave the compressed version on RAM or dump to a temporary file?

With this proposed PR API you could create a temporary ZIP archive (RAM or file backed) and get the files to reuse from there.

@eulerdisk
Copy link

With this proposed PR API you could create a temporary ZIP archive (RAM or file backed) and get the files to reuse from there.

Yep, now I can do that thanks to your PR :-)

maybe an API for that will be more complex. what if the files are larger? leave the compressed version on RAM or dump to a temporary file?

My idea is that you should be able to create a ZipFile wrapping a Read object (which than can be in RAM or Disk), then add the ZipFile to the archive. Though I'm new to this library so I don't know if that make sense.

@Plecra
Copy link
Member

Plecra commented Sep 4, 2020

@eulerdisk Interesting idea! Compression is pretty fast, but there are bound to be use cases for caching its output.

We couldn't create a simple wrapper around Read, as it wouldn't provide somewhere to keep the compressed data. Instead, you could write to a ZipWriter<io::Cursor<Vec<u8>>> to cache the results, and copy it like @robmv explained. Maybe we could make this simpler in the future...

That said, I think it's beyond the scope of this PR and would prefer some design discussion before any code changes are proposed.

@Plecra
Copy link
Member

Plecra commented Sep 13, 2020

Sorry about this @robmv but we had to change the API of the crypto methods, and this PR now has merge conflicts. I'll probably figure out how to resolve them in time, so I'm just letting you know it could take a while

The copy is done directly using the raw compressed data, avoiding
decompression and recompression.
@robmv
Copy link
Contributor Author

robmv commented Nov 3, 2020

Updated, hopefully this one is the winner!

@Plecra
Copy link
Member

Plecra commented Nov 3, 2020

Lovely! I'm sorry to say I'm not sure when I'll next be able to complete the review, but this PR's a top priority 😉

@Plecra
Copy link
Member

Plecra commented Nov 15, 2020

Ace! Finally time to get this published. Thanks for your contribution @robmv

@Plecra Plecra merged commit 4d8a068 into zip-rs:master Nov 15, 2020
zacps added a commit to zacps/zip-rs that referenced this pull request Nov 16, 2020
@robmv robmv deleted the feature-copy branch November 16, 2020 12:18
kyrias pushed a commit to kyrias/zip that referenced this pull request May 6, 2024
refactor: Make `ZipWriter::finish()` consume the `ZipWriter`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add API to copy ZipFile instances directly to a new ZipWriter
5 participants