Fix high memory usage due to Deflater buffering
When support for ZipCrypto was added, an internal StringIO buffer was
added to Deflater in order to fix a decryption bug. While this worked,
it caused unbounded memory growth when compressing large files. The
proper fix is to write the encryption header in init_next_entry instead
of finalize_current_entry, so the header is written before any
encrypted data. With that in place we can remove the buffering in
Deflater, which keeps memory usage low and allows compressed data to be
streamed while it is written.

This should fix issue #233.
felixbuenemann committed Oct 17, 2015
1 parent a3ca219 commit 7a4b8bb
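
For context, this is the kind of streaming write the fix enables; a minimal, hypothetical sketch (the archive name, entry name, and chunk size are illustrative, not from the commit):

```ruby
require 'zip'

# With the internal buffering removed, deflated bytes reach the underlying
# IO as zlib emits them, instead of accumulating per entry in memory.
Zip::OutputStream.open('large.zip') do |zos|
  zos.put_next_entry('big_file.bin')
  File.open('big_file.bin', 'rb') do |f|
    # Stream in bounded chunks; memory stays flat regardless of file size.
    while (chunk = f.read(64 * 1024))
      zos << chunk
    end
  end
end
```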
Showing 2 changed files with 7 additions and 4 deletions.
lib/zip/deflater.rb (6 additions, 3 deletions)
@@ -7,18 +7,21 @@ def initialize(output_stream, level = Zip.default_compression, encrypter = NullE
       @size = 0
       @crc = ::Zlib.crc32
       @encrypter = encrypter
-      @buffer_stream = ::StringIO.new('')
     end

     def <<(data)
       val = data.to_s
       @crc = Zlib.crc32(val, @crc)
       @size += val.bytesize
-      @buffer_stream << @zlib_deflater.deflate(data)
+      buffer = @zlib_deflater.deflate(data)
+      if buffer.empty?
+        @output_stream
+      else
+        @output_stream << @encrypter.encrypt(buffer)
+      end
     end

     def finish
-      @output_stream << @encrypter.encrypt(@buffer_stream.string)
       @output_stream << @encrypter.encrypt(@zlib_deflater.finish) until @zlib_deflater.finished?
     end

lib/zip/output_stream.rb (1 addition, 1 deletion)
@@ -123,7 +123,6 @@ def copy_raw_entry(entry)

     def finalize_current_entry
       return unless @current_entry
-      @output_stream << @encrypter.header(@current_entry.mtime)
       finish
       @current_entry.compressed_size = @output_stream.tell - @current_entry.local_header_offset - @current_entry.calculate_local_header_size
       @current_entry.size = @compressor.size
@@ -139,6 +138,7 @@ def init_next_entry(entry, level = Zip.default_compression)
       @entry_set << entry
       entry.write_local_entry(@output_stream)
       @encrypter.reset!
+      @output_stream << @encrypter.header(entry.mtime)
       @compressor = get_compressor(entry, level)
     end


6 comments on commit 7a4b8bb

@jjb commented on 7a4b8bb, Oct 28, 2015

Thanks for this fix!

  1. What do you mean by "large files"? Would it also cause memory growth when compressing any files over time?
  2. Seems like a change like this should come with a test? (I'm not a maintainer, just whining from the sidelines...)

@jjb commented on 7a4b8bb, Oct 28, 2015

Whoops, I just saw the linked ticket... ignore question 1.

@felixbuenemann (Contributor, Author) commented

  1. The previous code buffered the complete deflated bytes for a single archive entry in an internal StringIO, which means nothing was flushed to the zip output stream until the entry was finished.
  2. It didn't change the public API, which is why I didn't touch any existing tests. Adding a test to assert that zlib is flushing to the output stream is a bit tricky, because it's up to the deflate implementation in zlib to decide when it's a good idea to flush (see the sketch below). There are some tunables that can be used to force zlib to flush more predictably, but they are not exposed by the current rubyzip API.
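
That flush behavior can be observed directly with Ruby's Zlib bindings; a minimal sketch (chunk contents and sizes are illustrative), using the same raw-deflate window bits that Deflater uses:

```ruby
require 'zlib'

# zlib decides internally when to emit compressed bytes, so deflate may
# return an empty string while input is still held in zlib's own buffer.
# This is why the new Deflater#<< skips writing empty chunks.
deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)

10.times do |i|
  out = deflater.deflate('a' * 1024)
  puts "chunk #{i}: 1024 bytes in, #{out.bytesize} bytes out"
end

# finish drains whatever zlib still holds.
puts "finish: #{deflater.finish.bytesize} bytes"
```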

@jjb commented on 7a4b8bb, Oct 28, 2015

true!

Thanks again for the fix! I was just being vaguely paranoid about untested code going in, since this was already something that has potentially caused me problems for months or years.

@felixbuenemann (Contributor, Author) commented

I did test it manually, both with and without encryption, with about a gigabyte of generated data (roughly the pattern sketched below), and with a real use case while working on a streaming xlsx writer. Still, an automated test would be preferable.
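
A manual check along those lines might look like the following; a hypothetical sketch (the archive name, entry name, and data volume are illustrative), run while watching the process's resident memory:

```ruby
require 'zip'

# Writes ~1 GiB of generated data into a single entry. Before the fix, the
# entire deflated entry accumulated in a StringIO; after it, memory stays
# roughly flat for the whole write.
Zip::OutputStream.open('memtest.zip') do |zos|
  zos.put_next_entry('generated.bin')
  chunk = Random.new.bytes(1024 * 1024) # 1 MiB of incompressible data
  1024.times { zos << chunk }
end
```

The encrypted case exercises the same path with a Zip::TraditionalEncrypter supplied to the output stream.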

@jjb commented on 7a4b8bb, Oct 29, 2015

🆒
