[I searched but did not find an existing discussion of parallel inflate(). If I missed it, then please point to it. Thanks.]
Hardware-specific and general software improvements to single-threaded code can reduce the time for inflate() by percentages, but I'm interested in using multi-threaded parallelism to speed up inflate() by a factor of 3 to 8 on common consumer-grade multi-core CPUs. Here's one idea that ought to work today, with backward compatibility across all existing inflate() implementations, plus another idea that uses a natural protocol extension to enable general protocol expansion and efficient use of one more processor during inflate().
The backward-compatible idea uses zero-length Stored blocks in the compressed output to mark the parallel workloads for inflate(). During deflate(): if any Stored output block would otherwise contain the four-byte sequence 00 00 FF FF, split that block into two pieces that separate the sequence. In general, deflate() much like pigz does: choose a block size for the input; after processing that many input bytes, flush the pipelines and byte-align the output; then insert a zero-length Stored block, which contains the byte-aligned four-byte sequence 00 00 FF FF, before beginning to process the next input byte. Also, discard any look-back history. The resulting complete output is a sequence of independent (but ordered) workloads for inflate(), separated by the 00 00 FF FF markers, which obeys the existing protocol and is accepted by every existing inflate(). The byte-aligned marking sequence 00 00 FF FF is never produced by ordinary deflate(). So inflate() can dedicate one processor to finding the marks and queuing the sequence of workloads, then joining the pool of inflate() processors when the finding is done. Finding the marks is much like strlen(), for which there are well-known fast algorithms (in particular, faster than inflating the encoded sequence) in both software and hardware.
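As a rough sketch of what the dedicated mark-finding processor might do (this is not zlib API; `find_marks` is a hypothetical helper), a scan can use memchr() to skip runs of non-zero bytes quickly, in the same spirit as a fast strlen():

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: scan a finished deflate stream for the
   byte-aligned 00 00 FF FF sequence left by an empty stored block,
   recording the offset of each marker in marks[].  memchr() skips
   non-zero bytes quickly, so the scan is far cheaper than inflating
   the stream.  Returns the number of markers found. */
size_t find_marks(const unsigned char *buf, size_t len,
                  size_t *marks, size_t max_marks)
{
    size_t n = 0;
    const unsigned char *p = buf, *end = buf + len;
    while (n < max_marks && (size_t)(end - p) >= 4) {
        p = memchr(p, 0x00, (size_t)(end - p) - 3);
        if (p == NULL)
            break;
        if (p[1] == 0x00 && p[2] == 0xFF && p[3] == 0xFF) {
            marks[n++] = (size_t)(p - buf);   /* start of a marker */
            p += 4;                           /* resume after it */
        } else {
            p++;
        }
    }
    return n;
}
```

Each offset returned is a workload boundary; the byte ranges between consecutive markers can then be queued for the pool of inflate() processors.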
The protocol extension, based on inflate.c:
```c
switch (BITS(2)) {
case 0:                 /* stored block */
    {...} break;
case 1:                 /* fixed block */
    {...} break;
case 2:                 /* dynamic block */
    {...} break;
case 3:                 /* INVALID */
    {...}
    state->mode = BAD;
}
```
Change case 3 from INVALID to COMMENT, where the byte length of the Comment is given by a variable-width bitwise encoding using the ss11 method. Do not encode the MSB, which is known to be 1. The remaining lower-order bits are encoded MSB-first, with one data bit and one stop bit for each. Thus the values 4, 5, 6, and 7 can each be encoded in 4 bits, because each has only 2 bits that are not the MSB. (Of course they could be encoded in only 2 bits, but that would preclude other Comment lengths and future protocol extensions.)
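For concreteness, here is one plausible reading of that encoding as code. The pairing order (data bit, then stop bit) and the stop-bit polarity (1 = last data bit) are assumptions on my part, as is the use of '0'/'1' characters for legibility rather than packed bits:

```c
/* Sketch of the ss11-style variable-width encoding described above.
   The implicit MSB of the value is not transmitted; each remaining bit
   (MSB-first) is sent as a data bit followed by a stop bit, where a
   stop bit of 1 (assumed polarity) marks the last data bit. */

/* Encode v (v >= 2) into bits[] as '0'/'1' chars; returns bit count. */
int ss11_encode(unsigned v, char *bits)
{
    int top = 0, n = 0;
    while ((v >> top) > 1)
        top++;                                   /* bits below the MSB */
    for (int i = top - 1; i >= 0; i--) {
        bits[n++] = ((v >> i) & 1) ? '1' : '0';  /* data bit */
        bits[n++] = (i == 0) ? '1' : '0';        /* stop bit: 1 on last */
    }
    bits[n] = '\0';
    return n;
}

/* Decode bits[] back to the value, restoring the implicit MSB. */
unsigned ss11_decode(const char *bits)
{
    unsigned v = 1;                              /* implicit MSB */
    for (int i = 0; bits[i] != '\0'; i += 2) {
        v = (v << 1) | (unsigned)(bits[i] - '0');
        if (bits[i + 1] == '1')                  /* stop bit set: done */
            break;
    }
    return v;
}
```

Under this reading, the values 4..7 each cost 4 bits (two data/stop pairs), matching the count given above.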
Any inflate() can ignore the Comment by decoding the Comment length, then skipping ahead that many bytes. Parallel inflate() takes Comment lengths of 4, 5, 6, and 7 bytes to designate workload byte lengths as {input, output} pairs whose field widths grow in 4-bit increments: 16, 20, 24, or 28 bits each. So a Comment length of 5 bytes (40 bits total) can designate up to 1 MiB of input and output. (Store length - 1 so that the range is 1..1Mi instead of 0..(1Mi - 1).) The lengths allow flexibility (including immediately seek()-able output and nested hierarchy) and error checking, and obviate the dedicated processing for finding the marks.
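The arithmetic behind those field widths can be sketched as follows (the even split of the Comment into two fields is taken from the description above; the helper names are hypothetical):

```c
/* Sketch of the workload-length arithmetic: a Comment of n bytes
   (n in 4..7) carries 8*n bits, split evenly into a 4*n-bit input
   length and a 4*n-bit output length, each stored minus one so the
   representable range is 1 .. 2^(4n) bytes rather than 0 .. 2^(4n)-1. */

/* Width in bits of each of the two length fields. */
unsigned field_bits(unsigned comment_bytes)
{
    return 4 * comment_bytes;                 /* 16, 20, 24, or 28 bits */
}

/* Largest workload a field can designate, given the minus-one bias. */
unsigned long max_workload(unsigned comment_bytes)
{
    return 1UL << field_bits(comment_bytes);  /* 64 Ki .. 256 Mi bytes */
}
```

For example, a 5-byte Comment gives two 20-bit fields, so each of input and output can be up to 2^20 = 1 Mi bytes, as stated above.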