
Improving the PackBits filter for multidimensional arrays #517

Open
shz9 opened this issue Mar 22, 2024 · 0 comments


shz9 commented Mar 22, 2024

Hi everyone,

This is not necessarily a bug report about the PackBits codec, though there is some buggy/unexpected behavior that I'd like to discuss towards the end. I'm primarily interested in how the filter can be used for multidimensional arrays and potentially enhanced for this setting. Currently, the filter operates as follows:

  1. Normalize the input to get a bool data type:
arr = ensure_ndarray(buf).view(bool)
  2. Flatten the array:
arr = arr.reshape(-1, order='A')
  3. Determine whether any padding is needed and, if so, store the number of bits to be padded. Then call numpy to pack the flattened array:
n = arr.size
n_bytes_packed = (n // 8)
n_bits_leftover = n % 8
if n_bits_leftover > 0:
    n_bytes_packed += 1

# setup output
enc = np.empty(n_bytes_packed + 1, dtype='u1')

# store how many bits were padded
if n_bits_leftover:
    n_bits_padded = 8 - n_bits_leftover
else:
    n_bits_padded = 0
enc[0] = n_bits_padded

# apply encoding
enc[1:] = np.packbits(arr)

However, when decoding, we no longer have information about the original shape, unless the user passes an out array that is shaped like the original input. This is limiting and makes it hard to deploy the codec generically. So, as an alternative, I modified the implementation to store not only the number of bits to be padded, but also the number of dimensions in the original array and the size of each dimension. The new encoding works as follows:

  1. Normalize the input array.
  2. Store information about its shape / dimensions:
    2.1) Store how many dimensions the array has.
    2.2) If the number of dimensions is greater than 1, then for each dimension:
    2.2.1) Treating the dimension size as a uint64 integer, pack this integer into 8 uint8 integers.
  3. This way, we add at most 2 + 8*ndim bytes to the flattened array, which should be negligible for large multidimensional arrays.
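The header layout described in the steps above can be sketched as follows (a minimal illustration for a hypothetical (3, 4) boolean array; all values here are just for demonstration):

```python
import struct

# Hypothetical header for a (3, 4) boolean array
shape = (3, 4)
ndims = len(shape)
n = 3 * 4
n_bits_padded = (-n) % 8  # bits needed to complete the final byte

# byte 0: padded bits, byte 1: number of dimensions,
# then 8 bytes (one packed uint64) per dimension size
header = bytes([n_bits_padded, ndims]) + b''.join(struct.pack('Q', d) for d in shape)
print(len(header))  # 2 + 8 * ndims = 18
```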

So, if the array is flat, we add only one extra byte over the current codec (two header bytes instead of one). If it is multidimensional, the number of extra bytes is proportional to the number of dimensions. This is how the implementation looks:

import struct

import numcodecs.abc
import numpy as np
from numcodecs.compat import ensure_ndarray, ndarray_copy


class PackBits(numcodecs.abc.Codec):
    """Codec to pack elements of a boolean array into bits in a uint8 array.

    Examples
    --------
    >>> import numcodecs
    >>> import numpy as np
    >>> codec = numcodecs.PackBits()
    >>> x = np.array([True, False, False, True], dtype=bool)
    >>> y = codec.encode(x)
    >>> y
    array([  4,   1, 144], dtype=uint8)
    >>> z = codec.decode(y)
    >>> z
    array([ True, False, False,  True])

    Notes
    -----
    The first element of the encoded array stores the number of bits that
    were padded to complete the final byte. The second element stores the
    number of dimensions of the original array; if it is greater than 1,
    the following 8 bytes per dimension store each dimension size as a
    uint64 integer.

    """

    codec_id = 'packbits'

    def __init__(self):
        pass


    def encode(self, buf):

        # normalise input
        arr = ensure_ndarray(buf).view(bool)
        # ------------------------------------
        # Store array shape:
        # Determine the dimension-related information to store:
        ndims = arr.ndim  # The number of dimensions

        # The size of each dimension will be stored as a uint64 number,
        # corresponding to 8 uint8 integers. If the array is already flat,
        # dimension sizes will not be stored.
        if ndims > 1:
            dim_sizes = [byte for d in arr.shape for byte in struct.pack('Q', d)]
        else:
            dim_sizes = []

        # ------------------------------------

        # Flatten to simplify implementation
        arr = arr.reshape(-1, order='A')

        # determine size of packed data
        n = arr.size
        n_bytes_packed = (n // 8)
        n_bits_leftover = n % 8
        if n_bits_leftover > 0:
            n_bytes_packed += 1

        # Setup output
        enc = np.empty(n_bytes_packed + 2 + len(dim_sizes), dtype='u1')

        # Determine how many bits were padded
        if n_bits_leftover:
            n_bits_padded = 8 - n_bits_leftover
        else:
            n_bits_padded = 0

        # Store how many bits were padded
        enc[0] = n_bits_padded

        # Store how many dimensions:
        enc[1] = ndims

        # Store dimension sizes:
        if len(dim_sizes) > 0:
            enc[2:len(dim_sizes) + 2] = dim_sizes

        # Apply encoding
        enc[2 + len(dim_sizes):] = np.packbits(arr)

        return enc


    def decode(self, buf, out=None):

        # normalise input
        enc = ensure_ndarray(buf).view('u1')

        # flatten to simplify implementation
        enc = enc.reshape(-1, order='A')

        # ----------------------------------
        # Figure out dimension / dimension sizes related information:

        ndims = int(enc[1])
        shapes = []
        # If more than 1D, extract the size of each dimension:
        if ndims > 1:
            for i in range(ndims):
                shapes.append(struct.unpack('Q', bytes(enc[2 + i*8:2 + (i+1)*8]))[0])

        # ----------------------------------

        # Find out how many bits were padded
        n_bits_padded = int(enc[0])

        # Apply decoding
        dec = np.unpackbits(enc[2 + len(shapes)*8:])

        # Remove padded bits
        if n_bits_padded:
            dec = dec[:-n_bits_padded]

        # View as a boolean array and restore the original shape, if stored
        dec = dec.view(bool)
        if shapes:
            dec = dec.reshape(shapes)

        # Handle destination
        return ndarray_copy(dec, out)

Possible improvements over this implementation:

  • Instead of storing dimension sizes as uint64, we can use uint32. With this, the number of bytes we'll need to add is 2 + 4*ndim. I went with uint64 to be on the safe side and accommodate huge arrays, though we're unlikely to have those in practice (since this filter will apply on chunks of larger arrays anyway).
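The space savings from switching to uint32 are easy to quantify; this is just a sketch comparing header sizes for an arbitrary example shape:

```python
import struct

shape = (1024, 2048, 512)

# uint64 per dimension: 8 bytes each
hdr64 = b''.join(struct.pack('Q', d) for d in shape)

# uint32 per dimension: 4 bytes each, still enough for any chunk
# with fewer than 2**32 elements along one axis
hdr32 = b''.join(struct.pack('I', d) for d in shape)

print(len(hdr64), len(hdr32))  # 24 12
```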
  • One thing that would be nice to have is to allow the user to determine the order of the flattening, as this can have some consequences for the packing or compression quality. We could potentially take flatten_order as an input when initializing the Codec and use this in the encode method.
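To show why flattening order matters for the packed output (and hence potentially for compression), here is a small example with a deliberately asymmetric array:

```python
import numpy as np

x = np.array([[True, True],
              [False, False]])

# Row-major (C) vs column-major (F) flattening yields different
# bit patterns, which can matter for downstream compression
c_packed = np.packbits(x.reshape(-1, order='C'))
f_packed = np.packbits(x.reshape(-1, order='F'))

print(c_packed)  # [192]  -> bits 1100 0000
print(f_packed)  # [160]  -> bits 1010 0000
```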
  • When experimenting with this implementation, I noticed some unexpected behavior due to the normalization step, in particular with the .view(bool) call. I understand that the Codec expects bool as inputs, but if I pass it anything else, for example, an array of 0/1 with dtype set to int16 or int32, then calling .view(bool) on it will distort its shape. This behavior is noted in the .view() documentation for numpy:

For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the last axis of a must be contiguous.

So, what happened a few times is that I generated a random array with np.random.choice(2, size=(10, 20)) and, when passing it to the codec without converting to bool first, got unexpectedly shaped arrays. In this case, I'd say we should either raise an error when the input data type is not bool (or does not occupy one byte per element, as bool does), or alternatively convert to bool with a copy.
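The shape distortion is easy to reproduce deterministically; here is a minimal sketch, with np.ascontiguousarray shown as one possible copy-based alternative:

```python
import numpy as np

# An int16 array of 0/1 values, shaped (2, 3)
x = (np.arange(6, dtype='int16') % 2).reshape(2, 3)

# .view(bool) reinterprets the raw bytes: each 2-byte int16 becomes
# two 1-byte bools, so the last axis doubles in length
v = x.view(bool)
print(v.shape)  # (2, 6)

# Converting with a copy preserves the shape instead
b = np.ascontiguousarray(x, dtype=bool)
print(b.shape)  # (2, 3)
```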

Version information

Python 3.11.5
numcodecs 0.12.1