
Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel #1561

Open
wants to merge 4 commits into main

Conversation

@fcharras (Contributor) commented Apr 4, 2024

Fixes #1545

Still todo:

@fcharras (Contributor, Author) commented Apr 5, 2024

Regarding the second and third points, this is in fact not an issue, because _ensure_native_byte_order is only used in joblib.Parallel calls when auto-memmapping is triggered. The byte order of small arrays has always been preserved from the main process to the child processes. I will add a unit test that shows this in a separate PR.
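A minimal sketch of what such a test could look like (hypothetical test and helper names; the real test will come in the separate PR):

import numpy as np
from joblib import Parallel, delayed


def identity(a):
    return a


def test_byteorder_preserved_for_small_arrays():
    # Small arrays stay below the auto-memmapping threshold, so they go
    # through the regular pickling path and keep their byte order.
    x = np.arange(10, dtype=">f8")  # big-endian, well under max_nbytes
    [x_returned] = Parallel(n_jobs=2)([delayed(identity)(x)])
    assert x_returned.dtype.byteorder == ">"
    np.testing.assert_array_equal(x, x_returned)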

@fcharras self-assigned this Apr 5, 2024
@fcharras (Contributor, Author) commented Apr 5, 2024

@alexisthual do you want to try this branch on your use case and confirm that it fixes it?

@ogrisel (Contributor) left a comment

Thanks for the PR. Here is some feedback. The big picture is that we should never pretend in the code that it is possible to swap the endianness of memory-mapped arrays: the result would be a newly allocated array in memory, no longer backed by a memory-mapped buffer managed by the OS.
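As a standalone illustration (not joblib code; the file name is made up), converting a memory-mapped array to native byte order necessarily copies the data out of the OS-managed mapping:

import os
import tempfile

import numpy as np

# Write a small big-endian array to disk and memory-map it read-only.
path = os.path.join(tempfile.mkdtemp(), "big_endian.bin")
np.arange(4, dtype=">f8").tofile(path)
m = np.memmap(path, dtype=">f8", mode="r", shape=(4,))

# Converting to native byte order allocates a fresh in-memory array: the
# result no longer shares memory with the file-backed buffer.
native = m.astype(m.dtype.newbyteorder("="), subok=False)
assert not isinstance(native, np.memmap)
assert not np.shares_memory(native, m)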

# TODO: should we issue a warning here? The array gets copied in the
# process, which is precisely what memmap users don't want.
marray = _ensure_native_byte_order(marray)

I think there is no point in changing read_mmap to ever call _ensure_native_byte_order: it is not possible to swap the byte ordering without making a memory copy, and the whole point of loading a pickled array with memory mapping is to not allocate new memory but to memory-map the on-disk buffer.


-    def read(self, unpickler):
+    def read(self, unpickler, ensure_native_byte_order=True):

I would set it to False by default in the private API, so as not to imply that this operation is always possible.

Suggested change
-    def read(self, unpickler, ensure_native_byte_order=True):
+    def read(self, unpickler, ensure_native_byte_order=False):

@@ -247,9 +256,9 @@ def read(self, unpickler):
         """
         # When requested, only use memmap mode if allowed.
         if unpickler.mmap_mode is not None and self.allow_mmap:
-            array = self.read_mmap(unpickler)
+            array = self.read_mmap(unpickler, ensure_native_byte_order)

We should never call read_mmap unless ensure_native_byte_order is False.

We could therefore change the code as follows:

Suggested change
-            array = self.read_mmap(unpickler, ensure_native_byte_order)
+            if ensure_native_byte_order:
+                raise ValueError(
+                    "It is not possible to ensure native byte ordering with "
+                    f"mmap_mode={unpickler.mmap_mode}."
+                )
+            array = self.read_mmap(unpickler)

However, I have the feeling that this exception should never be reachable, because the public API should never allow that combination of options.

assert byteorder_in_worker == x_returned.dtype.byteorder
np.testing.assert_array_equal(x, x_returned)



You should also add a test for the following case (see the sketch after this list):

joblib.load(filename, mmap_mode="r") where filename points to a joblib pickle file of a big-endian numpy array, then check that the result is:

  • a real memory-mapped array (an instance of numpy.memmap with a valid filename);
  • and that its byte order is preserved (still big endian).
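A sketch of the requested test (assuming pytest's tmp_path fixture and an uncompressed joblib.dump; names are hypothetical):

import joblib
import numpy as np


def test_memmap_load_preserves_big_endian(tmp_path):
    filename = tmp_path / "big_endian.pkl"
    x = np.arange(10, dtype=">f8")
    joblib.dump(x, filename)

    x_mmap = joblib.load(filename, mmap_mode="r")

    # A real memory-mapped array backed by the file on disk...
    assert isinstance(x_mmap, np.memmap)
    assert x_mmap.filename is not None
    # ...whose byte order was not silently altered (still big endian).
    assert x_mmap.dtype.byteorder == ">"
    np.testing.assert_array_equal(x, x_mmap)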

@@ -563,15 +580,19 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
     return [filename]


-def _unpickle(fobj, filename="", mmap_mode=None):
+def _unpickle(fobj, filename="", mmap_mode=None,
+              ensure_native_byte_order=True):

Suggested change
-              ensure_native_byte_order=True):
+              ensure_native_byte_order="auto"):

@@ -563,15 +580,19 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
     return [filename]


-def _unpickle(fobj, filename="", mmap_mode=None):
+def _unpickle(fobj, filename="", mmap_mode=None,
+              ensure_native_byte_order=True):
     """Internal unpickling function."""
     # We are careful to open the file handle early and keep it open to
     # avoid race-conditions on renames.
     # That said, if data is stored in companion files, which can be
     # the case with the old persistence format, moving the directory
     # will create a race when joblib tries to access the companion
     # files.

Suggested change
-    # files.
+    # files.
+    if ensure_native_byte_order == "auto":
+        ensure_native_byte_order = mmap_mode is None
+    if ensure_native_byte_order and mmap_mode is not None:
+        raise ValueError(
+            "It is not possible to ensure native byte ordering with "
+            f"mmap_mode={mmap_mode}."
+        )

(Same remark about the reachability of this exception from the public API, but I have the feeling that it is saner to make it explicit in the code that the case where ensure_native_byte_order is True and mmap_mode is not None is physically not achievable in general.)
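To make the intended semantics concrete, here is a self-contained sketch of the resolution logic (hypothetical helper name, mirroring the suggestion above):

def _resolve_ensure_native_byte_order(ensure_native_byte_order, mmap_mode):
    # "auto" means: normalize the byte order only for in-memory loads;
    # memory-mapped loads must preserve the on-disk byte order.
    if ensure_native_byte_order == "auto":
        return mmap_mode is None
    if ensure_native_byte_order and mmap_mode is not None:
        raise ValueError(
            "It is not possible to ensure native byte ordering with "
            f"mmap_mode={mmap_mode}."
        )
    return ensure_native_byte_order


assert _resolve_ensure_native_byte_order("auto", None) is True
assert _resolve_ensure_native_byte_order("auto", "r") is False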

@ogrisel (Contributor) commented Apr 5, 2024

Also, we should not forget to document the fix in the changelog.


Successfully merging this pull request may close these issues.

Large numpy arrays stored in big-endian format cannot be serialized, leading to errors with Parallel