Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel
#1561
base: main
Conversation
Force-pushed from 8e1deb4 to 395e9de.
Regarding the second and third points, it's in fact not an issue because …
Force-pushed from 395e9de to b01f2b7.
@alexisthual do you want to try this branch on your use case and confirm that it fixes it?
Thanks for the PR. Here is some feedback. The big picture is that we should never pretend in the code that it is possible to swap the endianness of memory-mapped arrays: the result would be a newly allocated array in memory, no longer backed by a memory-mapped buffer managed by the OS.
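The copy-on-byteswap behavior described above is easy to demonstrate with plain numpy (the file path and dtype below are purely illustrative): converting a memory-mapped array to native byte order necessarily materializes a new in-memory ndarray.

```python
import os
import tempfile

import numpy as np

# Write a small big-endian array to disk and memory-map it back.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(10, dtype=">i4").tofile(path)
marray = np.memmap(path, dtype=">i4", mode="r")  # zero-copy view of the file

# "Ensuring" native byte order forces a conversion: the result is a plain
# ndarray allocated in memory, no longer backed by the mapped file.
native = np.asarray(marray, dtype=marray.dtype.newbyteorder("="))

print(isinstance(marray, np.memmap))  # True
print(isinstance(native, np.memmap))  # False: the memmap backing is lost
```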
# TODO: should issue a warning here ?
# because the array gets copied in the process.
# which is precisely what memmap users doesn't want.
marray = _ensure_native_byte_order(marray)
I think there is no point in changing read_mmap to ever call _ensure_native_byte_order: it's not possible to swap the byte ordering without making a memory copy, and the point of loading a pickled array with memory mapping is to not allocate new memory but to memory-map the buffer.
-    def read(self, unpickler):
+    def read(self, unpickler, ensure_native_byte_order=True):
I would set it to False by default in the private API, so as not to imply that it is always possible to do that operation.

Suggested change:
-    def read(self, unpickler, ensure_native_byte_order=True):
+    def read(self, unpickler, ensure_native_byte_order=False):
@@ -247,9 +256,9 @@ def read(self, unpickler):
         """
         # When requested, only use memmap mode if allowed.
         if unpickler.mmap_mode is not None and self.allow_mmap:
-            array = self.read_mmap(unpickler)
+            array = self.read_mmap(unpickler, ensure_native_byte_order)
We should never call read_mmap when ensure_native_byte_order is not False. We could therefore change the code as follows:
Suggested change:
-    array = self.read_mmap(unpickler, ensure_native_byte_order)
+    if ensure_native_byte_order:
+        raise ValueError(
+            "It is not possible to ensure native byte ordering with"
+            f" mmap_mode={unpickler.mmap_mode}."
+        )
+    array = self.read_mmap(unpickler)

However, I have the feeling that this exception should never be reachable, because the public API should never allow that combination of options.
assert byteorder_in_worker == x_returned.dtype.byteorder
np.testing.assert_array_equal(x, x_returned)
You should also add a test for the following case:

joblib.load(filename, mmap_mode="r")

where filename points to a joblib pickle file of a big-endian numpy array, then check that the result is:
- a real memory-mapped array (an instance of numpy.memmap with a valid filename);
- an array whose byte order is preserved (still big endian).
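As a rough sketch of the requested assertions: since the joblib behavior is exactly what this PR is still adding, the snippet below demonstrates them against numpy's own .npy memory mapping, which already preserves on-disk byte order. The joblib version of the test would replace np.save / np.load with joblib.dump and joblib.load(..., mmap_mode="r").

```python
import os
import tempfile

import numpy as np

# Persist an explicitly big-endian array, then load it memory-mapped.
path = os.path.join(tempfile.mkdtemp(), "big_endian.npy")
x = np.arange(100, dtype=">i8")
np.save(path, x)

x_loaded = np.load(path, mmap_mode="r")

assert isinstance(x_loaded, np.memmap)   # a real memory-mapped array...
assert x_loaded.filename is not None     # ...backed by a file on disk,
assert x_loaded.dtype.byteorder == ">"   # big-endian order preserved
                                         # (on a little-endian host; a
                                         # native-order dtype reports "=")
np.testing.assert_array_equal(x, x_loaded)
```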
@@ -563,15 +580,19 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
     return [filename]


-def _unpickle(fobj, filename="", mmap_mode=None):
+def _unpickle(fobj, filename="", mmap_mode=None,
+              ensure_native_byte_order=True):
Suggested change:
-              ensure_native_byte_order=True):
+              ensure_native_byte_order="auto"):
@@ -563,15 +580,19 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
     return [filename]


-def _unpickle(fobj, filename="", mmap_mode=None):
+def _unpickle(fobj, filename="", mmap_mode=None,
+              ensure_native_byte_order=True):
     """Internal unpickling function."""
     # We are careful to open the file handle early and keep it open to
     # avoid race-conditions on renames.
     # That said, if data is stored in companion files, which can be
     # the case with the old persistence format, moving the directory
     # will create a race when joblib tries to access the companion
     # files.
Suggested change:
     # files.
+    if ensure_native_byte_order == "auto":
+        ensure_native_byte_order = mmap_mode is None
+    if ensure_native_byte_order and mmap_mode is not None:
+        raise ValueError(
+            "It is not possible to ensure native byte ordering with"
+            f" mmap_mode={mmap_mode}."
+        )
(Same remark about the reachability of this exception from the public API. But I have the feeling that it is saner to make it explicit in the code that the case ensure_native_byte_order and mmap_mode is not None is physically not achievable in general.)

Also, we should not forget to document the fix in the changelog.
Fixes #1545

Still to do:
- discuss where exactly we should enable/disable this feature; to what extent should the issues in "dtype.byteorder is not consistent during cross platform joblib.load()" (#1123) also be considered for unserialization of auto-memmapped arrays in joblib.Parallel?
- discuss the potential breaking change?