
Disable endianness alteration on unserialization of numpy arrays in joblib.Parallel #1561

Open
wants to merge 4 commits into main

Conversation

@fcharras (Contributor) commented Apr 4, 2024

Fixes #1545

Still todo:

@fcharras (Contributor, Author) commented Apr 5, 2024

Regarding the second and third points, this is in fact not an issue, because _ensure_native_byte_order is only used in joblib.Parallel calls when auto-memmapping is triggered. The byte order of small arrays has always been preserved from the main process to the child processes. I will add a unit test that shows this in a separate PR.
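A minimal sketch of what such a test could look like (hypothetical test and helper names; the real test will come in the separate PR):

import numpy as np
from joblib import Parallel, delayed


def identity(a):
    return a


def test_byteorder_preserved_for_small_arrays():
    # Small arrays stay below the auto-memmapping threshold, so they go
    # through the regular pickling path and keep their byte order.
    x = np.arange(10, dtype=">f8")  # big-endian, well under max_nbytes
    [x_returned] = Parallel(n_jobs=2)([delayed(identity)(x)])
    assert x_returned.dtype.byteorder == ">"
    np.testing.assert_array_equal(x, x_returned)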

@fcharras self-assigned this Apr 5, 2024
@fcharras (Contributor, Author) commented Apr 5, 2024

@alexisthual do you want to try this branch on your use case and confirm that it fixes it?

@ogrisel (Contributor) left a comment

Thanks for the PR. Here is some feedback. The big picture is that we should never pretend in the code that it is possible to swap the endianness of memory-mapped arrays: the result would be a newly allocated array in memory, no longer backed by a memory-mapped buffer managed by the OS.
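As a standalone illustration (not joblib code; the file name is made up), converting a memory-mapped array to native byte order necessarily copies the data out of the OS-managed mapping:

import os
import tempfile

import numpy as np

# Write a small big-endian array to disk and memory-map it read-only.
path = os.path.join(tempfile.mkdtemp(), "big_endian.bin")
np.arange(4, dtype=">f8").tofile(path)
m = np.memmap(path, dtype=">f8", mode="r", shape=(4,))

# Converting to native byte order allocates a fresh in-memory array: the
# result no longer shares memory with the file-backed buffer.
native = m.astype(m.dtype.newbyteorder("="), subok=False)
assert not isinstance(native, np.memmap)
assert not np.shares_memory(native, m)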

# TODO: should we issue a warning here? The array gets copied in the
# process, which is precisely what memmap users don't want.
marray = _ensure_native_byte_order(marray)

I think there is no point in changing read_mmap to ever call _ensure_native_byte_order: it is not possible to swap the byte ordering without making a memory copy, and the whole point of loading a pickled array with memory mapping is to not allocate new memory but to memory-map the on-disk buffer.


-    def read(self, unpickler):
+    def read(self, unpickler, ensure_native_byte_order=True):

I would set it to False by default in the private API, so as not to imply that this operation is always possible.

Suggested change
-    def read(self, unpickler, ensure_native_byte_order=True):
+    def read(self, unpickler, ensure_native_byte_order=False):

@@ -247,9 +256,9 @@ def read(self, unpickler):
         """
         # When requested, only use memmap mode if allowed.
         if unpickler.mmap_mode is not None and self.allow_mmap:
-            array = self.read_mmap(unpickler)
+            array = self.read_mmap(unpickler, ensure_native_byte_order)

We should never call read_mmap unless ensure_native_byte_order is False.

We could therefore change the code as follows:

Suggested change
-            array = self.read_mmap(unpickler, ensure_native_byte_order)
+            if ensure_native_byte_order:
+                raise ValueError(
+                    "It is not possible to ensure native byte ordering with "
+                    f"mmap_mode={unpickler.mmap_mode}."
+                )
+            array = self.read_mmap(unpickler)

However, I have the feeling that this exception should never be reachable, because the public API should never allow that combination of options.

assert byteorder_in_worker == x_returned.dtype.byteorder
np.testing.assert_array_equal(x, x_returned)



You should also add a test for the following case (see the sketch after this list):

joblib.load(filename, mmap_mode="r") where filename points to a joblib pickle file of a big-endian numpy array, then check that the result is:

  • a real memory-mapped array (an instance of numpy.memmap with a valid filename);
  • and that its byte order is preserved (still big endian).
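A sketch of the requested test (assuming pytest's tmp_path fixture and an uncompressed joblib.dump; names are hypothetical):

import joblib
import numpy as np


def test_memmap_load_preserves_big_endian(tmp_path):
    filename = tmp_path / "big_endian.pkl"
    x = np.arange(10, dtype=">f8")
    joblib.dump(x, filename)

    x_mmap = joblib.load(filename, mmap_mode="r")

    # A real memory-mapped array backed by the file on disk...
    assert isinstance(x_mmap, np.memmap)
    assert x_mmap.filename is not None
    # ...whose byte order was not silently altered (still big endian).
    assert x_mmap.dtype.byteorder == ">"
    np.testing.assert_array_equal(x, x_mmap)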

@@ -563,15 +580,19 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
     return [filename]


-def _unpickle(fobj, filename="", mmap_mode=None):
+def _unpickle(fobj, filename="", mmap_mode=None,
+              ensure_native_byte_order=True):

Suggested change
-              ensure_native_byte_order=True):
+              ensure_native_byte_order="auto"):

@@ -563,15 +580,19 @@ def dump(value, filename, compress=0, protocol=None, cache_size=None):
     return [filename]


-def _unpickle(fobj, filename="", mmap_mode=None):
+def _unpickle(fobj, filename="", mmap_mode=None,
+              ensure_native_byte_order=True):
     """Internal unpickling function."""
     # We are careful to open the file handle early and keep it open to
     # avoid race-conditions on renames.
     # That said, if data is stored in companion files, which can be
     # the case with the old persistence format, moving the directory
     # will create a race when joblib tries to access the companion
     # files.

Suggested change
-    # files.
+    # files.
+    if ensure_native_byte_order == "auto":
+        ensure_native_byte_order = mmap_mode is None
+    if ensure_native_byte_order and mmap_mode is not None:
+        raise ValueError(
+            "It is not possible to ensure native byte ordering with "
+            f"mmap_mode={mmap_mode}."
+        )

(Same remark about the reachability of this exception from the public API, but I have the feeling that it is saner to make it explicit in the code that the case where ensure_native_byte_order is True and mmap_mode is not None is physically not achievable in general.)
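To make the intended semantics concrete, here is a self-contained sketch of the resolution logic (hypothetical helper name, mirroring the suggestion above):

def _resolve_ensure_native_byte_order(ensure_native_byte_order, mmap_mode):
    # "auto" means: normalize the byte order only for in-memory loads;
    # memory-mapped loads must preserve the on-disk byte order.
    if ensure_native_byte_order == "auto":
        return mmap_mode is None
    if ensure_native_byte_order and mmap_mode is not None:
        raise ValueError(
            "It is not possible to ensure native byte ordering with "
            f"mmap_mode={mmap_mode}."
        )
    return ensure_native_byte_order


assert _resolve_ensure_native_byte_order("auto", None) is True
assert _resolve_ensure_native_byte_order("auto", "r") is False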

@ogrisel (Contributor) commented Apr 5, 2024

Also, we should not forget to document the fix in the changelog.


Successfully merging this pull request may close these issues.

Large numpy arrays stored in big-endian format cannot be serialized, leading to errors with Parallel