
Make sure arrays are bytes aligned in joblib pickles #1254

Merged
merged 31 commits into master from memmap-align on Feb 25, 2022

Conversation

@lesteve (Member) commented Feb 3, 2022

fix #563

alternative to #570

@codecov (bot) commented Feb 3, 2022

Codecov Report

Merging #1254 (188c089) into master (3d80506) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1254      +/-   ##
==========================================
+ Coverage   93.81%   93.83%   +0.01%     
==========================================
  Files          50       50              
  Lines        7181     7267      +86     
==========================================
+ Hits         6737     6819      +82     
- Misses        444      448       +4     
Impacted Files                     Coverage Δ
joblib/numpy_pickle.py             99.14% <100.00%> (+0.15%) ⬆️
joblib/test/test_numpy_pickle.py   94.10% <100.00%> (+0.47%) ⬆️
joblib/_parallel_backends.py       92.27% <0.00%> (-0.74%) ⬇️
joblib/parallel.py                 96.02% <0.00%> (-0.54%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ogrisel (Contributor) left a comment

First comment; I haven't actually reviewed the code change in detail.

@@ -95,6 +97,17 @@ def write_array(self, array, pickler):
             # pickle protocol.
             pickle.dump(array, pickler.file_handle, protocol=2)
         else:
+            try:
+                current_pos = pickler.file_handle.tell()
+                alignment = current_pos % 8
Review comment from a Contributor on this line:

The numpy documentation mentions that some dtypes expect 16-byte alignment (e.g. float128).

Also, since SIMD-optimized compute kernels run more efficiently (fully vectorized) when buffers are aligned to their vector instruction size, maybe we should go directly for 64-byte alignment (e.g. for AVX-512, which is currently the widest vector instruction set).

In the ARM ecosystem there are also 512-bit-wide vector instructions, e.g.:

https://www.fujitsu.com/global/products/computing/servers/supercomputer/a64fx/

But from what I read about SVE2, the vector size can vary dynamically in 128-bit (16-byte) increments.

So I have the feeling that 16-byte alignment is a necessity for safety (to avoid crashes), while 64-byte (512-bit) alignment can additionally help vectorized compute kernels run more efficiently on such memmapped buffers. Going beyond that is probably useless.
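For illustration only (not part of the PR): a minimal sketch of how to check the in-memory alignment of a numpy array's data buffer against a chosen boundary, which is what native SIMD kernels ultimately care about.

import numpy as np

def buffer_misalignment(arr, boundary):
    """Return the array's data pointer modulo `boundary` (0 means aligned)."""
    return arr.ctypes.data % boundary

a = np.zeros(1024, dtype=np.float64)
print(buffer_misalignment(a, 16), buffer_misalignment(a, 64))

For a memory-mapped file, the mapping itself starts on a page boundary and page sizes are multiples of 64 bytes, so aligning the array's byte offset within the file to 8/16/64 bytes yields the same alignment in memory.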

@lesteve lesteve changed the title from "Make sure arrays are 8 bytes aligned in joblib pickled" to "Make sure arrays are 8 bytes aligned in joblib pickles" on Feb 3, 2022
@lesteve lesteve changed the title from "Make sure arrays are 8 bytes aligned in joblib pickles" to "Make sure arrays are bytes aligned in joblib pickles" on Feb 3, 2022
@ogrisel (Contributor) left a comment

Thanks for the PR. However, I don't like the implicit padding logic being implemented redundantly at write and read time.

Instead, I think we should adopt an explicit single-byte code to store the effective padding size at write time (the try / except io.UnsupportedOperation boilerplate is omitted for readability's sake):

current_pos = pickler.file_handle.tell()
# The +1 accounts for the single padding byte code written just below.
alignment = (current_pos + 1) % NUMPY_ARRAY_ALIGNMENT_BYTES
padding_size = NUMPY_ARRAY_ALIGNMENT_BYTES - alignment
padding_bytecode = chr(padding_size).encode('ascii')
assert len(padding_bytecode) == 1  # should always hold
pickler.file_handle.write(padding_bytecode)  # always written
if padding_size != 0:
    padding = b'\x00' * padding_size
    pickler.file_handle.write(padding)

then at read time, again without the try / except io.UnsupportedOperation boilerplate:

padding_bytecode = unpickler.file_handle.read(1)
padding_size = ord(padding_bytecode.decode('ascii'))
offset += padding_size  # account for the padding when mapping the array data

Disclaimer: untested code snippet.

WDYT @lesteve?

EDIT: I think I had made a mistake in a first version of this code.
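For reference, a minimal, self-contained sketch of this write/read round-trip, using io.BytesIO to stand in for the pickle file handle; NUMPY_ARRAY_ALIGNMENT_BYTES is assumed to be 8 here and the helper names are illustrative, not joblib's API.

import io

NUMPY_ARRAY_ALIGNMENT_BYTES = 8  # assumed value for this sketch

def write_padded(fh, payload):
    """Write a 1-byte padding code, then the padding, then the payload."""
    current_pos = fh.tell()
    # +1 because the padding byte code itself is written before the padding.
    alignment = (current_pos + 1) % NUMPY_ARRAY_ALIGNMENT_BYTES
    padding_size = NUMPY_ARRAY_ALIGNMENT_BYTES - alignment
    fh.write(chr(padding_size).encode('ascii'))  # the padding byte code
    fh.write(b'\x00' * padding_size)             # the padding itself
    fh.write(payload)                            # payload now starts on an aligned offset

def read_padded(fh, n):
    """Skip the padding announced by the 1-byte code, then read n payload bytes."""
    padding_size = ord(fh.read(1).decode('ascii'))
    fh.seek(padding_size, io.SEEK_CUR)  # skip over the padding
    return fh.read(n)

buf = io.BytesIO()
buf.write(b'header')             # simulate whatever precedes the array in the file
write_padded(buf, b'array-bytes')

buf.seek(len(b'header'))
assert read_padded(buf, len(b'array-bytes')) == b'array-bytes'

Note that with this formula a position that is already aligned after the byte code still receives a full NUMPY_ARRAY_ALIGNMENT_BYTES of padding; that wastes a few bytes but keeps the data aligned.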

@lesteve (Member, Author) commented Feb 4, 2022

Instead I think we should adopt an explicit prefix-code of the padding size at write time (without the try / except io.UnsupportedOperation boilerplate for readability's sake).

That does seem cleaner indeed. The only problem I see is that for old pickles (i.e. written without the change in this PR), there is no prefix code (the array bytes start right away), so you are going to interpret the first array byte as the prefix code. Somehow there needs to be some kind of "sentinel" byte to indicate whether you have a new or an old pickle file.

@ogrisel (Contributor) commented Feb 4, 2022

That does seem cleaner indeed. The only problem I see is that for old pickles (i.e. written without the change in this PR), there is no prefix code (the array bytes start right away), so you are going to interpret the first array byte as the prefix code.

This is a valid concern indeed. Good catch.

Somehow there needs to be some kind of "sentinel" byte to indicate whether you have a new or old pickle file.

Yes, we need to put some joblib-dump format versioning info somewhere at the beginning of the pickle file. I don't think we can do that without breaking the pickle format, but since we are already breaking it, maybe this is not a problem.

Edit: Actually we already do this kind of hack in _detect_compressor. So we could extend this and store an integer version number of the joblib format.

Each time we change the format, we could increase that number and explicitly maintain code paths for backward compatibility with previous versions.
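A hedged sketch of how such a version marker could be detected, in the spirit of the existing _detect_compressor magic-prefix check; the JOBLIB_FORMAT_PREFIX constant, version value, and helper names below are hypothetical, not joblib's actual format.

import io

# Hypothetical marker: a short magic prefix followed by a one-byte integer
# format version, written before the pickled payload.
JOBLIB_FORMAT_PREFIX = b'\x93JBL'
CURRENT_FORMAT_VERSION = 1

def write_format_header(fh):
    fh.write(JOBLIB_FORMAT_PREFIX + bytes([CURRENT_FORMAT_VERSION]))

def detect_format_version(fh):
    """Return the format version, or 0 for files written before the marker existed."""
    start = fh.tell()
    prefix = fh.read(len(JOBLIB_FORMAT_PREFIX))
    if prefix == JOBLIB_FORMAT_PREFIX:
        return fh.read(1)[0]
    # Legacy file: no marker; rewind so the old code path can read it unchanged.
    fh.seek(start)
    return 0

buf = io.BytesIO()
write_format_header(buf)
buf.seek(0)
assert detect_format_version(buf) == 1
assert detect_format_version(io.BytesIO(b'old pickle bytes')) == 0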

joblib/numpy_pickle.py Outdated Show resolved Hide resolved
@lesteve (Member, Author) commented Feb 17, 2022

I think this is ready enough for a more thorough review, removing the draft status.

@lesteve lesteve deleted the memmap-align branch July 25, 2022 13:45
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Nov 22, 2022
Release 1.2.0

- Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327
- Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide. joblib/joblib#1256
- Avoid unnecessary warnings when workers and the main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263
- Fix a memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None, as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that directly accesses the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform-specific assembly. joblib/joblib#1254
- Vendor cloudpickle 2.2.0, which adds support for PyPy 3.8+.
- Vendor loky 3.3.0, which fixes several bugs including:
  - robustly and forcibly terminating worker processes in case of a crash (joblib/joblib#1269);
  - avoiding leaking worker processes in case of nested loky parallel calls;
  - reliably spawning the correct number of reusable workers.
Successfully merging this pull request may close these issues:

Alignment issue when passing arrays through mmap

5 participants