
Improve performance for PyByteArray::new #4058

Open · ZHUI opened this issue Apr 8, 2024 · 5 comments

ZHUI commented Apr 8, 2024

Hello, I recently found that loading with safetensors is more than 50% slower than loading with pickle.

huggingface/safetensors#460

For a 1 GB NumPy array, loading with pickle takes only about 1 s, but safetensors needs 1.8 s:

safetensors load: 1.8608193397521973
pickle      load: 1.004188060760498

I then profiled it and found that the slowest part is the call to PyByteArray::new, which consumes almost all of the time!
https://github.com/ZHUI/safetensors/tree/performance/load_speed
ZHUI/safetensors@889c432

    let data = &mmap[info.data_offsets.0 + self.offset..info.data_offsets.1 + self.offset];

    error!("get_tensor mmap done");

    let array: PyObject = Python::with_gil(|py| PyByteArray::new(py, data).into_py(py));

    error!("get_tensor PyByteArray");
[2024-04-08T03:40:56.506Z ERROR safetensors_rust] Open new statrt
load weight 1712547656.5070753
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] 710 get_tensor
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] get_tensor
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] get_tensor mmap done
[2024-04-08T03:40:58.375Z ERROR safetensors_rust] get_tensor PyByteArray
[2024-04-08T03:40:58.375Z ERROR safetensors_rust] create_tensor done!
sf load: 1.9328322410583496

For PyByteArray::new(py, data), data is an address inside the mmap, and PyByteArray::new simply calls PyByteArray_FromStringAndSize:
https://github.com/python/cpython/blob/733e56ef9656dd79055acc2a3cecaf6054a45b6c/Objects/bytearrayobject.c#L134-L142
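
For reference, PyO3's PyByteArray::new is essentially a thin wrapper over that FFI call; a paraphrased sketch (not the exact PyO3 source):

    use pyo3::ffi;
    use std::os::raw::c_char;

    // Paraphrased sketch: CPython allocates a fresh bytearray and memcpys the
    // slice into it, so constructing the object always costs one full copy.
    unsafe fn bytearray_from_slice(src: &[u8]) -> *mut ffi::PyObject {
        ffi::PyByteArray_FromStringAndSize(src.as_ptr() as *const c_char, src.len() as ffi::Py_ssize_t)
    }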

It seems to just memcpy the mmap'd data into newly allocated memory, so the copy is disk -> mem.

I don't understand why it is so slow. Is there an additional mem -> mem copy for this operation, making it disk -> mem -> mem?

Is there any way to create a PyByteArray from existing memory without copying the data?

@adamreichold (Member)

Is there any way to create a PyByteArray from existing memory without copying the data?

I don't think that is possible, especially with PyByteArray, which provides mutable access to the underlying data. Do you need mutability? If not, PyBytes might be a better choice, but even that type will copy in order to move ownership of the data under the control of the Python heap.

I think a real zero-copy solution would be to implement a custom pyclass which implements the buffer protocol and is directly backed by the Mmap instance you already have. We recently discussed this in #4003 and we use something similar in rust-numpy, albeit as a NumPy base object, without the buffer protocol but used to perform type erasure.
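
A minimal sketch of such a pyclass, assuming the memmap2 crate (MmapBuffer is a hypothetical name), might look like:

    use memmap2::Mmap;
    use pyo3::prelude::*;

    // Hypothetical wrapper: the pyclass owns the memory map, so the map's
    // lifetime is tied to the Python object's reference count. The buffer
    // protocol (sketched further down in this thread) can then expose the
    // mapped bytes to Python without copying them.
    #[pyclass]
    struct MmapBuffer {
        mmap: Mmap,
    }

    #[pymethods]
    impl MmapBuffer {
        #[new]
        fn new(path: &str) -> PyResult<Self> {
            let file = std::fs::File::open(path)?;
            // Safety: the file must not be truncated while the map is alive.
            let mmap = unsafe { Mmap::map(&file)? };
            Ok(Self { mmap })
        }
    }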


ZHUI commented Apr 8, 2024

@adamreichold
Thanks for your answer. I am not very familiar with Rust.

even that type will copy to move ownership of the data under the control of the Python heap.

Does it mean the data will always be copied when we pass data from Rust to Python?

I think a real zero-copy solution would be to implement a custom pyclass which implements the buffer protocol and is directly backed by the Mmap instance you already have. We recently discussed this in #4003 and we use something similar in rust-numpy, albeit as a NumPy base object, without the buffer protocol but used to perform type erasure.

https://github.com/PyO3/rust-numpy/blob/0832b28bf84a3f6053ca647a43625d715831db00/src/array.rs#L402-L421

I checked the PySliceContainer code. Is it passing the Rust data_ptr into the new array and then using PY_ARRAY_API.PyArray_SetBaseObject to tie the container's lifetime to the array, thereby avoiding a memcpy?

        let mut dims = dims.into_dimension();
        let ptr = PY_ARRAY_API.PyArray_NewFromDescr(
            py,
            PY_ARRAY_API.get_type_object(py, npyffi::NpyTypes::PyArray_Type),
            T::get_dtype_bound(py).into_dtype_ptr(),
            dims.ndim_cint(),
            dims.as_dims_ptr(),
            strides as *mut npy_intp,    // strides
            data_ptr as *mut c_void,     // data
            npyffi::NPY_ARRAY_WRITEABLE, // flag
            ptr::null_mut(),             // obj
        );

        PY_ARRAY_API.PyArray_SetBaseObject(
            py,
            ptr as *mut npyffi::PyArrayObject,
            container as *mut ffi::PyObject,
        );

        Bound::from_owned_ptr(py, ptr).downcast_into_unchecked()

Can I use your PySliceContainer directly? Or can you give me some pseudo-code to implement it?

@adamreichold (Member)

Does it mean the data will always be copied when we pass data from Rust to Python?

Basically, yes, because someone has to free the data and must know how to do that. In our case, the main task of PySliceContainer is to call the correct drop functions when Python's reference count reaches zero, because Python types like bytes do not know how to free Rust-allocated memory.
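
As an illustration of that bookkeeping, any #[pyclass] already gives Rust data a Python-managed lifetime (names here are hypothetical):

    use pyo3::prelude::*;

    // When the Python reference count of a RustOwned instance reaches zero,
    // PyO3 drops the struct and Vec's Drop frees the allocation -- the same
    // job PySliceContainer performs for memory backing a NumPy array.
    #[pyclass]
    struct RustOwned {
        data: Vec<u8>,
    }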

Can I use your PySliceContainer directly? Or can you give me some pseudo-code to implement it?

I don't think so, because this type is pretty much tied to backing a NumPy array by a Rust-allocated Vec (or boxed slice), but if I understand you correctly, you want bytes-like access to a Mmap instance.

I fear you will have to figure this out via https://pyo3.rs/v0.21.1/class/protocols.html?highlight=__getbuffer__#buffer-objects which is a thin layer over the CPython FFI objects until we provide a more ergonomic API to achieve this.
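
Filling in __getbuffer__ for the MmapBuffer sketch above could look roughly like this (untested; assumes PyO3 0.21 and PyBuffer_FillInfo from the CPython FFI; note that a second #[pymethods] block on one type needs the multiple-pymethods feature):

    use pyo3::exceptions::PyBufferError;
    use pyo3::ffi;
    use pyo3::prelude::*;
    use std::os::raw::{c_int, c_void};

    #[pymethods]
    impl MmapBuffer {
        // Export the mapped bytes read-only. PyBuffer_FillInfo stores `slf`
        // in view.obj and increments its refcount, which keeps the Mmap
        // alive for as long as any buffer view exists.
        unsafe fn __getbuffer__(
            slf: Bound<'_, Self>,
            view: *mut ffi::Py_buffer,
            flags: c_int,
        ) -> PyResult<()> {
            if flags & ffi::PyBUF_WRITABLE == ffi::PyBUF_WRITABLE {
                return Err(PyBufferError::new_err("mmap-backed buffer is read-only"));
            }
            let this = slf.borrow();
            let data: &[u8] = &this.mmap;
            let ret = ffi::PyBuffer_FillInfo(
                view,
                slf.as_ptr(),
                data.as_ptr() as *mut c_void,
                data.len() as ffi::Py_ssize_t,
                1, // readonly
                flags,
            );
            if ret == -1 {
                return Err(PyErr::take(slf.py())
                    .unwrap_or_else(|| PyBufferError::new_err("PyBuffer_FillInfo failed")));
            }
            Ok(())
        }

        // Nothing to free here; PyO3 releases view.obj after this returns.
        unsafe fn __releasebuffer__(&self, _view: *mut ffi::Py_buffer) {}
    }

From Python, something like np.frombuffer(MmapBuffer(path), dtype=np.uint8) would then wrap the mapping without a copy.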

Thinking outside of the box, I am not sure how exactly you are using Rust here but you could also just use Python's mmap support directly and pass the resulting object to Rust as PyAny. But I am not sure how you want to access those bytes afterwards...
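
On the Rust side, such an object (Python's mmap.mmap, or anything else exporting the buffer protocol) could be consumed via pyo3::buffer::PyBuffer; a hypothetical sketch:

    use pyo3::buffer::PyBuffer;
    use pyo3::prelude::*;

    // Hypothetical: accept any buffer-providing Python object and inspect
    // its bytes on the Rust side without copying them first.
    #[pyfunction]
    fn buffer_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
        let buffer: PyBuffer<u8> = PyBuffer::get_bound(obj)?;
        Ok(buffer.len_bytes())
    }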


ZHUI commented Apr 10, 2024

Thanks for your advice. In my case, copying out of the mmap is acceptable, since I need to load the weights into memory anyway.

The core problem is that memcpy from mmap'd memory is very slow; see:

https://stackoverflow.com/questions/52845387/improving-mmap-memcpy-file-read-performance

In my case, open(filename); f.read() reaches 2 GB/s, while memcpy(mmap(filename)) only reaches 1.3 GB/s, which is much slower than reading the file.

@adamreichold (Member)

This seems reasonable: if you are going to read the whole file, f.read() knows this intent and can optimize accordingly, while memcpy(mmap(filename)) has to discover it via the page faults it incurs.

I guess you could reduce the difference by applying madvise(.., MADV_SEQUENTIAL) to the memory map to tell the kernel what you are going to do, but there is little point in modifying the process' address space if all you are going to do is pull the data into a buffer. Bypassing the page cache via direct I/O might be useful if you really read it only once (and not repeatedly in an interactive session).
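
With the memmap2 crate, that hint could be applied like this (a sketch; Advice::Sequential corresponds to MADV_SEQUENTIAL and the advise call is Unix-only):

    use memmap2::{Advice, Mmap};
    use std::fs::File;

    fn map_sequential(path: &str) -> std::io::Result<Mmap> {
        let file = File::open(path)?;
        // Safety: the underlying file must not be truncated while mapped.
        let mmap = unsafe { Mmap::map(&file)? };
        // Tell the kernel the mapping will be read front to back so it can
        // read ahead more aggressively.
        mmap.advise(Advice::Sequential)?;
        Ok(mmap)
    }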

Using std::fs::read and IntoPyArray should also provide a performant approach to turn this into a NumPy array without additional copies.
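
That combination might look roughly like this (assumes rust-numpy 0.21; load_u8_array is a hypothetical function name):

    use numpy::{IntoPyArray, PyArray1};
    use pyo3::prelude::*;

    // Hypothetical: read the whole file in one go (disk -> mem, one copy),
    // then hand the Vec to NumPy without copying again; rust-numpy keeps
    // the Vec alive as the array's base object.
    #[pyfunction]
    fn load_u8_array<'py>(py: Python<'py>, path: &str) -> PyResult<Bound<'py, PyArray1<u8>>> {
        let bytes = std::fs::read(path)?;
        Ok(bytes.into_pyarray_bound(py))
    }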
