
Improve performance for PyByteArray::new #4058

Open · ZHUI opened this issue Apr 8, 2024 · 5 comments

ZHUI commented Apr 8, 2024

Hello, I recently found that loading with safetensors is more than 50% slower than loading with pickle.

huggingface/safetensors#460

For a 1 GB NumPy array, loading with pickle takes only about 1 s, but safetensors needs 1.8 s:

safetensors load: 1.8608193397521973
pickle      load: 1.004188060760498

I then profiled it and found that the slowest part is the call to PyByteArray::new, which consumes almost all of the time!
https://github.com/ZHUI/safetensors/tree/performance/load_speed
ZHUI/safetensors@889c432

    let data = &mmap[info.data_offsets.0 + self.offset..info.data_offsets.1 + self.offset];

    error!("get_tensor mmap done");

    let array: PyObject = Python::with_gil(|py| PyByteArray::new(py, data).into_py(py));

    error!("get_tensor PyByteArray");
[2024-04-08T03:40:56.506Z ERROR safetensors_rust] Open new statrt
load weight 1712547656.5070753
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] 710 get_tensor
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] get_tensor
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] get_tensor mmap done
[2024-04-08T03:40:58.375Z ERROR safetensors_rust] get_tensor PyByteArray
[2024-04-08T03:40:58.375Z ERROR safetensors_rust] create_tensor done!
sf load: 1.9328322410583496

For PyByteArray::new(py, data), data is an address inside the mmap, and PyByteArray::new simply calls PyByteArray_FromStringAndSize:
https://github.com/python/cpython/blob/733e56ef9656dd79055acc2a3cecaf6054a45b6c/Objects/bytearrayobject.c#L134-L142
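
For reference, PyO3's PyByteArray::new is essentially a thin wrapper over that FFI call; a paraphrased sketch (not the exact PyO3 source):

    use pyo3::ffi;
    use std::os::raw::c_char;

    // Paraphrased sketch: CPython allocates a fresh bytearray and memcpys the
    // slice into it, so constructing the object always costs one full copy.
    unsafe fn bytearray_from_slice(src: &[u8]) -> *mut ffi::PyObject {
        ffi::PyByteArray_FromStringAndSize(src.as_ptr() as *const c_char, src.len() as ffi::Py_ssize_t)
    }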

It seems to just memcpy the mmap'd data into newly allocated memory, so the copy is disk -> mem.

I don't understand why it is so slow. Is there an additional mem -> mem copy for this operation, making it disk -> mem -> mem?

Is there any way to create a PyByteArray from existing memory without copying the data?

@adamreichold (Member)

Is there any way to create a PyByteArray from existing memory without copying the data?

I don't think that is possible, especially with PyByteArray, which provides mutable access to the underlying data. Do you need mutability? If not, PyBytes might be a better choice, but even that type will copy in order to move ownership of the data under the control of the Python heap.

I think a real zero-copy solution would be to implement a custom pyclass which implements the buffer protocol and is directly backed by the Mmap instance you already have. We recently discussed this in #4003 and we use something similar in rust-numpy, albeit as a NumPy base object, without the buffer protocol but used to perform type erasure.
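
A minimal sketch of such a pyclass, assuming the memmap2 crate (MmapBuffer is a hypothetical name), might look like:

    use memmap2::Mmap;
    use pyo3::prelude::*;

    // Hypothetical wrapper: the pyclass owns the memory map, so the map's
    // lifetime is tied to the Python object's reference count. The buffer
    // protocol (sketched further down in this thread) can then expose the
    // mapped bytes to Python without copying them.
    #[pyclass]
    struct MmapBuffer {
        mmap: Mmap,
    }

    #[pymethods]
    impl MmapBuffer {
        #[new]
        fn new(path: &str) -> PyResult<Self> {
            let file = std::fs::File::open(path)?;
            // Safety: the file must not be truncated while the map is alive.
            let mmap = unsafe { Mmap::map(&file)? };
            Ok(Self { mmap })
        }
    }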


ZHUI commented Apr 8, 2024

@adamreichold
Thanks for your answer. I am not very familiar with Rust.

even that type will copy to move ownership of the data under the control of the Python heap.

Does it mean the data will always be copied when we pass data from Rust to Python?

I think a real zero-copy solution would be to implement a custom pyclass which implements the buffer protocol and is directly backed by the Mmap instance you already have. We recently discussed this in #4003 and we use something similar in rust-numpy, albeit as a NumPy base object, without the buffer protocol but used to perform type erasure.

https://github.com/PyO3/rust-numpy/blob/0832b28bf84a3f6053ca647a43625d715831db00/src/array.rs#L402-L421

I checked the PySliceContainer code. Is it passing the Rust data_ptr into the new array and then using PY_ARRAY_API.PyArray_SetBaseObject to tie the container's lifetime to the array, thereby avoiding a memcpy?

        let mut dims = dims.into_dimension();
        let ptr = PY_ARRAY_API.PyArray_NewFromDescr(
            py,
            PY_ARRAY_API.get_type_object(py, npyffi::NpyTypes::PyArray_Type),
            T::get_dtype_bound(py).into_dtype_ptr(),
            dims.ndim_cint(),
            dims.as_dims_ptr(),
            strides as *mut npy_intp,    // strides
            data_ptr as *mut c_void,     // data
            npyffi::NPY_ARRAY_WRITEABLE, // flag
            ptr::null_mut(),             // obj
        );

        PY_ARRAY_API.PyArray_SetBaseObject(
            py,
            ptr as *mut npyffi::PyArrayObject,
            container as *mut ffi::PyObject,
        );

        Bound::from_owned_ptr(py, ptr).downcast_into_unchecked()

Can I use your PySliceContainer directly? Or can you give me some pseudo-code to implement it?

@adamreichold (Member)

Does it mean the data will always be copied when we pass data from Rust to Python?

Basically, yes, because someone has to free the data and must know how to do that. In our case, the main task of PySliceContainer is to call the correct drop functions when Python's reference count reaches zero, because Python types like bytes do not know how to free Rust-allocated memory.
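
As an illustration of that bookkeeping, any #[pyclass] already gives Rust data a Python-managed lifetime (names here are hypothetical):

    use pyo3::prelude::*;

    // When the Python reference count of a RustOwned instance reaches zero,
    // PyO3 drops the struct and Vec's Drop frees the allocation -- the same
    // job PySliceContainer performs for memory backing a NumPy array.
    #[pyclass]
    struct RustOwned {
        data: Vec<u8>,
    }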

Can I use your PySliceContainer directly? Or can you give me some pseudo-code to implement it?

I don't think so, because this type is pretty much tied to backing a NumPy array by a Rust-allocated Vec (or boxed slice), but if I understand you correctly, you want bytes-like access to a Mmap instance.

I fear you will have to figure this out via https://pyo3.rs/v0.21.1/class/protocols.html?highlight=__getbuffer__#buffer-objects which is a thin layer over the CPython FFI objects until we provide a more ergonomic API to achieve this.
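
Filling in __getbuffer__ for the MmapBuffer sketch above could look roughly like this (untested; assumes PyO3 0.21 and PyBuffer_FillInfo from the CPython FFI; note that a second #[pymethods] block on one type needs the multiple-pymethods feature):

    use pyo3::exceptions::PyBufferError;
    use pyo3::ffi;
    use pyo3::prelude::*;
    use std::os::raw::{c_int, c_void};

    #[pymethods]
    impl MmapBuffer {
        // Export the mapped bytes read-only. PyBuffer_FillInfo stores `slf`
        // in view.obj and increments its refcount, which keeps the Mmap
        // alive for as long as any buffer view exists.
        unsafe fn __getbuffer__(
            slf: Bound<'_, Self>,
            view: *mut ffi::Py_buffer,
            flags: c_int,
        ) -> PyResult<()> {
            if flags & ffi::PyBUF_WRITABLE == ffi::PyBUF_WRITABLE {
                return Err(PyBufferError::new_err("mmap-backed buffer is read-only"));
            }
            let this = slf.borrow();
            let data: &[u8] = &this.mmap;
            let ret = ffi::PyBuffer_FillInfo(
                view,
                slf.as_ptr(),
                data.as_ptr() as *mut c_void,
                data.len() as ffi::Py_ssize_t,
                1, // readonly
                flags,
            );
            if ret == -1 {
                return Err(PyErr::take(slf.py())
                    .unwrap_or_else(|| PyBufferError::new_err("PyBuffer_FillInfo failed")));
            }
            Ok(())
        }

        // Nothing to free here; PyO3 releases view.obj after this returns.
        unsafe fn __releasebuffer__(&self, _view: *mut ffi::Py_buffer) {}
    }

From Python, something like np.frombuffer(MmapBuffer(path), dtype=np.uint8) would then wrap the mapping without a copy.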

Thinking outside of the box, I am not sure how exactly you are using Rust here but you could also just use Python's mmap support directly and pass the resulting object to Rust as PyAny. But I am not sure how you want to access those bytes afterwards...
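
On the Rust side, such an object (Python's mmap.mmap, or anything else exporting the buffer protocol) could be consumed via pyo3::buffer::PyBuffer; a hypothetical sketch:

    use pyo3::buffer::PyBuffer;
    use pyo3::prelude::*;

    // Hypothetical: accept any buffer-providing Python object and inspect
    // its bytes on the Rust side without copying them first.
    #[pyfunction]
    fn buffer_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
        let buffer: PyBuffer<u8> = PyBuffer::get_bound(obj)?;
        Ok(buffer.len_bytes())
    }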


ZHUI commented Apr 10, 2024

Thanks for your advice. In my case, copying out of the mmap is acceptable, since I need to load the weights into memory anyway.

The core problem is that memcpy from mmap'd memory is very slow; see:

https://stackoverflow.com/questions/52845387/improving-mmap-memcpy-file-read-performance

In my case, open(filename); f.read() reaches 2 GB/s, while memcpy(mmap(filename)) only reaches 1.3 GB/s, which is much slower than reading the file.

@adamreichold (Member)

This seems reasonable: if you are going to read the whole file, f.read() knows this intent and can optimize accordingly, while memcpy(mmap(filename)) has to discover it via the page faults it incurs.

I guess you could reduce the difference by applying madvise(.., MADV_SEQUENTIAL) to the memory map to tell the kernel what you are going to do, but there is little point in modifying the process' address space if all you are going to do is pull the data into a buffer. Bypassing the page cache via direct I/O might be useful if you really read it only once (and not repeatedly in an interactive session).
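
With the memmap2 crate, that hint could be applied like this (a sketch; Advice::Sequential corresponds to MADV_SEQUENTIAL and the advise call is Unix-only):

    use memmap2::{Advice, Mmap};
    use std::fs::File;

    fn map_sequential(path: &str) -> std::io::Result<Mmap> {
        let file = File::open(path)?;
        // Safety: the underlying file must not be truncated while mapped.
        let mmap = unsafe { Mmap::map(&file)? };
        // Tell the kernel the mapping will be read front to back so it can
        // read ahead more aggressively.
        mmap.advise(Advice::Sequential)?;
        Ok(mmap)
    }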

Using std::fs::read and IntoPyArray should also provide a performant approach to turn this into a NumPy array without additional copies.
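
That combination might look roughly like this (assumes rust-numpy 0.21; load_u8_array is a hypothetical function name):

    use numpy::{IntoPyArray, PyArray1};
    use pyo3::prelude::*;

    // Hypothetical: read the whole file in one go (disk -> mem, one copy),
    // then hand the Vec to NumPy without copying again; rust-numpy keeps
    // the Vec alive as the array's base object.
    #[pyfunction]
    fn load_u8_array<'py>(py: Python<'py>, path: &str) -> PyResult<Bound<'py, PyArray1<u8>>> {
        let bytes = std::fs::read(path)?;
        Ok(bytes.into_pyarray_bound(py))
    }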
