Forced rechunking #32

Open
aldanor opened this issue Oct 6, 2023 · 8 comments

Comments


aldanor commented Oct 6, 2023

This took me a while to figure out (this was the last place I'd expect a forced rechunk to happen): while passing huge frames from Python to Rust and back, I noticed that they arrive as a single chunk even if they were multi-chunked originally.

Is there any reason not to leave rechunking to the end user? In some cases it can be very detrimental.

let ob = ob.call_method0("rechunk")?;

... and also this:

let s = self.0.rechunk();


aldanor commented Oct 6, 2023

Ok, edit, after a bit of reading...

IntoPy: basically creates a pa.Array via pa.Array._import_from_c. In this case, if we have multiple chunks, can we simply do that for each chunk and then call pa.chunked_array(chunks)?

  • Problem is though, pl.from_arrow() seems to squash pa.ChunkedArray somewhere along the way anyways...

FromPyObject: more important but a bit more obscure:

  • The implementation is relying on Series.to_arrow() which always returns contiguous result
    • Because the first line in PySeries::to_arrow() calls self.rechunk(true)
  • DataFrame.to_arrow(), however, returns a chunked Table (!)
    • But FromPyObject for PyDataFrame doesn't/can't use it and instead does it series-by-series so you end up with rechunked columns anyways
    • If there's more than one chunk, can the dataframe be reconstructed as-is from record batches? (similar conversion is done in the other direction already in arrow_interop anyways)

@ritchie46 (Member)

We should return ChunkedArrays to arrow. That we don't was probably a bit of laziness when I implemented this.


aldanor commented Oct 9, 2023

We should return ChunkedArrays to arrow.

To arrow or from arrow? 🙂 (i.e. IntoPy or FromPyObject?)

If I'm reading it right btw, chunked-array API is not part of arrow's stable C API, is that part of the problem here?

// Yea, in some cases this kind of rechunking may be catastrophic, e.g. if your dataframes are 50-100 GB, a rechunk is the last thing you want happening behind the scenes...

@ritchie46 (Member)

But we could return a list of arrow arrays. 🤔 And then even use that to create a pyarrow ChunkedArray.


aldanor commented Oct 9, 2023

Yea, I think that should work.

Also, the question then is whether the single-chunk case should be special-cased or not: should it yield a list of one array (producing a chunked array with a single batch), or a plain array?

@ritchie46 (Member)

I think we can add a rechunk parameter and return an array if rechunk=True and otherwise always a list of arrays.


aldanor commented Oct 17, 2023

That sounds reasonable. The default being no rechunking?

@ritchie46 (Member)

That sounds reasonable. The default being no rechunking?

Yes. Default to not exploding your memory. 🙈
