Avoiding redundant re-computation with `apply_gufunc` #10591

noahprime · 2023-10-23T18:29:08Z

I've been working on a project that uses Dask to apply a custom function over chunks of my data, with dask.distributed parallelizing the work.

I'd been using da.map_blocks() to do this. However, I now want to add a second return value containing diagnostic information, so I've tried da.apply_gufunc, essentially like this:

result, diagnostic_result = da.apply_gufunc(
    my_custom_function,
    '(i),(i),(i),(i)->(),()', 
    da_one, 
    da_two,
    da_three,
    da_four, 
    output_dtypes=(float,float) 
)

However, I need to call .compute() on both return values, which triggers two computations even though the second is completely redundant (effectively every function call required to create the diagnostic result has already been made when creating the primary result). Is there a way to avoid this?

Alternatively, is there another proven way to save diagnostic results (in this case, specifically, I want an n x n array of floats as output) while using Dask distributed and da.map_blocks?

mrocklin · 2023-10-23T18:46:54Z

Maintainer

This might help: https://docs.dask.org/en/stable/best-practices.html#avoid-calling-compute-repeatedly
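As a rough illustration of what that page suggests, here is a minimal sketch of computing both outputs in one pass with `dask.compute`. The body of `my_custom_function` and the input arrays are hypothetical stand-ins (the original function is not shown in this thread); only the call pattern is the point.

```python
import numpy as np
import dask
import dask.array as da


def my_custom_function(a, b, c, d):
    # Hypothetical stand-in for the real function: it reduces the core
    # dimension (the last axis) and returns a primary result plus a diagnostic.
    primary = (a + b + c + d).sum(axis=-1)
    diagnostic = (d - a).mean(axis=-1)
    return primary, diagnostic


# Illustrative inputs: 6 rows, core dimension of length 4 held in a single chunk.
x = da.ones((6, 4), chunks=(3, 4))

result, diagnostic_result = da.apply_gufunc(
    my_custom_function,
    "(i),(i),(i),(i)->(),()",
    x, 2 * x, 3 * x, 4 * x,
    output_dtypes=(float, float),
)

# One call to dask.compute merges both graphs, so tasks shared between the
# two outputs run once instead of once per .compute() call.
result_np, diagnostic_np = dask.compute(result, diagnostic_result)
```

Calling `result.compute()` and then `diagnostic_result.compute()` would instead build and execute the shared tasks twice.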

1 reply

noahprime Oct 23, 2023
Author

Ah, thank you so much, that's great!

EDIT: Never mind, I figured it out by changing how I was using the signature, which has been a bit confusing to learn.

Somewhat related, since I'm trying to move from map_blocks to apply_gufunc: is there a way around the requirement that chunk sizes be the same in apply_gufunc? In map_blocks I can have a dimension with just one chunk, and it doesn't matter if that dimension has different lengths across the input arrays. It's strange that apply_gufunc won't let me pass data with the same number of chunks just because the lengths are different...
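One possible angle, offered as a sketch rather than a confirmed answer from this thread: in a gufunc signature, reusing a name such as `i` ties those core dimensions to the same length, while distinct names let each input have its own length. The function `per_row_stats` and the array shapes below are hypothetical.

```python
import numpy as np
import dask
import dask.array as da


def per_row_stats(a, b):
    # Hypothetical function: reduces each input along its own core dimension.
    return a.mean(axis=-1), b.max(axis=-1)


# Same loop dimension (6 rows, chunked 3 + 3), but core dimensions of
# different lengths (4 and 7), each held in a single chunk.
a = da.ones((6, 4), chunks=(3, 4))
b = da.full((6, 7), 2.0, chunks=(3, 7))

# Distinct names "i" and "j" mean the two core lengths need not match.
mean_a, max_b = da.apply_gufunc(
    per_row_stats,
    "(i),(j)->(),()",
    a, b,
    output_dtypes=(float, float),
)

mean_a_np, max_b_np = dask.compute(mean_a, max_b)
```

The loop dimension still has to be chunked consistently across inputs. `apply_gufunc` also accepts an `allow_rechunk` keyword for cases where chunking, rather than length, is the mismatch; whether either option fits depends on the actual data layout.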
