Scattering Functions That Use Large Arrays #4322

elliottperryman · 2020-12-05T16:52:17Z

elliottperryman
Dec 5, 2020

Hi all,

I've got a question that may turn out to be an easy problem, but I'm not sure. My use case is this - I'm using dask distributed to map work across many nodes and be flexible with where it can run (which dask has been really great for!). However, I'm not sure how to scatter a large function across a client. Here's the code that would reproduce this effect (increase the size of big_data or number of processors in the client to see this more):

import numpy as np
from dask.distributed import Client, as_completed

def my_code(func, x, client):
    # x is not big, but the func calls a big array 
    # is there a way to scatter func? 
    futures = client.map(func, x)
    # just does some stupid work 
    max_val = -1
    for res in as_completed(futures):
        if res.result() > max_val:
            max_val = res.result()
    return max_val

def user_code():
    big_data = np.random.random((10**3,10**3))
    def big_func(x):
        if np.sum(big_data)<0: print('this is just to call the big array')
        return x**2
    client = Client()
    print('dashboard:',client.dashboard_link)
    best = my_code(big_func, range(101), client)
    print('best:',best)
    client.close()

def main():
    user_code()

if __name__ == "__main__":
    main()

What do you all think? I am not sure if it is possible, since my code doesn't know about big_data. In its real use case, client would spread across many machines, so the effect of each parallel application referencing the same memory would be even bigger (at least that's my interpretation).
Thanks,
Elliott

SultanOrazbayev · 2020-12-11T07:54:54Z

SultanOrazbayev
Dec 11, 2020

Have you tried scatter on big_data? The modification below works on my end, but maybe this is not useful for your case?

import numpy as np
from dask.distributed import Client, as_completed

def my_code(func, x, client):
    # x is not big, but the func calls a big array 
    # is there a way to scatter func? 
    futures = client.map(func, x)
    # just does some stupid work 
    max_val = -1
    for res in as_completed(futures):
        if res.result() > max_val:
            max_val = res.result()
    return max_val

def user_code():
    big_data = np.random.random((10**3,10**3))
    def big_func(x):
        if np.sum(big_data.result())<0: print('note .result() to convert the future')
        return x**2
    client = Client()
    print('dashboard:',client.dashboard_link)
    big_data = client.scatter(big_data) # if every worker needs it, you can use broadcast=True
    best = my_code(big_func, range(101), client)
    print('best:',best)
    client.close()

def main():
    user_code()

if __name__ == "__main__":
    main()

As a backup strategy, it's possible to change the work to write big_data to disk. If you have a very large number of workers accessing big_data simultaneously, it might be worth creating several copies of the files and ask different workers to access different copies.

1 reply

elliottperryman Dec 17, 2020
Author

Thanks for the reply! I was worried that the answer would be for user_code to scatter the data and use futures in their functions, since asking users to do more complicated functions is always a hurdle. However, I'm more relieved that there wasn't some trick I was missing. Writing to file is a good idea.

Thanks for the help!!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scattering Functions That Use Large Arrays #4322

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Scattering Functions That Use Large Arrays #4322

elliottperryman Dec 5, 2020

Replies: 1 comment · 1 reply

SultanOrazbayev Dec 11, 2020

elliottperryman Dec 17, 2020 Author

elliottperryman
Dec 5, 2020

Replies: 1 comment 1 reply

SultanOrazbayev
Dec 11, 2020

elliottperryman Dec 17, 2020
Author