dask.array.stack seems to grow quadratically and is very slow, but dask.delayed(np.stack) grows linearly? #7116
-
Context for the question: I'm currently working through "Data Science with Python and Dask" from Manning Publications, and I'm stuck at chapter 10, where a trainset for dask_ml is created. My initial problem was that saving the trainset to a zarr file using the example code in the book stalled and never finished. I luckily managed to rewrite the code there using dask.delayed, which then saved the data in an acceptable time. As the whole thing puzzles me, I'd like to ask here what exactly the problem might be (I'm a beginner at dask). I broke the problem down to this example code:

```python
import time

import numpy as np
import pandas as pd
import dask
import dask.array
import dask.dataframe


def random_entry():
    return np.random.randint(low=0, high=2, size=(100,))


if __name__ == "__main__":
    for size in [5000, 10000, 20000, 40000]:
        starttime = time.time()
        dp = pd.DataFrame.from_dict({'features': [random_entry() for _ in range(size)]})
        df = dask.dataframe.from_pandas(dp, npartitions=10)
        preparetime = time.time() - starttime

        starttime = time.time()
        features = dask.array.stack(df['features']).rechunk(5000)
        stacktime = time.time() - starttime

        starttime = time.time()
        features.compute()
        computetime = time.time() - starttime

        print("Data size {:6d}: Data preparation time {:4.2f}s, stack time {:4.2f}s, compute time {:4.2f}s".format(
            size, preparetime, stacktime, computetime))
```

This gives me this output:
When I rewrite it with

```python
def random_entry():
    return np.random.randint(low=0, high=2, size=(100,))


if __name__ == "__main__":
    for size in [5000, 10000, 20000, 40000]:
        starttime = time.time()
        dp = pd.DataFrame.from_dict({'features': [random_entry() for _ in range(size)]})
        df = dask.dataframe.from_pandas(dp, npartitions=10)
        preparetime = time.time() - starttime

        starttime = time.time()
        array_delayed = dask.delayed(np.stack)(df['features'])
        features = dask.array.from_delayed(array_delayed, shape=(size, 100), dtype=np.int64).rechunk(5000)
        stacktime = time.time() - starttime

        starttime = time.time()
        features.compute()
        computetime = time.time() - starttime

        print("Data size {:6d}: Data preparation time {:4.2f}s, stack time {:4.2f}s, compute time {:4.2f}s".format(
            size, preparetime, stacktime, computetime))
```

I now get this output:
I run this on a relatively new laptop with plenty of resources (16 GB RAM, 8-core i7-7700HQ), so resources should not be the problem. I don't observe any swapping or high CPU load. I use these versions:
Any hint what is going on?
Replies: 2 comments
-
Interesting! I went to use dask stack on something large a couple of days ago and it seemed like it was going to take a very long time.
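One way to see this directly (my own sketch, not from the thread) is to count the tasks `da.stack` generates as the number of stacked arrays grows, and watch how long graph construction takes:

```python
import numpy as np
import dask.array as da

# Count tasks in the graph that da.stack builds. If graph construction
# is quadratic, the time to build it (and possibly the task count)
# outpaces n as n doubles.
for n in [1000, 2000, 4000]:
    pieces = [da.from_array(np.zeros(100), chunks=100) for _ in range(n)]
    stacked = da.stack(pieces)
    print(n, len(stacked.__dask_graph__()))
```

Note this stacks proper dask arrays, the documented input for `da.stack`; passing a dask Series column as in the original code goes through an extra coercion step.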
-
Update: This pull request seems to have found the problem, looks like it really was an inefficiency in the dask code: #7402

I retried the above code, and now I get linear time when I use `dask.array.stack`. Still slower than the `delayed` code, but a significant improvement.