dask.array.stack seems to grow quadratically and is very slow, but dask.delayed(np.stack) grows linearly? #7116
-
Context for the question: I'm currently working through "Data Science with Python and Dask" from Manning Publications, and I'm stuck at chapter 10, where a trainset for dask_ml is created. My initial problem was that saving the trainset to a zarr file using the example code in the book stalled and never finished. I luckily managed to rewrite the code there using dask.delayed, which then saved the data in an acceptable time. As the whole thing puzzles me, I'd like to ask here what exactly the problem might be (I'm a beginner at dask). I broke the problem down to this example code:

```python
import time

import numpy as np
import pandas as pd
import dask
import dask.array
import dask.dataframe


def random_entry():
    return np.random.randint(low=0, high=2, size=(100,))


if __name__ == "__main__":
    for size in [5000, 10000, 20000, 40000]:
        starttime = time.time()
        dp = pd.DataFrame.from_dict({'features': [random_entry() for _ in range(size)]})
        df = dask.dataframe.from_pandas(dp, npartitions=10)
        preparetime = time.time() - starttime

        starttime = time.time()
        features = dask.array.stack(df['features']).rechunk(5000)
        stacktime = time.time() - starttime

        starttime = time.time()
        features.compute()
        computetime = time.time() - starttime

        print("Data size {:6d}: Data preparation time {:4.2f}s, stack time {:4.2f}s, compute time {:4.2f}s".format(
            size, preparetime, stacktime, computetime))
```

This gives me this output:
When I rewrite it with

```python
def random_entry():
    return np.random.randint(low=0, high=2, size=(100,))


if __name__ == "__main__":
    for size in [5000, 10000, 20000, 40000]:
        starttime = time.time()
        dp = pd.DataFrame.from_dict({'features': [random_entry() for _ in range(size)]})
        df = dask.dataframe.from_pandas(dp, npartitions=10)
        preparetime = time.time() - starttime

        starttime = time.time()
        array_delayed = dask.delayed(np.stack)(df['features'])
        features = dask.array.from_delayed(array_delayed, shape=(size, 100), dtype=np.int64).rechunk(5000)
        stacktime = time.time() - starttime

        starttime = time.time()
        features.compute()
        computetime = time.time() - starttime

        print("Data size {:6d}: Data preparation time {:4.2f}s, stack time {:4.2f}s, compute time {:4.2f}s".format(
            size, preparetime, stacktime, computetime))
```

I now get this output:
I run this on a relatively new laptop with plenty of resources (16 GB RAM, 8-core i7-7700HQ), so resources should not be the problem. I don't observe any swapping or high CPU load. I use these versions:
Any hint what is going on?
Replies: 2 comments
-
Interesting! I went to use dask stack on something large a couple of days ago and it seemed like it was going to take a very long time.
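One way to see this directly (my own sketch, not from the thread) is to count the tasks `da.stack` generates as the number of stacked arrays grows, and watch how long graph construction takes:

```python
import numpy as np
import dask.array as da

# Count tasks in the graph that da.stack builds. If graph construction
# is quadratic, the time to build it (and possibly the task count)
# outpaces n as n doubles.
for n in [1000, 2000, 4000]:
    pieces = [da.from_array(np.zeros(100), chunks=100) for _ in range(n)]
    stacked = da.stack(pieces)
    print(n, len(stacked.__dask_graph__()))
```

Note this stacks proper dask arrays, the documented input for `da.stack`; passing a dask Series column as in the original code goes through an extra coercion step.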
-
Update: This pull request seems to have found the problem, looks like it really was an inefficiency in the dask code: #7402

I retried the above code, and now I get linear time when I use `dask.array.stack`. Still slower than the `delayed` code, but a significant improvement.