
Caching when using a distributed scheduler #7175

Answered by andersy005
xinrong-meng asked this question in Q&A

@xinrong-databricks,

> Here, calling df[filter_expr].compute() to cache doesn't work because df[filter_expr] doesn't fit into memory.

Have you tried using .persist(), i.e. df[filter_expr].persist(), to persist your results in your cluster's distributed memory?

Replies: 1 comment 3 replies

Answer selected by jrbourbeau