Skip to content

Commit

Permalink
Add plots to illustrate what the optimizer does (#11072)
Browse files Browse the repository at this point in the history
Co-authored-by: Hendrik Makait <hendrik@makait.com>
  • Loading branch information
phofl and hendrikmakait committed Apr 30, 2024
1 parent cbf0ff0 commit 814ed3b
Show file tree
Hide file tree
Showing 3 changed files with 55 additions and 1 deletion.
14 changes: 13 additions & 1 deletion docs/source/dataframe-optimizer.rst
Expand Up @@ -4,7 +4,7 @@ Optimizer
.. currentmodule:: dask.dataframe

.. note::
Dask DataFrame supports Query Planning since version 2023.03.0
Dask DataFrame supports Query Planning since version 2024.03.0

Optimization steps
------------------
Expand Down Expand Up @@ -38,13 +38,25 @@ The optimizations entail the following steps (this list is not complete):
``df.groupby(...).apply(...)`` after a merge operation will not shuffle the data
again if the groupby happens on the merge columns.

.. figure:: images/optimizer/avoiding-shuffles.svg
:align: center

Similarly, performing two subsequent Joins/Merges on the same join-column(s) will avoid shuffling the
data again. The optimizer identifies that the partitioning of the DataFrame is already as
expected and thus simplifies the operation to a single Shuffle and a trivial merge operation.

- **Automatically resizing partitions:**

The IO layers automatically adjust the partition count based on the column subset
that is selected from the dataset. Very small partitions impact the scheduler and
expensive operations like shuffling negatively. This is addressed by adjusting
the partition count automatically.

.. figure:: images/optimizer/automatic-repartitioning.svg
:align: center

Selecting two columns that together have 40 MB per 200 MB file. The optimizer reduces the number of partitions by a factor of 5.

Exploring the optimized query
-----------------------------

Expand Down

0 comments on commit 814ed3b

Please sign in to comment.