Turning off query planning is difficult #11070

jtilly · 2024-04-25T13:24:11Z

I mentioned this in #11067, but maybe this deserves its own issue: I find it difficult to turn off query planning using the Python API. Using dask.config.set only takes effect if dask.dataframe has not been imported yet.

This works (query planner is turned off):

import dask
import pandas as pd

with dask.config.set({"dataframe.query-planning": False}):
    import dask.dataframe as dd  # delayed import
    ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), chunksize=1)
    out = ddf.mean()
    print(hasattr(out, "_expr"))  # False

This doesn't work (query planner is turned on):

import dask
import pandas as pd
import dask.dataframe as dd  # I moved this import up

with dask.config.set({"dataframe.query-planning": False}):
    ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), chunksize=1)
    out = ddf.mean()
    print(hasattr(out, "_expr"))  # True

When I write library code, I typically can't control what users have already imported.
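The two examples above are consistent with the setting being read once, when dask.dataframe is first imported. The following is a toy sketch of that pattern using only the standard library. It is not dask code: toy_config, config_set, and import_frame_module are hypothetical names made up for illustration, and the real dask internals may differ.

```python
# Toy stand-in for a library that picks its backend at import time.
# Nothing here is dask code; all names are hypothetical.
import contextlib
import types

toy_config = {"query-planning": True}


@contextlib.contextmanager
def config_set(overrides):
    """Temporarily override entries in toy_config."""
    previous = dict(toy_config)
    toy_config.update(overrides)
    try:
        yield
    finally:
        toy_config.clear()
        toy_config.update(previous)


def import_frame_module():
    """Simulate a first import: the flag is read once, right here."""
    module = types.ModuleType("toy_dataframe")
    module.USES_PLANNER = toy_config["query-planning"]
    return module


# Case 1: the "import" happens inside the context, so the override is seen.
with config_set({"query-planning": False}):
    late = import_frame_module()

# Case 2: the "import" happened earlier, so the override arrives too late.
early = import_frame_module()
with config_set({"query-planning": False}):
    seen_inside_context = early.USES_PLANNER

print(late.USES_PLANNER)    # False
print(seen_inside_context)  # True
```

In this toy model, the only way to change the outcome after the fact is to run the import step again, which is what a module reload amounts to.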

Yes, adding dd = importlib.reload(dd) inside the context does fix this particular example, but that doesn't work in all settings.

E.g., imagine that I write a library with a function that some user can call with a dask dataframe:

# my library code
import dask

def my_func(ddf):
    with dask.config.set({"dataframe.query-planning": False}):
        # reloading dask.dataframe here makes no difference: ddf was already created by the caller
        out = ddf.mean()
        print(hasattr(out, "_expr"))  # True

# user code
import dask.dataframe as dd
import pandas as pd

from my_library import my_func

ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), chunksize=1)
my_func(ddf)

Is there a clever way around this? I have drastic ideas like building a conda package that sets the DASK_DATAFRAME__QUERY_PLANNING environment variable, but that might be a bit much. I would much rather turn the query planner off selectively.
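A less drastic variant of the environment-variable idea is to scope it to a child process rather than a whole conda environment. The sketch below only builds the environment and launches a script with it. The helper names env_without_query_planning and run_script are hypothetical, and whether a given dask version honors the variable is an assumption taken from the paragraph above, not something this snippet verifies.

```python
import os
import subprocess
import sys


def env_without_query_planning(base=None):
    """Return a copy of the environment with the planner flag set to False."""
    env = dict(os.environ if base is None else base)
    # Variable name taken from the issue text; the value is passed as a string.
    env["DASK_DATAFRAME__QUERY_PLANNING"] = "False"
    return env


def run_script(path):
    """Run a Python script in a child process that inherits the flag."""
    return subprocess.run(
        [sys.executable, path],
        env=env_without_query_planning(),
        check=True,
    )
```

This does not help inside an already running process where dask.dataframe has been imported, so it is a workaround for scripts and jobs, not for library functions like my_func above.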

As an aside: maybe all of this is moot once #11067 is fixed.

jtilly · 2024-04-26T13:34:13Z

Here's another example where I struggle to turn off query planning. I'm using plateau to load a dask dataframe from disk:

import dask
import pandas as pd
from tempfile import TemporaryDirectory
from plateau.io.dask.dataframe import read_dataset_as_ddf
from plateau.io.eager import store_dataframes_as_dataset

dataset_dir = TemporaryDirectory()
store_url = f"hfs://{dataset_dir.name}"


def create_plateau_dataset():
    store_dataframes_as_dataset(
        dfs=[pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5]})],
        dataset_uuid="abc123",
        store=store_url,
    )


if __name__ == "__main__":

    create_plateau_dataset()

    with dask.config.set({"dataframe.query-planning": False}):
        ddf = read_dataset_as_ddf(dataset_uuid="abc123", store=store_url)
        print(hasattr(ddf, "_expr"))  # True

Here, I wouldn't know which modules to reload to get the desired behavior.

github-actions bot added the needs triage Needs a response from a contributor label Apr 25, 2024
