Turning off query planning is difficult #11070

Open
jtilly opened this issue Apr 25, 2024 · 1 comment
Labels
needs triage Needs a response from a contributor


jtilly commented Apr 25, 2024

I mentioned this in #11067, but maybe this deserves its own issue: I find it difficult to turn off query planning using the Python API. dask.config.set only takes effect if dask.dataframe has not been imported yet at that point.

This works (query planner is turned off):

import dask
import pandas as pd

with dask.config.set({"dataframe.query-planning": False}):
    import dask.dataframe as dd  # delayed import
    ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), chunksize=1)
    out = ddf.mean()
    print(hasattr(out, "_expr"))  # False

This doesn't work (query planner is turned on):

import dask
import pandas as pd
import dask.dataframe as dd  # I moved this import up

with dask.config.set({"dataframe.query-planning": False}):
    ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), chunksize=1)
    out = ddf.mean()
    print(hasattr(out, "_expr"))  # True

When I write library code, I typically can't control what users already imported.
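
The two examples above suggest that the setting is consulted when dask.dataframe is first imported, so a library can at least detect when it is too late to change it. A minimal sketch, assuming that import-time behavior holds; the helper names below are hypothetical and not part of dask:

```python
import sys
import warnings


def dataframe_already_imported(modules=None):
    """Return True if dask.dataframe is present in the module table.

    `modules` defaults to sys.modules; passing a plain dict makes the
    check easy to exercise without dask installed.
    """
    modules = sys.modules if modules is None else modules
    return "dask.dataframe" in modules


def warn_if_setting_is_too_late(modules=None):
    """Warn when changing dataframe.query-planning is unlikely to help.

    Assumption: the option is only read when dask.dataframe is first
    imported, as the two examples above suggest.
    """
    if dataframe_already_imported(modules):
        warnings.warn(
            "dask.dataframe is already imported; setting "
            "'dataframe.query-planning' now will probably be ignored.",
            stacklevel=2,
        )
        return True
    return False
```

This only reports the problem. It does not switch the query planner off.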

Yes, adding dd = importlib.reload(dd) inside the context also fixes the issue in this example, but that doesn't work in all settings.

E.g., imagine that I write a library with a function that some user can call with a dask dataframe:

# my library code
import dask

def my_func(ddf):
    with dask.config.set({"dataframe.query-planning": False}):
        # reloading dask.dataframe here of course doesn't make a difference
        out = ddf.mean()
        print(hasattr(out, "_expr"))  # True

# user code
import dask.dataframe as dd
import pandas as pd

from my_library import my_func

ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), chunksize=1)
my_func(ddf)
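
Since the library cannot change how the incoming dataframe was built, one fallback is to detect expression-backed collections and fail with a clear message. A minimal sketch that reuses the `hasattr(..., "_expr")` check from the examples above; `uses_query_planning` and `my_func_strict` are hypothetical names, and relying on the private `_expr` attribute is an assumption that may not hold across dask versions:

```python
def uses_query_planning(collection):
    """Heuristic from the examples above: expression-backed collections
    expose a private `_expr` attribute (assumed, not a public API)."""
    return hasattr(collection, "_expr")


def my_func_strict(ddf):
    """Hypothetical variant of my_func that refuses expression-backed input
    instead of silently running with the query planner turned on."""
    if uses_query_planning(ddf):
        raise RuntimeError(
            "This function expects a dataframe created with "
            "'dataframe.query-planning' set to False before "
            "dask.dataframe was imported."
        )
    return ddf.mean()
```

This moves the failure to the library boundary, where the user can still act on it, but it does not turn the planner off.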

Is there a clever way around this? I have drastic ideas like building a conda package that sets the DASK_DATAFRAME__QUERY_PLANNING environment variable, but that might be a bit much. I would much rather turn the query planner off selectively.
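
A lighter variant of the environment-variable idea is to set the variable only for a child process rather than for a whole environment. A minimal sketch, assuming the work can be moved into a separate Python process; `run_without_query_planning` is a hypothetical helper, and it relies on dask reading `DASK_`-prefixed environment variables when its configuration is loaded:

```python
import os
import subprocess
import sys

ENV_VAR = "DASK_DATAFRAME__QUERY_PLANNING"


def run_without_query_planning(code):
    """Run `code` in a fresh interpreter with the query planner disabled.

    The variable is set only in the child's environment, so the parent
    process and its already-imported modules are left untouched.
    """
    env = dict(os.environ)
    env[ENV_VAR] = "False"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The child code would import dask.dataframe itself. Because the variable is already set when the child starts, the import order inside the child no longer matters.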

As an aside: maybe all of this is moot once #11067 is fixed.

github-actions bot added the "needs triage" label on Apr 25, 2024

jtilly commented Apr 26, 2024

Here's another example where I struggle to turn off query planning. I'm using plateau to load a dask dataframe from disk:

import dask
import pandas as pd
from tempfile import TemporaryDirectory
from plateau.io.dask.dataframe import read_dataset_as_ddf
from plateau.io.eager import store_dataframes_as_dataset

dataset_dir = TemporaryDirectory()
store_url = f"hfs://{dataset_dir.name}"


def create_plateau_dataset():
    store_dataframes_as_dataset(
        dfs=[pd.DataFrame({"x": [1, 2, 3], "y": [3, 4, 5]})],
        dataset_uuid="abc123",
        store=store_url,
    )


if __name__ == "__main__":

    create_plateau_dataset()

    with dask.config.set({"dataframe.query-planning": False}):
        ddf = read_dataset_as_ddf(dataset_uuid="abc123", store=store_url)
        print(hasattr(ddf, "_expr"))  # True

Here, I wouldn't know which modules to reload to get the desired behavior.
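
One way to get a rough idea of which modules hold a reference to dask.dataframe is to scan the module table. A minimal sketch using only the standard library; `modules_referencing` is a hypothetical helper. It only finds modules that bound the module object itself (for example via `import dask.dataframe as dd`), not ones that did `import dask.dataframe` or `from dask.dataframe import ...`, and reloading the modules it reports is not guaranteed to give the desired behavior:

```python
import sys


def modules_referencing(target_name, modules=None):
    """List names of loaded modules whose namespace holds `target_name`.

    `modules` defaults to sys.modules; a plain dict can be passed for
    testing. Only direct references to the module object are detected.
    """
    modules = sys.modules if modules is None else modules
    target = modules.get(target_name)
    if target is None:
        return []

    found = []
    # Copy the items first: the module table can change during iteration.
    for name, module in list(modules.items()):
        if module is None or module is target:
            continue
        namespace = getattr(module, "__dict__", None)
        if not isinstance(namespace, dict):
            continue
        if any(value is target for value in list(namespace.values())):
            found.append(name)
    return sorted(found)
```

Calling `modules_referencing("dask.dataframe")` after the plateau imports would list candidate modules, which is a starting point for investigation rather than a fix.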
