Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrating from current dask/dataframe implementation to dask-expr #361

Open
phofl opened this issue Jan 18, 2024 · 3 comments
Open

Migrating from current dask/dataframe implementation to dask-expr #361

phofl opened this issue Jan 18, 2024 · 3 comments

Comments

@phofl
Copy link

phofl commented Jan 18, 2024

We've been working on adding a Query optimization layer to Dask DataFrames for a while now. The project live at https://github.com/dask-contrib/dask-expr

The status quo can be summarised as follows:

  • dask-expr has close to full coverage of the Dask DataFrame API (there are at most a handful of seldomly used methods that aren't supported, thinks like melt)
  • A switch to opt into dask-expr from within dask/dataframe exists since 2023.12.1
  • The dask test suite is almost green (we will most likely finish this today or tomorrow)

The next step is to think about how we can flip users from the legacy implementation to the expression based implementation.

My suggestion is something along the lines:

  • Add a deprecation warning the init of dask/dataframe that we will flip the switch one or two releases before we actually switch over
  • prominent section in the release notes and announcement in issues and discourse

This issue is mostly meant to gather feedback about how we should approach this

@TomAugspurger
Copy link
Member

Thanks @phofl, a couple quick questions:

First, is there a list of known incompatibilities (I guess https://github.com/dask-contrib/dask-expr?tab=readme-ov-file#api-coverage, if it's up to date?). It'd be good to include that in the migration guide.

Second, on the deprecation warning

Add a deprecation warning the init of dask/dataframe

Do you have specific wording in mind? Is the intent to deprecate import dask.dataframe permanently? Or is the intent to swap out the implementation of dask.dataframe with dask.expr, but keep the name dask.dataframe for user-facing code? I think this might interact with the plan for continued development of dask-expr (whether it'll be in dask/dask, and releases will be synchronized with dask, or whether it'll continue to be developed independently).

And what's the plan for packaging this up? Will you make dask-expr a required dependency of dask[dataframe] before the switch? That would make it easier for users to adapt, since the dependency would already exist. (I guess this would interact with the plan for future development of dask-expr).

It's great to see all the progress here!

@phofl
Copy link
Author

phofl commented Jan 31, 2024

First, is there a list of known incompatibilities (I guess https://github.com/dask-contrib/dask-expr?tab=readme-ov-file#api-coverage, if it's up to date?). It'd be good to include that in the migration guide.

Yes that's up to date

Do you have specific wording in mind? Is the intent to deprecate import dask.dataframe permanently? Or is the intent to swap out the implementation of dask.dataframe with dask.expr, but keep the name dask.dataframe for user-facing code? I think this might interact with the plan for continued development of dask-expr (whether it'll be in dask/dask, and releases will be synchronized with dask, or whether it'll continue to be developed independently).

dask.dataframe will stay, we just want to swap out the implementation. dask-expr will probably eventually end up in dask/dask, but for now fast ci times and not that much baggage are more helpful than merging the repositories

And what's the plan for packaging this up? Will you make dask-expr a required dependency of dask[dataframe] before the switch? That would make it easier for users to adapt, since the dependency would already exist. (I guess this would interact with the plan for future development of dask-expr).

That's something that we haven't discussed in detail, but this is not trivial to do since dask-expr requires pandas >= 2, so the dependencies would conflict, We currently raise an error that users have to install dask-expr if they enable query planning and it's not there

@TomAugspurger
Copy link
Member

Makes sense, thanks.

I think the main thing to keep an eye on is how long we leave dask.dataframe giving that warning about using dask.expr. I'd say a short time (maybe a few months tops) would be appropriate. I wouldn't want to go too long with users learning dask from older tutorials telling them to import dask.dataframe and have it give a warning.

Maybe even before it's merged into dask/dask, we could have dask[dataframe] depend on dask-expr and just have dask.dataframe.__init__ do a from dask_expr import *.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants