[DISCUSSION] What is the timeline for dask.dataframe deprecation #10934

Open
rjzamora opened this issue Feb 16, 2024 · 8 comments
Labels: dataframe, deprecation (Something is being removed), discussion (Discussing a topic with no specific actions yet)

Comments

@rjzamora
Member

rjzamora commented Feb 16, 2024

Many users and downstream libraries were a bit surprised to see a loud deprecation warning when importing dask.dataframe after the 2024.2.0 release. The dask-expr migration was certainly obvious to anyone watching GitHub. However, the discussion and decision over the specific timeline were largely internal to Coiled.

Could we use this issue to establish a basic timeline for users and downstream libraries to use as a reference? Note that I am not asking that we try to reach a consensus on these kinds of decisions. It would just be very useful to know what the plan is (so it can be communicated easily to others).

Critical Questions:

  • What is the earliest date that the "dataframe.query-planning" default will change from "False" to "True"? For example, will it be 2024.2.1, or is the plan to do this in 2024.3.0 or later?
  • What is the earliest date that "dataframe.query-planning": "False" will be disabled entirely?
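For readers who do not want to depend on the default while the timeline is open, the option named in the questions above can be pinned explicitly. This is a minimal sketch, not an official recommendation: `pin_query_planning` is a hypothetical helper name, and it assumes the value has to be set before `dask.dataframe` is first imported, which is how the releases discussed in this thread read it.

```python
import importlib.util


def pin_query_planning(enabled: bool):
    """Pin "dataframe.query-planning" explicitly instead of relying on the default.

    Hypothetical helper: returns the value dask reports afterwards, or None
    when dask is not installed in the current environment.
    """
    if importlib.util.find_spec("dask") is None:
        return None

    import dask

    # Assumption: this has to run before the first `import dask.dataframe`,
    # because the setting is read when that module is imported.
    dask.config.set({"dataframe.query-planning": enabled})
    return dask.config.get("dataframe.query-planning")


if __name__ == "__main__":
    print(pin_query_planning(False))
```

Pinning the value in a library's test setup keeps results stable across a release that flips the default.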
@rjzamora rjzamora added dataframe discussion Discussing a topic with no specific actions yet deprecation Something is being removed labels Feb 16, 2024
@fjetter
Member

fjetter commented Feb 19, 2024

Thanks for opening the issue. First of all, this is all up for debate. Nothing here has been definitively decided.

Our current intention is to enable query planning as soon as possible. We feel good about the current performance and stability. However, we won't be able to release it with full API coverage and are instead focusing on the most important APIs (e.g. dask-expr currently does not support something like DataFrame.melt or named groupby aggregations). Considering that users can always opt out, we believe this approach benefits most users. At this point we already believe that dask-expr is better for most users than the legacy DataFrame API.
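One way to opt out without touching code is an environment variable. The sketch below shows how a dotted dask config key maps to its environment variable name, assuming dask's documented convention (a `DASK_` prefix, a double underscore for each nesting level, and hyphens treated as underscores). `dask_env_var` is a hypothetical helper written for illustration and is not part of dask.

```python
def dask_env_var(key: str) -> str:
    """Return the environment variable name for a dotted dask config key.

    Hypothetical helper based on the convention described above:
    "." becomes "__", "-" becomes "_", and the result is upper-cased.
    """
    return "DASK_" + key.replace(".", "__").replace("-", "_").upper()


if __name__ == "__main__":
    # Name of the variable to export before starting Python.
    print(dask_env_var("dataframe.query-planning"))
```

Setting that variable to `False` in the shell before starting Python would then select the legacy implementation for as long as the opt-out exists.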

There are two missing features that possibly lock out a larger number of users than we'd feel comfortable with. These are annotations and the scheduler integration.

Giving a commitment for a specific release is difficult, but given the current release schedule I consider 2024.3.0 (release date 2024-03-01) possible but a little optimistic. I am confident that we can manage 2024.3.1 (release date 2024-03-15). If we were to cut both features, we could flip the switch right now.

Edit: I got mixed up in my calendar. I suspect the most realistic release date will be 2024.3.0 which should happen on 2024-03-08. If we reduce scope and are fine without annotations/scheduler integrations we can go sooner.

What is the earliest date that "dataframe.query-planning": "False" will be disabled entirely?

There hasn't been any decision about this yet. My current assumption is that we'll hold on to this for a while until we're certain that we won't cut out larger user groups.
While it would be nice to be able to delete the old DataFrame code, we're not in a rush, considering that the old HLG backend is still in use for Arrays and Bags.

Please let us know if anything here sounds concerning or problematic. We're also interested if this all sounds too careful or too reckless :)

@fjetter
Member

fjetter commented Feb 19, 2024

The conversation about annotations is happening over in #10937

@mrocklin
Member

In conversation @fjetter mentioned to me that we should probably try things out with xgboost.dask and make sure that that project is ok post-transition.

@mrocklin
Member

For context with xgboost, they specify workers, but only after they've already converted to futures, which seems pretty safe for dask-xgboost.

@rjzamora
Member Author

Linking dask/community#361 (sorry - just saw that issue now)

@fjetter
Member

fjetter commented Feb 23, 2024

(sorry - just saw that issue now)

my fault. I only opened that one now 😅

@fjetter
Member

fjetter commented Feb 27, 2024

We're currently seeing a couple of weird recursive import errors when using dask-expr in our coiled benchmarks test suite, see coiled/benchmarks#1419. This is something we definitely want to fix, or at least understand better, before moving forward.
This test suite also runs xgboost and, from what we can tell, it is running as expected. We encountered an error in dask-ml related to wrong imports, but no other issues have popped up so far.

Therefore, I suggest not blocking on any of the above issues, i.e. neither on the annotations #10937 nor on the scheduler integration dask/dask-expr#14.

This would mean that the next release would have dask-expr enabled by default. In preparation for this, I propose changing the default on main as soon as possible to give downstream projects a chance to test against it. If any medium-sized blockers pop up, we'd postpone the release until those are fixed. If anything major comes up, we could still revert the toggle if necessary.
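Downstream projects that want to test against both settings before the default flips could run their suite once per value. The following is a rough sketch, not a prescribed workflow: `run_with_query_planning` and `run_matrix` are hypothetical helpers, and it assumes dask reads the `DASK_DATAFRAME__QUERY_PLANNING` environment variable in the child process and parses `"True"`/`"False"` as booleans.

```python
import os
import subprocess
import sys

# Assumed environment variable form of the "dataframe.query-planning" option.
ENV_VAR = "DASK_DATAFRAME__QUERY_PLANNING"


def run_with_query_planning(enabled, command):
    """Run `command` in a child process with the option forced on or off."""
    env = dict(os.environ)
    env[ENV_VAR] = "True" if enabled else "False"
    return subprocess.run(command, env=env, capture_output=True, text=True)


def run_matrix(command):
    """Run `command` once per setting and collect the results by setting."""
    return {
        enabled: run_with_query_planning(enabled, command)
        for enabled in (False, True)
    }


if __name__ == "__main__":
    # Placeholder command: a real project would substitute its own test
    # invocation here, e.g. [sys.executable, "-m", "pytest"].
    probe = [sys.executable, "-c", f"import os; print(os.environ['{ENV_VAR}'])"]
    for enabled, result in run_matrix(probe).items():
        print(enabled, result.returncode, result.stdout.strip())
```

Using a fresh process per setting avoids any dependence on import order, since the value is in place before dask is imported at all.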

This leaves the question of what to do with pandas 1.X support. I opened another issue for this: #10962

@phofl
Collaborator

phofl commented Feb 28, 2024

We're currently seeing a couple of weird recursive import errors when using dask-expr in our coiled benchmarks test suite, see coiled/benchmarks#1419 This is something we definitely want to fix or at least have better understood before moving forward.

This is now fixed on main.


4 participants