Feedback - DataFrame query planning #10995
Hello, I'm wondering if I've stumbled into a bug or a change in usage after updating dask. The strange thing is that I can read the data, but actually trying to access those values fails ... Here's the data I'm trying to read as zipped parquet: test_data.zip. I installed both dask versions using micromamba 1.5.7.
@thomas-fred I would consider it best practice to have an example prepared that does not involve sharing the actual parquet file, regardless of how small it is. Parquet has been known to be vulnerable to arbitrary code execution. While the known issues have been fixed (see https://www.cve.org/CVERecord?id=CVE-2023-47248), it still requires a little trust to load a file from an otherwise unknown source. I tried reproducing what you're describing as

```python
import dask
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"foo": range(5)}, index=range(50, 55))
pdf.to_parquet("test.parquet")
dd.read_parquet("test.parquet").index.compute()
```

and encountered this error: dask/dask-expr#993. Can you attempt to recreate your issue this way?
@jackguac I have a fix for this up here: dask/dask-expr#992
General feedback: I'm really excited about dask getting a query optimiser, but it seems like there are still a few sharp edges. We (my work) have a library with a bunch of data pipeline functions, some of which are fairly involved, and all of which use dask. I trialled bumping up the dask version today and our test suite hit two failures in previously passing tests before hanging indefinitely. I've raised a couple of bugs for the issues I can recreate, and will keep digging into the specifics and raise any further bugs I see (and put in a PR if I track down the specific cause). In the meantime, we're setting query planning to false and will check in again with the next dask version. Hope this is useful feedback; very excited about dask expressions becoming a thing.
Hello @fjetter, thanks for getting back so promptly. I didn't know about arbitrary code execution issues with parquet files. I knew that my example wasn't ideal, but I also wanted to post something before I stopped work for the weekend. Anyway, I have played with your example and can reproduce the error. I think the specific problem I ran into concerns accessing the values of a string index:

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"foo": range(3)}, index=["a", "b", "c"])
pdf.to_parquet("test.pq")
df = dd.read_parquet("test.pq")
print(f"{df.index.compute()=}")
print(f"{df.index.compute().values=}")
```
Hi @thomas-fred, thanks for providing the reproducer. We have a fix here: dask/dask-expr#1000
hey folks :) I'm getting a 404 error when trying to access the DataFrame docs
Sorry about that, we switched the link; it is now https://docs.dask.org/en/stable/dataframe-api.html
Here is the documentation feedback for Dask DataFrame query planning:

1.1 Reading the dataset with Dask and measuring the time it takes to query the "RetailValue" column: a function `query_retail_value(directory, file_names)` that, given the directory containing the CSV files and a list of file names, measures the time taken to query the 'RetailValue' column with Dask and prints the first 10 retail values from each dataset.

1.2 Reading the dataset with pandas and measuring the same thing: a function `query_retail_value_with_pandas(directory, file_names)` with the same inputs, measuring the time taken to query the 'RetailValue' column with pandas and printing the first 10 retail values from each dataset.
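The original code for the comment above did not come through; a minimal, self-contained sketch of the pandas timing helper it describes (only the function name `query_retail_value_with_pandas` and the column name 'RetailValue' come from the comment; the file layout and demo data are assumptions):

```python
import time
import pandas as pd

# Hypothetical reconstruction of the timing helper described above.
def query_retail_value_with_pandas(directory, file_names):
    start = time.perf_counter()
    # Read each CSV and select the 'RetailValue' column.
    columns = [
        pd.read_csv(f"{directory}/{name}")["RetailValue"] for name in file_names
    ]
    elapsed = time.perf_counter() - start
    return columns, elapsed

# Tiny demo dataset written to the current directory.
pd.DataFrame({"RetailValue": [9.99, 19.99, 4.50]}).to_csv(
    "retail_demo.csv", index=False
)
columns, elapsed = query_retail_value_with_pandas(".", ["retail_demo.csv"])
print(columns[0].head(10).tolist())  # first 10 retail values from the dataset
```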
@kritikakshirsagar03 Could you add a bit more context about what you want to tell us with this?
xref dask/dask-expr#1060 -- looks like a fix is already in the works! |
The latest release 2024.3.0 enabled query planning for `DataFrame`s by default. This issue can be used to report feedback and ask related questions.

If you encountered a bug or unexpected behavior, please check that you have the most recent version of `dask-expr` installed. This is a separate package with a decoupled release process, allowing us to roll out hotfixes quickly. If you are still encountering issues after an update, please open an issue with a reproducer and we will respond as soon as possible.

See below for a list of known issues and/or check the issue tracker.
Brief introduction
The legacy DataFrame implementation did not offer a way to optimize your query; `dask` executed whatever you requested literally. In many situations this was suboptimal and could cause significant performance overhead.

Let's take a simple example using the NYC Uber/Lyft dataset. This query loads the data, applies a filter on the vendor, calculates a sum and picks a single column for the result. The legacy `DataFrame` would load all data, only then apply the filter, compute the sum over all columns and then throw all the data away, since we're only interested in a single column.

Advanced users may be tempted to rewrite this to an optimized version that provides the required columns and filters to the `read_parquet` call directly, but this is now done automatically for you. After optimization, the above query is identical to one which loads only the required columns from storage and applies filters even at the row-group level, if applicable.
Column projection and predicate pushdown are only some of the most obvious optimizations that are performed automatically for you.
See for yourself and let us know what you think.
Known issues
dask-expr is not installed when updating dask

When using a package manager like `pip`, it can happen that an upgrade such as `pip install --upgrade dask` does not pull in the extra dependencies of `dask.dataframe`. To ensure all dependencies are installed, please install `dask[dataframe]` or use `conda install dask`.
Pandas copy-on-write enabled as an import side effect

Prior to dask-expr==1.0.2, importing `dask.dataframe` set the pandas option copy-on-write to True. See also #10996.
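If you are pinned to an affected dask-expr version, a sketch of how to inspect the flag yourself (assuming the pandas >= 2.0 spelling of the option name; on versions where the option no longer exists, this falls back to `None`):

```python
import pandas as pd

# Read the global copy-on-write setting, tolerating pandas versions
# that do not expose this option.
try:
    cow = pd.get_option("mode.copy_on_write")
except pd.errors.OptionError:
    cow = None
print(cow)
```

If the value was flipped as an import side effect, `pd.set_option("mode.copy_on_write", ...)` restores the setting you want.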