Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] projection_push_down optimizer rule fails in some queries #1012

Closed
ChrisJar opened this issue Jan 26, 2023 · 5 comments
Closed

[BUG] projection_push_down optimizer rule fails in some queries #1012

ChrisJar opened this issue Jan 26, 2023 · 5 comments
Labels
bug Something isn't working datafusion Related to work in DataFusion

Comments

@ChrisJar
Copy link
Collaborator

What happened:
When running or even just explaining basic INTERSECT queries like:

import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"a":[1,2,2,3,3], "b":[2,3,3,4,4]})
c = Context()
c.create_table("df", df)
c.explain("SELECT a FROM df INTERSECT SELECT b FROM df")

I get the error:

OptimizationException: Internal("Optimizer rule 'projection_push_down' failed, due to generate a different schema, original schema: DFSchema { fields: [DFField { qualifier: Some(\"df\"), field: Field { name: \"a\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }, new schema: DFSchema { fields: [DFField { qualifier: Some(\"df\"), field: Field { name: \"a\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: Some(\"df\"), field: Field { name: \"b\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }")

Anything else we need to know?:
Can be worked around by removing projection_push_down from optimizers.rs but this can severely impact performance.

Environment:

  • dask-sql version: 2022.12.0+39.g5a3ee4c
  • Python version: 3.9
  • Operating System: linux
  • Install method (conda, pip, source): happens with both conda and source builds
@ChrisJar ChrisJar added bug Something isn't working needs triage Awaiting triage by a dask-sql maintainer labels Jan 26, 2023
@ayushdg
Copy link
Collaborator

ayushdg commented Jan 26, 2023

The projection_push_down rule from datafusion is a crucial one for performance reasons and it's probably not worth excluding that rule from our list.

It feels like a bug on the datafusion side where if the column names for both sides of the intersect don't match, the rule fails. (Even if the number and type of columns on both sides are identical).

It might be worth checking datafusion and raising an equivalent issue there as well.

@ChrisJar
Copy link
Collaborator Author

Yep, changing the column names to match (SELECT a FROM df INTERSECT SELECT b AS a FROM df) seems to solve the issue. Was working a week or two ago so it must've been a recent change. I can look into it on the datafusion side. Also, it doesn't seem to be a problem with UNION so it might have something to do with INTERSECT using a LeftSemi join under the hood.

@ChrisJar
Copy link
Collaborator Author

ChrisJar commented Feb 1, 2023

@ayushdg I did some testing in Datafusion. It looks like this only fails in Datafusion 15 and is fixed in 16 and 17.

@ayushdg ayushdg added datafusion Related to work in DataFusion and removed needs triage Awaiting triage by a dask-sql maintainer labels Feb 1, 2023
@ayushdg
Copy link
Collaborator

ayushdg commented Feb 1, 2023

Thanks for investigating! In that case let's leave this open and mark it fixed by #998 when it lands.

@ChrisJar
Copy link
Collaborator Author

Closed with #998

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Related to work in DataFusion
Projects
None yet
Development

No branches or pull requests

2 participants