-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement basic COALESCE functionality #823
Changes from 6 commits
80d1a11
704e62e
497acb5
e44badb
c9e544b
cd1a147
2000932
9a76366
22cf6ad
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -355,6 +355,43 @@ def test_null(c): | |
assert_eq(df, expected_df) | ||
|
||
|
||
@pytest.mark.parametrize("gpu", [False, pytest.param(True, marks=pytest.mark.gpu)]) | ||
def test_coalesce(c, gpu): | ||
df = dd.from_pandas(pd.DataFrame({"a": [1], "b": [np.nan]}), npartitions=1) | ||
c.create_table("df", df, gpu=gpu) | ||
|
||
df = c.sql( | ||
""" | ||
SELECT | ||
COALESCE(3, 5) as c1, | ||
COALESCE(NULL, NULL) as c2, | ||
COALESCE(NULL, 'hi') as c3, | ||
COALESCE(NULL, NULL, 'bye', 5/0) as c4, | ||
COALESCE(NULL, 3/2, NULL, 'fly') as c5, | ||
COALESCE(SUM(b), 'why', 2.2) as c6, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ayushdg @charlesbluca This line is now failing on gpu with the newest dask-sql environment. Specifically,
throws:
This doesn't fail on CPU nor does it fail on the roughly equivalent query
Any Idea what might be happening? Could this be due to a change to cudf? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taking a look right now There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you list the cuDF conda packages you're using (assuming this is using conda packages and not source)? I pulled in the latest 22.12 nightlies and wasn't able to reproduce:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep here they are:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And here's my full environment
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, I still don't seem to be able to reproduce; the only difference in my environment is that it's missing Why don't we pull in the latest changes to see if these failures crop up in gpuCI |
||
COALESCE(NULL, MEAN(b), MEAN(a), 4/0) as c7 | ||
FROM df | ||
""" | ||
) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Following up the previous comment, it might be good to add tests using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With 2000932 in, I am now trying to add something like import numpy as np
import pandas as pd
from dask_sql import Context
c = Context()
c.create_table("df", pd.DataFrame({
"a": [np.nan, 1, np.nan],
"b": [np.nan, np.nan, 2]
}))
c.sql("""
select
coalesce(a, b) as c,
coalesce(sum(b), 'why') as d
from df
""")
# ParsingException: Plan("Projection references non-aggregate values: Expression df.a could not be resolved from available columns: SUM(df.b)") cc @andygrove |
||
|
||
expected_df = pd.DataFrame( | ||
{ | ||
"c1": [3], | ||
"c2": [np.nan], | ||
"c3": ["hi"], | ||
"c4": ["bye"], | ||
"c5": ["1"], | ||
"c6": ["why"], | ||
"c7": [1.0], | ||
} | ||
) | ||
|
||
df["c2"] = df["c2"].astype("float64") | ||
df["c5"] = df["c5"].astype("O") | ||
assert_eq(df, expected_df) | ||
c.drop_table("df") | ||
|
||
|
||
def test_boolean_operations(c): | ||
df = dd.from_pandas(pd.DataFrame({"b": [1, 0, -1]}), npartitions=1) | ||
df["b"] = df["b"].apply( | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,11 +37,8 @@ | |
69, | ||
70, | ||
72, | ||
75, | ||
77, | ||
78, | ||
80, | ||
84, | ||
86, | ||
87, | ||
88, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we throw in a comment here giving the reasoning for this override (IIUC differences in nullable "object" columns between cuDF and pandas)?