-
Notifications
You must be signed in to change notification settings - Fork 845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concatenate dictionary of objects along axis=0 #15648
base: branch-24.06
Are you sure you want to change the base?
Concatenate dictionary of objects along axis=0 #15648
Conversation
Original change here rapidsai#3188 Why were we casting to "float64" in the old testcase? Maybe related to this comment? rapidsai#3188 (comment)
385a58c
to
05d5ef2
Compare
this is unfortunate: 0241149 some side effect behavior of attempting to implement {
"first": (
cudf.DataFrame,
{"data": {(1, 2): [1, 2], (3, 4): [3, 4]}},
),
"second": (
cudf.DataFrame,
{"data": {(1, 2): [5, 6], (5, 6): [7, 8]}},
),
}, |
python/cudf/cudf/core/reshape.py
Outdated
result.index = cudf.MultiIndex.from_tuples( | ||
[ | ||
(k, i) | ||
for k, x in zip(keys, [obj.shape[0] for obj in objs]) | ||
for i in range(x) | ||
] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this pattern is not right, consider
In [39]: x = {
...: "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"])),
...: "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=[1, 2])),
...: "third": (pd.DataFrame(**{"data": {"C": [1, 2, 3]}}, index=["d", "g", "h"])),
...: }
In [40]: pd.concat(x, axis=0)
Out[40]:
A B C
first a 1.0 3.0 NaN
b 2.0 4.0 NaN
second 1 5.0 7.0 NaN
2 6.0 8.0 NaN
third d NaN NaN 1.0
g NaN NaN 2.0
h NaN NaN 3.0
In other words, the existing index values are concatenated, and then a new "tiled" level is added.
I think we can do this with cudf.MultiIndex.from_arrays
and usage of some libcudf primitives:
import cudf
import cudf._lib as libcudf
from cudf.core.column import as_column, concat_columns
(new_level,) = libcudf.filling.repeat([as_column(keys)], as_column(list(map(len, objs))))
existing_levels = objs[0].index.nlevel
if not all(obj.index.nlevel == existing_levels for obj in objs):
raise AssertionError("Cannot concat when indices do not have same number of levels")
# Might need to do some dtype checking/coercion here
existing = [concat_columns([obj.get_level_values(level) for obj in objs]) for level in range(existing_levels)]
new_index = cudf.MultiIndex.from_arrays([new_level, *existing])
Only partly tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wence- if the following is not correct, can you please provide an example of what would be correct?
In [39]: x = {
...: "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"])),
...: "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=["c", "e"])),
...: "third": (pd.DataFrame(**{"data": {"C": [1, 2]}}, index=["d", "g"])),
...: }
In [40]: pd.concat(x, axis=0)
Out[40]:
A B C
first a 1.0 3.0 NaN
b 2.0 4.0 NaN
second c 5.0 7.0 NaN
e 6.0 8.0 NaN
third d NaN NaN 1.0
g NaN NaN 2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pandas code looks right, but I think the iteration over range(x)
in the patch ignores the existing index and would instead just produce some concatenated range indices?
For the example shown, I think we have:
keys = ["first", "second", "third"]
and
objs = [pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"]),
pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=["c", "e"]),
pd.DataFrame(**{"data": {"C": [1, 2]}}, index=["d", "g"])]
So
[
(k, i)
for k, x in zip(keys, [obj.shape[0] for obj in objs])
for i in range(x)
]
Produces:
[('first', 0),
('first', 1),
('second', 0),
('second', 1),
('third', 0),
('third', 1),
('third', 2)]
So the first level is correct, but the inner level is wrong (it should be "a", "b", "c", "e", "d", "g"
).
Instead, if we build the level indices level-by-level as suggested above, we can produce the correct inner level.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wence- confirming -- by inner level you mean index [1]
within each tuple
, so instead of:
[('first', 0),
('first', 1),
('second', 0),
('second', 1),
('third', 0),
('third', 1),
('third', 2)]
we want:
[('first', "a"),
('first', "b"),
('second', "c"),
('second', "e"),
('third', ??),
('third', "d"), ??
('third', "g")] ??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is what i'm seeing from pandas, which makes sense to me:
>>> x = {
... "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}, "index":["a", "b"]})),
... "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}, "index":["c", "e"]})),
... "third": (pd.DataFrame(**{"data": {"C": [1, 2]}, "index":["d", "g"]})),
... }
>>>
>>> z= pd.concat(x, axis=0)
>>> z
A B C
first a 1.0 3.0 NaN
b 2.0 4.0 NaN
second c 5.0 7.0 NaN
e 6.0 8.0 NaN
third d NaN NaN 1.0
g NaN NaN 2.0
>>> z.index
MultiIndex([( 'first', 'a'),
( 'first', 'b'),
('second', 'c'),
('second', 'e'),
( 'third', 'd'),
( 'third', 'g')],
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wence- i've updated the code to produce the same MultiIndex
index as pd
. i think it's simpler than the route you were suggesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the axis=0
and tuple
columns are still an issue, so i still need to sort that out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see. If we can, I'd like to keep the construction of the new multiindex levels on device rather than coming back to the host and iterating over things.
Consider the case where each dataframe has a few million rows, it would be nice not to construct these tuples individually and then go back to the GPU.
That's what the more complicated code I sketched is trying to do:
libcudf.filling.repeat
is like numpy.repeat
but for cudf columns rather than numpy arrays. And concat_columns
is the low-level concatenation of cudf columns (which is what we need for the "inner" levels)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, i see. that makes sense. thanks will adjust
…:er-eis/cudf into er-eis/allow-concat-on-frame-dict-axis-0
80dae7e
to
15dfe11
Compare
Description
Closes #15647
Related to #15115.
Unlike
pandas.concat
,cudf.concat
withaxis=0
doesn't work with a dictionary of objects. The following code raises an error.This commit resolves this issue.
See here for context: #15115 (comment)
importantly:
As a newcomer to the repo, it'd be great if I could get some guidance if the initial solution is inefficient
Checklist