Concatenate dictionary of objects along axis=0 #15648

er-eis · 2024-05-04T00:57:08Z

Description

Closes #15647
Related to #15115.

Unlike pandas.concat, cudf.concat with axis=0 doesn't work with a dictionary of objects. The following code raises an error.

d = {
    'first': cudf.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    'second': cudf.DataFrame({'A': [5, 6], 'B': [7, 8]}),
}

cudf.concat(d, axis=0)

This commit resolves this issue.

See #15115 (comment) for context. Importantly:

... there could be some potential performance issues when concatenating along axis=0. For instance, calling MultiIndex.from_product(...) on very large inputs will be extremely slow, because cudf actually uses pandas for that operation.

As a newcomer to the repo, I'd appreciate some guidance if the initial solution is inefficient.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.

Original change here rapidsai#3188 Why were we casting to "float64" in the old testcase? Maybe related to this comment? rapidsai#3188 (comment)

copy-pr-bot · 2024-05-04T00:57:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

er-eis · 2024-05-04T04:12:35Z

this is unfortunate: 0241149

attempting to implement axis=0, combined with concatenating dataframes that have tuple columns, has a side effect that causes issues. trying to sort out what the underlying issue is.

        {
            "first": (
                cudf.DataFrame,
                {"data": {(1, 2): [1, 2], (3, 4): [3, 4]}},
            ),
            "second": (
                cudf.DataFrame,
                {"data": {(1, 2): [5, 6], (5, 6): [7, 8]}},
            ),
        },

wence- · 2024-05-07T16:36:23Z

python/cudf/cudf/core/reshape.py

+                result.index = cudf.MultiIndex.from_tuples(
+                    [
+                        (k, i)
+                        for k, x in zip(keys, [obj.shape[0] for obj in objs])
+                        for i in range(x)
+                    ]
+                )


I think this pattern is not right, consider

In [39]: x = {
    ...:     "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"])),
    ...:     "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=[1, 2])),
    ...:     "third": (pd.DataFrame(**{"data": {"C": [1, 2, 3]}}, index=["d", "g", "h"])),
    ...: }

In [40]: pd.concat(x, axis=0)
Out[40]:
            A    B    C
first  a  1.0  3.0  NaN
       b  2.0  4.0  NaN
second 1  5.0  7.0  NaN
       2  6.0  8.0  NaN
third  d  NaN  NaN  1.0
       g  NaN  NaN  2.0
       h  NaN  NaN  3.0

In other words, the existing index values are concatenated, and then a new "tiled" level is added.

I think we can do this with cudf.MultiIndex.from_arrays and usage of some libcudf primitives:

import cudf
import cudf._lib as libcudf
from cudf.core.column import as_column, concat_columns

# Repeat each key once per row of its object to build the new outer level
(new_level,) = libcudf.filling.repeat(
    [as_column(keys)], as_column(list(map(len, objs)))
)
existing_levels = objs[0].index.nlevels
if not all(obj.index.nlevels == existing_levels for obj in objs):
    raise AssertionError(
        "Cannot concat when indices do not have same number of levels"
    )
# Might need to do some dtype checking/coercion here
# (concat_columns works on columns, so the index levels may need converting first)
existing = [
    concat_columns([obj.index.get_level_values(level) for obj in objs])
    for level in range(existing_levels)
]
new_index = cudf.MultiIndex.from_arrays([new_level, *existing])

Only partly tested.

@wence- if the following is not correct, can you please provide an example of what would be correct?

In [39]: x = {
    ...:     "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"])),
    ...:     "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=["c", "e"])),
    ...:     "third": (pd.DataFrame(**{"data": {"C": [1, 2]}}, index=["d", "g"])),
    ...: }

In [40]: pd.concat(x, axis=0)
Out[40]:
            A    B    C
first  a  1.0  3.0  NaN
       b  2.0  4.0  NaN
second c  5.0  7.0  NaN
       e  6.0  8.0  NaN
third  d  NaN  NaN  1.0
       g  NaN  NaN  2.0

The pandas code looks right, but I think the iteration over range(x) in the patch ignores the existing index and would instead just produce some concatenated range indices?

For the example shown, I think we have:

keys = ["first", "second", "third"]

and

objs = [
    pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"]),
    pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=["c", "e"]),
    pd.DataFrame(**{"data": {"C": [1, 2]}}, index=["d", "g"]),
]

So

[
    (k, i)
    for k, x in zip(keys, [obj.shape[0] for obj in objs])
    for i in range(x)
]

Produces:

[('first', 0), ('first', 1), ('second', 0), ('second', 1), ('third', 0), ('third', 1), ('third', 2)]

So the first level is correct, but the inner level is wrong (it should be "a", "b", "c", "e", "d", "g").

Instead, if we build the level indices level-by-level as suggested above, we can produce the correct inner level.
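To make the difference concrete, here is a minimal pure-Python sketch that uses plain lists as stand-ins for each frame's index labels. The names `labels_per_obj`, `positional`, and `level_by_level` are illustrative only and are not part of the patch or of cudf. It contrasts the comprehension from the patch, which only sees row counts, with building the outer and inner levels separately.

```python
keys = ["first", "second", "third"]
# Hypothetical stand-ins for obj.index of each frame (two labelled rows per frame)
labels_per_obj = [["a", "b"], ["c", "e"], ["d", "g"]]

# Pattern from the patch: it iterates over row counts, so the existing
# labels never appear in the result, only positions 0..n-1
positional = [
    (k, i)
    for k, n in zip(keys, [len(labels) for labels in labels_per_obj])
    for i in range(n)
]

# Level by level: repeat each key once per row, concatenate the existing
# labels, then pair the two levels up
outer = [k for k, labels in zip(keys, labels_per_obj) for _ in labels]
inner = [label for labels in labels_per_obj for label in labels]
level_by_level = list(zip(outer, inner))

print(positional)
print(level_by_level)
```

The second list keeps the labels each frame already had, which is the pairing pandas shows in the example.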

Does that make sense?

@wence- confirming -- by inner level you mean index [1] within each tuple, so instead of:

[('first', 0), ('first', 1), ('second', 0), ('second', 1), ('third', 0), ('third', 1), ('third', 2)]

we want:

[('first', "a"), ('first', "b"), ('second', "c"), ('second', "e"), ('third', ??), ('third', "d"), ?? ('third', "g")] ??

this is what i'm seeing from pandas, which makes sense to me:

>>> x = {
...     "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}, "index":["a", "b"]})),
...     "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}, "index":["c", "e"]})),
...     "third": (pd.DataFrame(**{"data": {"C": [1, 2]}, "index":["d", "g"]})),
... }
>>>
>>> z= pd.concat(x, axis=0)
>>> z
            A    B    C
first  a  1.0  3.0  NaN
       b  2.0  4.0  NaN
second c  5.0  7.0  NaN
       e  6.0  8.0  NaN
third  d  NaN  NaN  1.0
       g  NaN  NaN  2.0
>>> z.index
MultiIndex([( 'first', 'a'),
            ( 'first', 'b'),
            ('second', 'c'),
            ('second', 'e'),
            ( 'third', 'd'),
            ( 'third', 'g')],
           )

@wence- i've updated the code to produce the same MultiIndex as pandas. i think it's simpler than the route you were suggesting.

the axis=0 and tuple columns are still an issue, so i still need to sort that out

Thanks, I see. If we can, I'd like to keep the construction of the new multiindex levels on device rather than coming back to the host and iterating over things.

Consider the case where each dataframe has a few million rows: it would be nice not to construct these tuples individually on the host and then go back to the GPU.

That's what the more complicated code I sketched is trying to do:

libcudf.filling.repeat is like numpy.repeat, but for cudf columns rather than numpy arrays. concat_columns is the low-level concatenation of cudf columns (which is what we need for the "inner" levels).
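As a rough host-side analogy for those two primitives, the sketch below is plain Python, not libcudf, and `repeat_each` and `concat_levels` are hypothetical helper names. It mirrors the semantics of `numpy.repeat` with per-element counts, plus an end-to-end concatenation of columns, so that each index level is built as one whole array rather than row by row as tuples.

```python
from itertools import chain, repeat


def repeat_each(values, counts):
    """Repeat values[i] counts[i] times, like numpy.repeat(values, counts)."""
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    return list(chain.from_iterable(repeat(v, n) for v, n in zip(values, counts)))


def concat_levels(columns):
    """Concatenate several columns (plain lists here) end to end."""
    return list(chain.from_iterable(columns))


keys = ["first", "second", "third"]
# Stand-ins for the existing index of each object being concatenated
inner_columns = [["a", "b"], [1, 2], ["d", "g", "h"]]

# New outer level: each key repeated once per row of its object
outer_level = repeat_each(keys, [len(col) for col in inner_columns])
# Existing inner level: the original labels, concatenated in order
inner_level = concat_levels(inner_columns)
```

Passing the two resulting arrays to something like `MultiIndex.from_arrays` then gives one array per level, which is the shape of the approach sketched in the review comment.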

oh, i see. that makes sense. thanks will adjust

er-eis added 7 commits April 30, 2024 17:20

Work from amanlai

7e89e43

Tests

1a767fb

Remove extraneous testcase

9cebaa2

Fix some legacy tests

0736e4e

Merge branch 'branch-24.06' into er-eis/allow-concat-on-frame-dict

f4c1e77

Concat dict axis 0

3d5a6dd

Remove extraneous file

5fab8f6

github-actions bot added the cuDF (Python) Affects Python cuDF API. label May 4, 2024

er-eis mentioned this pull request May 4, 2024

Concatenate dictionary of objects along axis=0 er-eis/cudf#1

Closed

3 tasks

er-eis added 4 commits May 3, 2024 21:09

Merge branch 'branch-24.06' of github.com:rapidsai/cudf into er-eis/a…

b7fb321

…llow-concat-on-frame-dict-axis-0

Fix merge

9c5fc52

Fix merge

3380872

Earlier check for multiple level types for axis=0

05d5ef2

er-eis force-pushed the er-eis/allow-concat-on-frame-dict-axis-0 branch from 385a58c to 05d5ef2 Compare May 4, 2024 04:09

Some progress on tuple columns with axis=0

0241149

wence- reviewed May 7, 2024

er-eis added 4 commits May 7, 2024 21:19

WIP

7bbaab1

Merge branch 'branch-24.06' into er-eis/allow-concat-on-frame-dict-ax…

ebd09ed

…is-0

Merge branch 'er-eis/allow-concat-on-frame-dict-axis-0' of github.com…

484eac5

…:er-eis/cudf into er-eis/allow-concat-on-frame-dict-axis-0

More accurate index creation

15dfe11

er-eis force-pushed the er-eis/allow-concat-on-frame-dict-axis-0 branch from 80dae7e to 15dfe11 Compare May 8, 2024 12:26

er-eis added 2 commits May 8, 2024 23:42

WIP

d6f669b

Merge branch 'branch-24.06' into er-eis/allow-concat-on-frame-dict-ax…

267d33e

…is-0
