Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concatenate dictionary of objects along axis=0 #15648

Draft
wants to merge 18 commits into
base: branch-24.06
Choose a base branch
from

Conversation

er-eis
Copy link
Contributor

@er-eis er-eis commented May 4, 2024

Description

Closes #15647
Related to #15115.

Unlike pandas.concat, cudf.concat with axis=0 doesn't work with a dictionary of objects. The following code raises an error.

d = {
    'first': cudf.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    'second': cudf.DataFrame({'A': [5, 6], 'B': [7, 8]}),
}

cudf.concat(d, axis=0)

This commit resolves this issue.

See here for context: #15115 (comment)
importantly:

... there could be some potential performance issues when concatenating along axis=0. For instance, calling MultiIndex.from_product(...) on very large inputs will be extremely slow, because cudf actually uses pandas for that operation.

As a newcomer to the repo, it'd be great if I could get some guidance if the initial solution is inefficient

Checklist

Copy link

copy-pr-bot bot commented May 4, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cuDF (Python) Affects Python cuDF API. label May 4, 2024
@er-eis er-eis force-pushed the er-eis/allow-concat-on-frame-dict-axis-0 branch from 385a58c to 05d5ef2 Compare May 4, 2024 04:09
@er-eis
Copy link
Contributor Author

er-eis commented May 4, 2024

this is unfortunate: 0241149

some side effect behavior of attempting to implement axis=0 coupled with concatenating dataframes with tuple columns causes issues. trying to sort out what the underlying issue is.

        {
            "first": (
                cudf.DataFrame,
                {"data": {(1, 2): [1, 2], (3, 4): [3, 4]}},
            ),
            "second": (
                cudf.DataFrame,
                {"data": {(1, 2): [5, 6], (5, 6): [7, 8]}},
            ),
        },

Comment on lines 512 to 518
result.index = cudf.MultiIndex.from_tuples(
[
(k, i)
for k, x in zip(keys, [obj.shape[0] for obj in objs])
for i in range(x)
]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this pattern is not right, consider

In [39]:         x = {
    ...:             "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"])),
    ...:             "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=[1, 2])),
    ...:             "third": (pd.DataFrame(**{"data": {"C": [1, 2, 3]}}, index=["d", "g", "h"])),
    ...:         }

In [40]: pd.concat(x, axis=0)
Out[40]: 
            A    B    C
first  a  1.0  3.0  NaN
       b  2.0  4.0  NaN
second 1  5.0  7.0  NaN
       2  6.0  8.0  NaN
third  d  NaN  NaN  1.0
       g  NaN  NaN  2.0
       h  NaN  NaN  3.0

In other words, the existing index values are concatenated, and then a new "tiled" level is added.

I think we can do this with cudf.MultiIndex.from_arrays and usage of some libcudf primitives:

import cudf
import cudf._lib as libcudf
from cudf.core.column import as_column, concat_columns

(new_level,) = libcudf.filling.repeat([as_column(keys)], as_column(list(map(len, objs))))

existing_levels = objs[0].index.nlevel
if not all(obj.index.nlevel == existing_levels for obj in objs):
    raise AssertionError("Cannot concat when indices do not have same number of levels")

# Might need to do some dtype checking/coercion here
existing = [concat_columns([obj.get_level_values(level) for obj in objs]) for level in range(existing_levels)]

new_index = cudf.MultiIndex.from_arrays([new_level, *existing])

Only partly tested.

Copy link
Contributor Author

@er-eis er-eis May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- if the following is not correct, can you please provide an example of what would be correct?

In [39]:         x = {
    ...:             "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"])),
    ...:             "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=["c", "e"])),
    ...:             "third": (pd.DataFrame(**{"data": {"C": [1, 2]}}, index=["d", "g"])),
    ...:         }

In [40]: pd.concat(x, axis=0)
Out[40]: 
            A    B    C
first  a  1.0  3.0  NaN
       b  2.0  4.0  NaN
second c  5.0  7.0  NaN
       e  6.0  8.0  NaN
third  d  NaN  NaN  1.0
       g  NaN  NaN  2.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pandas code looks right, but I think the iteration over range(x) in the patch ignores the existing index and would instead just produce some concatenated range indices?

For the example shown, I think we have:

keys = ["first", "second", "third"]

and

objs = [pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}}, index=["a", "b"]),
        pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}}, index=["c", "e"]),
        pd.DataFrame(**{"data": {"C": [1, 2]}}, index=["d", "g"])]

So

                    [
                        (k, i)
                        for k, x in zip(keys, [obj.shape[0] for obj in objs])
                        for i in range(x)
                    ]

Produces:

[('first', 0),
 ('first', 1),
 ('second', 0),
 ('second', 1),
 ('third', 0),
 ('third', 1),
 ('third', 2)]

So the first level is correct, but the inner level is wrong (it should be "a", "b", "c", "e", "d", "g").

Instead, if we build the level indices level-by-level as suggested above, we can produce the correct inner level.

Does that make sense?

Copy link
Contributor Author

@er-eis er-eis May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- confirming -- by inner level you mean index [1] within each tuple, so instead of:

[('first', 0),
 ('first', 1),
 ('second', 0),
 ('second', 1),
 ('third', 0),
 ('third', 1),
 ('third', 2)]

we want:

[('first', "a"),
 ('first', "b"),
 ('second', "c"),
 ('second', "e"),
 ('third', ??),
 ('third', "d"), ??
 ('third', "g")] ??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what i'm seeing from pandas, which makes sense to me:

>>> x = {
...     "first": (pd.DataFrame(**{"data": {"A": [1, 2], "B": [3, 4]}, "index":["a", "b"]})),
...     "second": (pd.DataFrame(**{"data": {"A": [5, 6], "B": [7, 8]}, "index":["c", "e"]})),
...     "third": (pd.DataFrame(**{"data": {"C": [1, 2]}, "index":["d", "g"]})),
... }
>>> 
>>> z= pd.concat(x, axis=0)
>>> z
            A    B    C
first  a  1.0  3.0  NaN
       b  2.0  4.0  NaN
second c  5.0  7.0  NaN
       e  6.0  8.0  NaN
third  d  NaN  NaN  1.0
       g  NaN  NaN  2.0
>>> z.index
MultiIndex([( 'first', 'a'),
            ( 'first', 'b'),
            ('second', 'c'),
            ('second', 'e'),
            ( 'third', 'd'),
            ( 'third', 'g')],
           )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- i've updated the code to produce the same MultiIndex index as pd. i think it's simpler than the route you were suggesting.

Copy link
Contributor Author

@er-eis er-eis May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the axis=0 and tuple columns are still an issue, so i still need to sort that out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see. If we can, I'd like to keep the construction of the new multiindex levels on device rather than coming back to the host and iterating over things.

Consider the case where each dataframe has a few million rows, it would be nice not to construct these tuples individually and then go back to the GPU.

That's what the more complicated code I sketched is trying to do:

libcudf.filling.repeat is like numpy.repeat but for cudf columns rather than numpy arrays. And concat_columns is the low-level concatenation of cudf columns (which is what we need for the "inner" levels)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i see. that makes sense. thanks will adjust

@er-eis er-eis force-pushed the er-eis/allow-concat-on-frame-dict-axis-0 branch from 80dae7e to 15dfe11 Compare May 8, 2024 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuDF (Python) Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA] Concatenate dictionary of objects along axis=0
2 participants