Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix concat #673

Open
AndreiKingsley opened this issue Apr 23, 2024 · 11 comments
Open

Fix concat #673

AndreiKingsley opened this issue Apr 23, 2024 · 11 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@AndreiKingsley
Copy link
Collaborator

concat removes key column entirely (name and values)

image image
@AndreiKingsley AndreiKingsley added the bug Something isn't working label Apr 23, 2024
@AndreiKingsley
Copy link
Collaborator Author

Maybe it's useful to add GroupBy.origin: DataFrame? that returns original dataframe if it was created via DataFrame.groupBy()

@zaleslaw zaleslaw added this to the 0.14.0 milestone Apr 23, 2024
@koperagen koperagen self-assigned this Apr 23, 2024
@Jolanrensen
Copy link
Collaborator

I think this is the intended behavior. The key of the group is something temporary and usually consists of columns already in the DF.
We are working on a way to access the group keys from aggregate though (#662), maybe that can be a nice alternative.

The original DataFrame can be retrieved using concat (albeit with a different order perhaps).

@AndreiKingsley
Copy link
Collaborator Author

Ok, anyway new concat is needed for the purpose I described.

@AndreiKingsley
Copy link
Collaborator Author

@Jolanrensen
Copy link
Collaborator

maybe a concatWithKeys() would be a nice addition?

@koperagen
Copy link
Collaborator

I think it won't hurt to make do it by default. One might say that df.groupBy { expr { } } is a shortcut for df.add() { }.groupBy {}

@Jolanrensen
Copy link
Collaborator

if we do it by default, then we would get duplicate columns, because the key columns are often in the groups as well

@koperagen
Copy link
Collaborator

koperagen commented Apr 24, 2024

Andrey's implementation only adds "new" columns (or so i understood)

@Jolanrensen
Copy link
Collaborator

But then, what qualifies as "new"?

  • groupBy { expr { myCol } }, yes
  • but groupBy { myCol + 1 }?
  • or groupBy { myCol named "other" }

I think we should be careful here

@Jolanrensen
Copy link
Collaborator

There's also the case where a user creates a new expr column with a duplicate name that should still be kept, so my suggestion is the following:
Create a concatWithKeys() that will add all key columns to the front of the groups regardless of whether they were in the DF already. Avoid naming clashes by using the ColumnNameGenerator, for instance with DynamicDataFrameBuilder.

Something like:

internal fun GroupBy<*, *>.concatWithKeys(): DataFrame<*> =
    mapToFrames {
        DynamicDataFrameBuilder()
            .apply {
                for (column in group.columns()) {
                    add(column)
                }
                val rowsCount = group.rowsCount()
                for ((name, value) in key.toMap()) {
                    add(List(rowsCount) { value }.toColumn(name))
                }
            }
            .toDataFrame()
            .moveToLeft { takeLast(key.count()) }
    }.concat()

@Jolanrensen
Copy link
Collaborator

Alternatively, what's arguably a lot simpler, we could just explode the groups column. Like:

internal fun GroupBy<*, *>.concatWithKeys(): DataFrame<*> =
    toDataFrame().explode { groups }

This will generate extra key values where necessary and keep the grouped columns in a column group, avoiding any potential name clashes :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants