Coalescing outer join panics (and/or loses) columns from right frame if join keys expressions have overlapping names #16289

Closed
2 tasks done
wence- opened this issue May 17, 2024 · 2 comments · Fixed by #16329
Assignees
Labels
accepted Ready for implementation bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@wence-
Contributor

wence- commented May 17, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
left = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [5, 6, 7]})
right = pl.DataFrame({"a": [2, 3, 4], "c": [4, 5, 6]})
left.join(right, on=[pl.col("a")], how="outer_coalesce")
# shape: (4, 4)
# ┌─────┬──────┬──────┬─────────┐
# │ a   ┆ b    ┆ c    ┆ c_right │
# │ --- ┆ ---  ┆ ---  ┆ ---     │
# │ i64 ┆ i64  ┆ i64  ┆ i64     │
# ╞═════╪══════╪══════╪═════════╡
# │ 2   ┆ 4    ┆ 6    ┆ 4       │
# │ 3   ┆ 5    ┆ 7    ┆ 5       │
# │ 4   ┆ null ┆ null ┆ 6       │
# │ 1   ┆ 3    ┆ 5    ┆ null    │
# └─────┴──────┴──────┴─────────┘

# nonsensical, but ok
left.join(right, on=[pl.col("a"), pl.col("a")], how="outer_coalesce")
# shape: (4, 3)
# ┌─────┬──────┬──────┐
# │ a   ┆ b    ┆ c    │
# │ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64  ┆ i64  │
# ╞═════╪══════╪══════╡
# │ 2   ┆ 4    ┆ 6    │
# │ 3   ┆ 5    ┆ 7    │
# │ 4   ┆ null ┆ null │
# │ 1   ┆ 3    ┆ 5    │
# └─────┴──────┴──────┘

# even more
left.join(right, on=[pl.col("a"), pl.col("a"), pl.col("a")], how="outer_coalesce")
# thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/general.rs:90:25:
# removal index (is 3) should be < len (is 3)

Log output

run JoinExec
join parallel: true
OUTER join dataframes finished
run JoinExec
join parallel: true
OUTER join dataframes finished
run JoinExec
join parallel: true

Issue description

It looks like the coalescing outer join simply drops as many columns from the right dataframe as there are key expressions in the join, without checking whether those expressions refer to distinct columns.
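
To make that hypothesis concrete, here is a toy model in plain Python. It is not the actual Rust code in `general.rs`; the function name `coalesce_drop` and the rule "drop the first right-hand column once per key" are assumptions, chosen only because they reproduce all three observations above (four columns, three columns, then an out-of-range removal at index 3).

```python
def coalesce_drop(left_cols, right_cols, n_keys):
    """Toy model (an assumption, not Polars internals): build the
    uncoalesced outer-join column list, then drop one column per join
    key at the position where the right frame's columns start."""
    # Uncoalesced output: left columns, then right columns, with a
    # "_right" suffix on names that clash with the left frame.
    out = list(left_cols) + [
        f"{c}_right" if c in left_cols else c for c in right_cols
    ]
    first_right = len(left_cols)
    for _ in range(n_keys):
        # Raises IndexError once no right-hand columns remain.
        out.pop(first_right)
    return out


left_cols = ["a", "b", "c"]
right_cols = ["a", "c"]

one_key = coalesce_drop(left_cols, right_cols, 1)   # drops a_right only
two_keys = coalesce_drop(left_cols, right_cols, 2)  # also drops c_right

try:
    coalesce_drop(left_cols, right_cols, 3)
    three_keys = "ok"
except IndexError:
    # Mirrors the panic: removal index 3 with only 3 columns left.
    three_keys = "index error"
```

Under this model the second call silently loses `c_right`, and the third fails for the same reason the panic message gives.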

Expected behavior

I would expect all three of these join expressions to give the same result. The latter two are odd, but they are mathematically equivalent to the first.

Alternatively, Polars could raise an error saying that the join keys would produce overlapping output key names.
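
A minimal sketch of the second option, written as a standalone check on key output names. The helper name `check_unique_key_names` is hypothetical and this is not the check Polars implements. It takes plain strings; in Polars the names could presumably be obtained with `expr.meta.output_name()` for each key expression.

```python
from collections import Counter


def check_unique_key_names(key_names):
    """Hypothetical validation: reject join keys whose output names
    collide, since the coalesced result would be ambiguous."""
    counts = Counter(key_names)
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(
            f"join keys produce duplicate output names: {dupes}"
        )
    return list(key_names)


single = check_unique_key_names(["a"])  # passes through unchanged

try:
    check_unique_key_names(["a", "a", "a"])
    repeated = "accepted"
except ValueError as exc:
    repeated = str(exc)
```

Checking output names rather than expression identity also catches keys that are different expressions but end up with the same name.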

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  0.11.0
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.17.4
fastexcel:            0.10.4
fsspec:               2024.3.1
gevent:               24.2.1
hvplot:               0.10.0
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.0.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.30
torch:                2.3.0.post300
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
@ritchie46
Member

Does this also happen if we don't join on the same name twice? We should raise an error, as it doesn't make sense to join on duplicate columns.

@wence-
Contributor Author

wence- commented May 21, 2024

Thanks!

wence- added a commit to wence-/polars that referenced this issue May 28, 2024
The fix for pola-rs#16289 checked for expression identity when validating the
join keys, but if multiple expressions are not identical, they may
still produce matching output key names. Since this is ambiguous,
catch this more general case and raise.

- Fixes pola-rs#16547