Cannot GROUP BY Binary #3050

Closed
waitingkuo opened this issue Aug 5, 2022 · 4 comments · Fixed by #3098
Labels
bug Something isn't working

Comments

@waitingkuo
Contributor

Describe the bug

This is part of #3048

I was running the ClickBench benchmark. One of its columns is binary, and the test query set contains a GROUP BY on that binary column. I got this error:

Internal error: Unsupported data type in hasher: Binary. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker"

To Reproduce
Steps to reproduce the behavior:

# Download data
wget https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_0.parquet

# Use Datafusion-CLI
➜  datafusion git:(datafusion) ✗ datafusion-cli
DataFusion CLI v10.0.0

# Create External Table 
❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits_0.parquet';
0 rows in set. Query took 0.002 seconds.

# This query works
❯ SELECT "URL" FROM hits LIMIT 10;
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| URL                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                  |
|                                                                                                                                                  |
|                                                                                                                                                  |
|                                                                                                                                                  |
| 687474703a2f2f686f6c6f64696c6e696b2e72752f7275737369612f30356a756c32303133266d6f64656c3d30                                                       |
| 687474703a2f2f6166697368612e6d61696c2e72752f636174616c6f672f3331342f776f6d656e2e72752f656e63793d312670616765332f3f6572726f7661742d70696e6e696b69 |
| 687474703a2f2f626f6e707269782e72752f696e6465782e72752f63696e656d612f6172742f3020393836203432342032333320d181d0b5d0b7d0bed0bd                     |
| 687474703a2f2f626f6e707269782e72752f696e6465782e72752f63696e656d612f6172742f4130303338372c33373937293b2072752926624c                             |
| 687474703a2f2f746f7572732f456b617465676f726979612532462673723d687474703a2f2f736c6f766172656e697965                                               |
|                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set. Query took 0.006 seconds.

# This one doesn't work
❯ SELECT "URL" FROM hits GROUP BY "URL" LIMIT 10;
ArrowError(ExternalError(Execution("Internal error: Unsupported data type in hasher: Binary. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker")))

10 rows in set. Query took 0.006 seconds.

@waitingkuo waitingkuo added the bug Something isn't working label Aug 5, 2022
@waitingkuo
Contributor Author

waitingkuo commented Aug 10, 2022

It looks like this only happens on macOS; it works in a Linux environment.

@Dandandan
Contributor

That would be weird.

I think the error is expected: there is a missing match arm for the Binary data type. I expect that adding the implementation here will solve it:
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/hash_utils.rs#L606
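
To illustrate the kind of gap described above, here is a minimal, standard-library-only sketch. It is not DataFusion's actual code: `Column`, `hash_one`, and `create_row_hashes` are hypothetical names, and the real hasher works on Arrow arrays rather than plain vectors. It only shows how a `match` over data types fails for any type without an arm, and how adding a `Binary` arm lets byte-string values be hashed like any other GROUP BY key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a typed column; not an Arrow or DataFusion type.
#[derive(Debug)]
pub enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
    Binary(Vec<Vec<u8>>),
    /// Deliberately left without a hashing arm to show the failure mode.
    Float64(Vec<f64>),
}

/// Hash a single value with the standard library's default hasher.
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Compute one hash per row, or return an error for an unsupported type.
pub fn create_row_hashes(column: &Column) -> Result<Vec<u64>, String> {
    match column {
        Column::Int64(values) => Ok(values.iter().map(hash_one).collect()),
        Column::Utf8(values) => {
            Ok(values.iter().map(|s| hash_one(s.as_str())).collect())
        }
        // The arm that was missing in the scenario above: hash the raw bytes.
        Column::Binary(values) => {
            Ok(values.iter().map(|b| hash_one(b.as_slice())).collect())
        }
        // Any type without an arm surfaces as an "unsupported" error.
        Column::Float64(_) => {
            Err("Unsupported data type in hasher: Float64".to_string())
        }
    }
}
```

In this sketch, calling `create_row_hashes` on a `Column::Binary` gives equal hashes for rows holding the same bytes, which is the property a hash-based GROUP BY relies on, while `Column::Float64` still returns an error because it has no arm.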

@waitingkuo
Contributor Author

@Dandandan you're right, I accidentally added the cast (`"URL"::TEXT`) and thought that it worked.

@Dandandan
Contributor

I am currently working on this.
