Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-up on Clickbench benchmark #3048

Closed
15 tasks done
waitingkuo opened this issue Aug 5, 2022 · 6 comments
Closed
15 tasks done

Follow-up on Clickbench benchmark #3048

waitingkuo opened this issue Aug 5, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@waitingkuo
Copy link
Contributor

waitingkuo commented Aug 5, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)

The goal is to pass all 43 cases to submit to https://benchmark.clickhouse.com/

i did some initial progress in #2902
We now passed 36 cases
This issue is to track the progress to pass all 43 cases.

  • 11 it crashes (fixed since clickbench has another parquet that marks those column as string)
  • 12 it crashes (same as 12)
  • 14 it crashes (same as 12)
  • 37
    • EventDate is uint in parquet, i can't add schema via create external table as parquet location 'xxx' to cast it to date; and then compare (solved by modifying queries.sql to cast in sql query)
  • 38 - same as 37
  • 39
    • same as 37
    • Binary cannot be compared with UTF-8 (fixed by casting binary to UTF-8)
  • 40 - same as 37
  • 41 - same as 37
  • 42 - same as 37
  • 43 - i need to fix 37-42 first then come back to this to see the reason

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@waitingkuo waitingkuo added the enhancement New feature or request label Aug 5, 2022
@waitingkuo
Copy link
Contributor Author

waitingkuo commented Aug 10, 2022

  1. passed

original query:

SELECT DATE_TRUNC('minute', EventTime) AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_TRUNC('minute', EventTime) ORDER BY DATE_TRUNC('minute', EventTime) LIMIT 10 OFFSET 1000;

Since datafusion import columns as case-sensitive so we need to modify it as

SELECT DATE_TRUNC('minute', "EventTime") AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate" >= '2013-07-14' AND "EventDate" <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', "EventTime") ORDER BY DATE_TRUNC('minute', "EventTime") LIMIT 10 OFFSET 1000;

EventTime is originally Int in parquet, we need to cast it (as parquet importer doesn't allow us to add schema for now)

SELECT DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) AS M, COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-14' AND "EventDate"::INT::DATE <= '2013-07-15' AND "IsRefresh" = 0 AND "DontCountHits" = 0 GROUP BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")) ORDER BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime") LIMIT 10 OFFSET 1000;

The last isue is that we cannot ORCER BY BY DATE_TRUNC('minute', to_timestamp_seconds("EventTime")
i fixed it as ORCER BY BY DATE_TRUNC('minute', M)

@waitingkuo
Copy link
Contributor Author

Current issue for 11 (12 and 13 have the similar issue)

origin:

SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;

add double quote as column name imported as case-sensitive

SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

Cast to TEXT as we cannot group by binary for now #3035

SELECT "MobilePhoneModel"::TEXT, COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel"::TEXT <> '' GROUP BY "MobilePhoneModel"::TEXT ORDER BY u DESC LIMIT 10;

Unfortunately it crashes. i feel like group by casted doesn't work here
i can reproduce similar issue by select count(distinct a) from (select '1' as a) b group by a::TEXT;. this query crashes in datafusion but works in postgresql

@waitingkuo
Copy link
Contributor Author

waitingkuo commented Aug 10, 2022

37-42 passed by

  1. Cast "EventTime" to Timestamp by to_timestamp_seconds
  2. Cast Binary to TEXT so that it could be compared with another UTF-8

@waitingkuo
Copy link
Contributor Author

waitingkuo commented Aug 10, 2022

#3050 Cannot group by binary only happens in mac, it works in linux i remove all the casting that cast from binary to text, and it works in linux now

finally all test cases passed
i submitted the pr and also listed some issues for future improvements: https://github.com/waitingkuo/ClickBench/blob/datafusion/datafusion/README.md

this is incorrect

@waitingkuo
Copy link
Contributor Author

i finally found the root cause for most of the issues..... their full-dataset and partial-dataset have different schema, submitted a issue there ClickHouse/ClickBench#18

@waitingkuo
Copy link
Contributor Author

close again :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant