Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: write_csv result has incorrect formatting #4876

Closed
DDtKey opened this issue Jan 11, 2023 · 8 comments
Closed

Regression: write_csv result has incorrect formatting #4876

DDtKey opened this issue Jan 11, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@DDtKey
Copy link
Contributor

DDtKey commented Jan 11, 2023

Describe the bug
write_csv result contains unexpected format for EXTRACT(YEAR FROM ...). It looks like floating number for some reason.

To Reproduce
Example of file:

name,created_at,last_report
Sales,1825-08-29T07:29:01.256,2022-08-29
Marketing,2017-02-16T07:29:01.256,2022-02-16
IT,2019-04-04T07:29:01.256,2021-04-04
Finance,2016-09-14T07:29:01.256,2021-09-14
HR,2017-03-01T07:29:01.256,2022-03-01

SQL:

SELECT d.name, EXTRACT(YEAR FROM d.created_at) as year, d.last_report + INTERVAL '12' MONTH as deadline FROM deps d ORDER BY d.created_at

It returns:

name,year,deadline
Sales,1825.0,2023-08-29
Finance,2016.0,2022-09-14
Marketing,2017.0,2023-02-16
HR,2017.0,2023-03-01
IT,2019.0,2022-04-04

So result of EXTRACT(YEAR FROM d.created_at) as year has floating format for some reason.

While data_frame.show() works expected:

+-----------+------+------------+
| name      | year | deadline   |
+-----------+------+------------+
| Sales     | 1825 | 2023-08-29 |
| Finance   | 2016 | 2022-09-14 |
| Marketing | 2017 | 2023-02-16 |
| HR        | 2017 | 2023-03-01 |
| IT        | 2019 | 2022-04-04 |
+-----------+------+------------+

Expected behavior
Result should be consistent with show and previous version datafusion 15.0.0 (it used to work)

Additional context
Add any other context about the problem here.

@DDtKey DDtKey added the bug Something isn't working label Jan 11, 2023
@DDtKey DDtKey mentioned this issue Jan 11, 2023
4 tasks
@DDtKey DDtKey changed the title Regression: write_csv result has incorrect format for EXTRACT(YEAR FROM ...) Regression: write_csv result has incorrect formatting Jan 11, 2023
@alamb
Copy link
Contributor

alamb commented Jan 11, 2023

(arrow_dev) alamb@MacBook-Pro-8:~/Software/influxdb_iox2$ cat /tmp/data.csv 
name,created_at,last_report
Sales,1825-08-29T07:29:01.256,2022-08-29
Marketing,2017-02-16T07:29:01.256,2022-02-16
IT,2019-04-04T07:29:01.256,2021-04-04
Finance,2016-09-14T07:29:01.256,2021-09-14
HR,2017-03-01T07:29:01.256,2022-03-01

You are correct that extract is now a float -- I think that was on purpose to match posgres: #3997

(arrow_dev) alamb@MacBook-Pro-8:~/Software/influxdb_iox2$ datafusion-cli 
DataFusion CLI v16.0.0
❯ select * from '/tmp/data.csv';
+-----------+-------------------------+-------------+
| name      | created_at              | last_report |
+-----------+-------------------------+-------------+
| Sales     | 1825-08-29T07:29:01.256 | 2022-08-29  |
| Marketing | 2017-02-16T07:29:01.256 | 2022-02-16  |
| IT        | 2019-04-04T07:29:01.256 | 2021-04-04  |
| Finance   | 2016-09-14T07:29:01.256 | 2021-09-14  |
| HR        | 2017-03-01T07:29:01.256 | 2022-03-01  |
+-----------+-------------------------+-------------+
5 rows in set. Query took 0.020 seconds.
❯ SELECT d.name, arrow_typeof(EXTRACT(YEAR FROM d.created_at)) as year, d.last_report + INTERVAL '12' MONTH as deadline FROM '/tmp/data.csv' d ORDER BY d.created_at;
+-----------+---------+------------+
| name      | year    | deadline   |
+-----------+---------+------------+
| Sales     | Float64 | 2023-08-29 |
| Finance   | Float64 | 2022-09-14 |
| Marketing | Float64 | 2023-02-16 |
| HR        | Float64 | 2023-03-01 |
| IT        | Float64 | 2022-04-04 |
+-----------+---------+------------+
❯ 

Perhaps you can add a workaround by explicitly casting to numeric:

❯ SELECT d.name, EXTRACT(YEAR FROM d.created_at)::int as year, d.last_report + INTERVAL '12' MONTH as deadline FROM '/tmp/data.csv' d ORDER BY d.created_at;

@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 11, 2023

Thanks for clarifying @alamb!

But I have concern about difference between show and write_csv. A bit misleading.

Why does it have different formatting rules? I just expected that show just merge content and prints it.

But in general, we can close this issue I guess

@alamb
Copy link
Contributor

alamb commented Jan 11, 2023

The different formatting is due to differences in the upstream arrow crate -- I have filed apache/arrow-rs#3513 to track

@alamb
Copy link
Contributor

alamb commented Jan 11, 2023

But in general, we can close this issue I guess

I will do so -- thank you very much for this report DDtKey - keep them coming!

@alamb alamb closed this as completed Jan 11, 2023
@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 16, 2023

@alamb btw, just wanted to mention, that postgres returns exactly 2023 for SELECT EXTRACT(YEAR FROM now());, but with numeric type (SELECT pg_typeof(EXTRACT(YEAR FROM now()));)

I just meant that it could lead to misunderstanding why the same query before returned expected results and now it's float for year, so I hope issue with formatting in arrow gonna negotiate this one

@alamb
Copy link
Contributor

alamb commented Jan 17, 2023

postgres=# SELECT pg_typeof(EXTRACT(YEAR FROM now()));
 pg_typeof 
-----------
 numeric
(1 row)

Indeed -- I think we switched extract to return float but maybe it should be numeric? I can't remember the details -- perhaps @waitingkuo does?

@waitingkuo
Copy link
Contributor

Postgrseql's extract returns double precision (f64 for datafusion) and
date_part return numeric (decimal for datafusion)

quoted from https://www.postgresql.org/docs/15/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
For historical reasons, the date_part function returns values of type double precision. This can result in a loss of precision in certain uses. Using extract is recommended instead.

the return type for extract is still float but not decimal yet, it's tracked here
#3996

It's pending since the Decimal type wasn't consistent yet. #3996 (comment)
Is it recommended to change the return type now? @alamb @tustvold

@alamb
Copy link
Contributor

alamb commented Jan 17, 2023

Is it recommended to change the return type now? @alamb @tustvold

I think it would be interesting to try out. I am not sure what else is left related to "experimental" support for Decimal -- it seems pretty full featured now to me, given the work from @liukun4515 and others

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants