bug: Decimal columns dropped from Parquet #318
Comments
Thanks to @kerinin for filing this and providing steps to reproduce the bug. I am able to reproduce it. Here's some further investigation into the issue: I took the provided parquet file and ran the same query. The schema returned suggests the problem is in the prepare step. Furthermore, the prepared file appears to drop the column as well. I've attached the prepared version of the provided parquet file: prepared_table.parquet.zip. I'll continue looking into the issue.
Unfortunately, the results are due to our lack of support for the Decimal datatype. During data preparation, we filter out and drop columns that are not supported by the system. A local run of the preparation produced the following logs: 2023-05-04T08-52-14-engine-stdout.log. If we do not support a column, we log at INFO level that the column is dropped:
Running pqrs on the provided parquet file, we get the following schema:
It appears that the output from pandas included the decimal column. Unfortunately, we do not support decimal columns (we should file a feature request for this). Suggestion: we should also be more transparent about the fact that a column was dropped during preparation.
In the immediate term, could we change from silently dropping the column to returning an error on data load? We could suggest converting the decimal to a float in the error message. Silently dropping data isn't a great user experience.
From the old repo:
Ben said:
Potentially relevant:
One of the main difficulties here is determining the output precision. Many operations have different rules. This would likely need to be part of the type system, or we'd need to allow arithmetic to have special rules for operating on decimal types. We should do this, but it isn't a small amount of work. One idea for a "quick fix" which we could do (while filing a feature request for full support) would be to treat the
This would allow us to support decimal columns for now, with explicit conversion to supported types, and introduce math operations later. If we go this route, we would likely also want a useful error message for things like We'd also need to figure out how to reflect the scale in the type -- possibly
With the latest changes (#323), the flow now fails at file ingestion rather than implicitly dropping the column. The next step is to display a friendlier error message rather than an INTERNAL ERROR.
Description
When I create a table and load the attached Parquet file, the `transfer_value` column is not present in the loaded data.

Here's the file's contents:

This displays the file's contents - it includes a final column named `transfer_value`.

Now, let's create a table for the data and load it.

This produces the following result (notice there's no `transfer_value` column):

File (unzip first): nft_df.parquet.zip