Wrong results when parquet page index filtering is enabled #4002

Closed
alamb opened this issue Oct 28, 2022 · 1 comment · Fixed by #3967
Labels
bug Something isn't working

Comments

@alamb
Contributor

alamb commented Oct 28, 2022

Describe the bug
When I enable page index filtering, queries return incorrect answers.

Note that page index filtering is not enabled by default (we are still working on it), so this issue is unlikely to affect users.

To Reproduce

  1. Download data from repro.zip
  2. Run datafusion CLI:

Expected behavior
The same answer should be produced with and without page index filtering enabled. However, the answers are different.

Without page index filtering, 15963 rows are produced:

(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=false datafusion-cli -f script.sql 
DataFusion CLI v13.0.0
0 rows in set. Query took 0.001 seconds.
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.parquet.enable_page_index  | false   |
| datafusion.execution.parquet.pushdown_filters   | false   |
| datafusion.execution.parquet.reorder_filters    | false   |
| datafusion.execution.time_zone                  | UTC     |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.explain.physical_plan_only           | false   |
| datafusion.optimizer.filter_null_join_keys      | false   |
| datafusion.optimizer.max_passes                 | 3       |
| datafusion.optimizer.skip_failed_rules          | true    |
+-------------------------------------------------+---------+
12 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 53819           |
+-----------------+
1 row in set. Query took 0.002 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 15963           |
+-----------------+
1 row in set. Query took 0.002 seconds.

With page index filtering, 0 rows are produced 😱

(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=true datafusion-cli -f script.sql 
DataFusion CLI v13.0.0
0 rows in set. Query took 0.001 seconds.
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.parquet.enable_page_index  | true    |
| datafusion.execution.parquet.pushdown_filters   | false   |
| datafusion.execution.parquet.reorder_filters    | false   |
| datafusion.execution.time_zone                  | UTC     |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.explain.physical_plan_only           | false   |
| datafusion.optimizer.filter_null_join_keys      | false   |
| datafusion.optimizer.max_passes                 | 3       |
| datafusion.optimizer.skip_failed_rules          | true    |
+-------------------------------------------------+---------+
12 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 53819           |
+-----------------+
1 row in set. Query took 0.002 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.002 seconds.
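
The expectation above can be sketched as an invariant: skipping pages based on min/max statistics is only an optimization, so it must never change the row count. The following is a simplified model in plain Rust, not DataFusion's actual code; `Page`, `count_without_pruning`, and `count_with_pruning` are hypothetical names, and the data is made up for illustration.

```rust
// Simplified model of min/max page pruning; not DataFusion's actual code.

/// A hypothetical data page: its rows plus the min/max statistics
/// that a page index would record for it.
struct Page {
    rows: Vec<i64>,
    min: i64,
    max: i64,
}

impl Page {
    /// Build a page and derive its statistics from the rows.
    fn new(rows: Vec<i64>) -> Self {
        let min = *rows.iter().min().expect("page must not be empty");
        let max = *rows.iter().max().expect("page must not be empty");
        Page { rows, min, max }
    }
}

/// Count rows equal to `target` by scanning every page.
fn count_without_pruning(pages: &[Page], target: i64) -> usize {
    pages
        .iter()
        .flat_map(|p| p.rows.iter())
        .filter(|&&v| v == target)
        .count()
}

/// Count rows equal to `target`, skipping pages whose [min, max]
/// range proves that no row can match.
fn count_with_pruning(pages: &[Page], target: i64) -> usize {
    pages
        .iter()
        .filter(|p| p.min <= target && target <= p.max)
        .flat_map(|p| p.rows.iter())
        .filter(|&&v| v == target)
        .count()
}

fn main() {
    let pages = vec![
        Page::new(vec![1, 2, 3]),
        Page::new(vec![3, 4, 5]),
        Page::new(vec![7, 8, 9]),
    ];
    let plain = count_without_pruning(&pages, 3);
    let pruned = count_with_pruning(&pages, 3);
    // Pruning may skip work, but it must not change the answer.
    assert_eq!(plain, pruned);
    println!("without pruning: {}, with pruning: {}", plain, pruned);
}
```

A result of 0 rows with pruning enabled, as in the output above, suggests that pages were skipped without the statistics actually proving that no row could match.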

Additional context
I found this issue and reproducer while working on the integration test #3976

I suspect @Ted-Jiang is already working on this issue

@alamb alamb added the bug Something isn't working label Oct 28, 2022
@Ted-Jiang
Member

@alamb I found an error in my local branch:


[2022-10-29T15:51:31Z ERROR datafusion::physical_plan::file_format::parquet] 
Error evaluating page index predicate values Error during planning: 
Can not create statistics record batch: Invalid argument error:
   Column 'container_min' is declared as non-nullable but contains null values

It seems the batch cannot be created in build_statistics_record_batch when the statistics contain NULL values.
The container column is declared as:

container:           REQUIRED BINARY L:STRING R:0 D:0

The BINARY type is not supported yet, so a null array is returned for it. I will add a fallback.
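
The failure described above can be modelled in a few lines. This is a minimal sketch in plain Rust, not the arrow or DataFusion API: `StatsColumn`, `build_stats_column`, and `can_skip_page` are hypothetical names. The fallback shown (keep the page whenever its statistics are unknown) is an assumption about what a safe fallback could look like, not necessarily the fix that was merged.

```rust
// Simplified model; all names here are hypothetical, not arrow/DataFusion APIs.

/// Per-page minimum values for one column. `None` means the statistic
/// could not be extracted (for example, an unsupported physical type).
struct StatsColumn {
    name: String,
    nullable: bool,
    values: Vec<Option<String>>,
}

/// Mirrors the validation in the error message: a column declared
/// non-nullable must not contain null values.
fn build_stats_column(
    name: &str,
    nullable: bool,
    values: Vec<Option<String>>,
) -> Result<StatsColumn, String> {
    if !nullable && values.iter().any(|v| v.is_none()) {
        return Err(format!(
            "Column '{}' is declared as non-nullable but contains null values",
            name
        ));
    }
    Ok(StatsColumn { name: name.to_string(), nullable, values })
}

/// Assumed fallback: a page may be skipped only when both statistics are
/// known and exclude `target`; unknown statistics always keep the page.
fn can_skip_page(min: Option<&str>, max: Option<&str>, target: &str) -> bool {
    match (min, max) {
        (Some(lo), Some(hi)) => target < lo || target > hi,
        _ => false,
    }
}

fn main() {
    // A REQUIRED column whose statistics came back as nulls fails validation.
    match build_stats_column("container_min", false, vec![None, None]) {
        Ok(_) => println!("built statistics column"),
        Err(e) => println!("error: {}", e),
    }

    // Declaring the statistics column nullable avoids the error.
    let col = build_stats_column("container_min", true, vec![None, None])
        .expect("nullable column accepts nulls");
    println!("{} (nullable: {}) has {} entries", col.name, col.nullable, col.values.len());

    // With unknown statistics, the page is kept rather than pruned.
    assert!(!can_skip_page(None, None, "some_value"));
}
```

The design point is that missing statistics should make pruning conservative: the reader scans the page instead of dropping it, so the query result matches the unpruned run.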
