Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing null counts for struct type columns in the struct stats #714

Merged
merged 7 commits into from
Aug 8, 2022
Merged

Fix parsing null counts for struct type columns in the struct stats #714

merged 7 commits into from
Aug 8, 2022

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Jul 26, 2022

Description

When reading the struct column stats (as opposed to json column stats) the null counts where ignored for struct type columns. Completely ignoring is probably less bad than parsing them wrongly but if the user was filtering on a struct column this may have impacted performance. It was also spamming error logs:

[2022-07-26T19:33:52Z ERROR deltalake::action] Expect type of stats_parsed.nullRecords value to be struct, got: {integer: 0, null: 1, boolean: 0, double: 0, decimal: 0, string: 0, binary: 0, date: 0, timestamp: 0, struct: {struct_element: 0}, map: 0, array: 0, nested_struct: {struct_element: {nested_struct_element: 0}}, struct_of_array_of_map: {struct_element: 0}}

I probably should have done this as part of #656. Sorry for introducing this slight regression. I think some logging was getting suppressed during the unittest so I didn't notice it was an issue.

Related Issue(s)

Relates to #653 but that was for the most part already solved by #656

Changes

  • In the test assert that all stats are the same not just the min_values stats. This prevents a similar mistake in future.
  • Updated logic to handle struct types when parsing null_counts from struct stats.

@Tom-Newton Tom-Newton changed the title Parse null counts for struct types in struct stats Fix parsing null counts for struct types in struct stats Jul 26, 2022
@Tom-Newton Tom-Newton changed the title Fix parsing null counts for struct types in struct stats Fix parsing null counts for struct type columns in the struct stats Jul 26, 2022
rust/src/action.rs Outdated Show resolved Hide resolved
@Tom-Newton
Copy link
Contributor Author

Unfortunately the PyArrow 4.0.0 build is failing

-- Could NOT find Arrow (missing: Arrow_DIR)
      -- Checking for module 'arrow'
      --   No package 'arrow' found
      CMake Error at /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
        Could NOT find Arrow (missing: ARROW_INCLUDE_DIR ARROW_LIB_DIR
        ARROW_FULL_SO_VERSION ARROW_SO_VERSION)
      Call Stack (most recent call first):
        /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
        cmake_modules/FindArrow.cmake:450 (find_package_handle_standard_args)
        cmake_modules/FindArrowPython.cmake:46 (find_package)
        CMakeLists.txt:236 (find_package)

https://github.com/delta-io/delta-rs/runs/7705681944?check_suite_focus=true
Though I think this is probably not caused by my changes?

@houqp
Copy link
Member

houqp commented Aug 7, 2022

@Tom-Newton it's not related to your change, but caused by the new arrow 9 release, see #724 (comment). You can ignore it for now and we can handle the fix in a different PR.

@houqp houqp requested review from fvaleye and houqp August 7, 2022 07:49
@roeap roeap merged commit bc0b2ad into delta-io:main Aug 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants