DeltaTable.to_pyarrow_dataset() fails for tables containing map types #713

Closed
Tom-Newton opened this issue Jul 26, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@Tom-Newton
Contributor

Environment

Delta-rs version: 0.5.8

Binding: Python

Environment:

  • Cloud provider: Azure
  • OS: Ubuntu 18.04
  • Other: Python 3.8

Bug

What happened:
When using DeltaTable.to_pyarrow_dataset() for a table containing map types it crashes with:

~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_deltalake/deltalake/schema.py in pyarrow_field_from_dict(field)
    281     return pyarrow.field(
    282         field["name"],
--> 283         pyarrow_datatype_from_dict(field),
    284         field["nullable"],
    285         field.get("metadata"),

~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_deltalake/deltalake/schema.py in pyarrow_datatype_from_dict(json_dict)
    270             return pyarrow.float64()
    271     else:
--> 272         return pyarrow.type_for_alias(type_class)
    273 
    274 

~/.cache/bazel/_bazel_tomnewton/8716cf88e62bedfa3a3dfe6a436af176/execroot/WayveCode/bazel-out/k8-opt/bin/tools/jupyter_services.runfiles/pip-svc_pyarrow/pyarrow/types.pxi in pyarrow.lib.type_for_alias()

ValueError: No type alias for map

What you expected to happen:
Open the table without error.

How to reproduce it:
Use to_pyarrow_dataset() on any table containing map types.
I created a test that catches this.

More details:
I think there are 2 reasons why this doesn't work currently:

  1. There is a bug in the Python code that parses the schema JSON.
  2. The version of Rust arrow used (15) does not support map types. After fixing point 1 we get ArrowException: C Data interface error: The datatype "+m" is still not supported in Rust implementation from this line. I'm unsure if this support is available in the latest version of Rust arrow.
@Tom-Newton Tom-Newton added the bug Something isn't working label Jul 26, 2022
@Tom-Newton Tom-Newton changed the title Map types are not supported via the Python API DeltaTable.to_pyarrow_dataset() fails for tables containing map types Jul 26, 2022
@Tom-Newton
Contributor Author

Tom-Newton commented Jul 26, 2022

I have a draft PR #712 which fixes the first issue and I'm investigating the second issue.

@Tom-Newton
Contributor Author

It looks like 2 in progress PRs will fix this #703 #684

@Tom-Newton
Contributor Author

Tom-Newton commented Aug 10, 2022

> It looks like 2 in progress PRs will fix this #703 #684

Ok both of these PRs have merged but map types are not quite working fully.

  1. We need to upgrade arrow by one more version, 18.0.0 -> 19.0.0, to include the fix from apache/arrow-rs#2037 ("Support FFI / C Data Interface for MapType"). #703 ("feat: integrate with object_store / datafusion APIs") only upgraded to 18.0.0 because there is currently no release of datafusion that supports 19.0.0.
  2. Some nested types have PyArrow casting issues. Potentially this could be resolved within delta-rs but it should definitely be resolved by https://issues.apache.org/jira/browse/ARROW-17349

@wjones127
Collaborator

Hmm, I just learned about a PyArrow option, use_compliant_nested_type, which might change some things. I'll look into this soon.

Docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter
Background: https://issues.apache.org/jira/browse/ARROW-11497

@wjones127
Collaborator

FYI I fixed the upstream casting issue. The fix will be available in PyArrow 10.0.0, which will be released in the next couple of weeks.

@Tom-Newton
Contributor Author

I've tested this with PyArrow 10.0.0 and everything seems to be working. Thanks to everyone who contributed to fixing this, especially @wjones127.
