Cannot query S3 paths containing whitespace #2799

Closed
andygrove opened this issue Sep 28, 2022 · 8 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@andygrove
Member

Describe the bug

I cannot register a path containing a space.

create external table test2 stored as parquet location 's3://andygrove-benchmark-data/trip data/yellow_tripdata_2022-06.parquet';
ObjectStore(NotFound { path: "trip%20data/yellow_tripdata_2022-06.parquet"

It works if I change the path to not have spaces:

create external table test stored as parquet location 's3://andygrove-benchmark-data/trip_data/yellow_tripdata_2022-06.parquet';
0 rows in set. Query took 0.429 seconds.


@andygrove andygrove added the bug label Sep 28, 2022
@andygrove andygrove transferred this issue from apache/datafusion Sep 28, 2022
@andygrove
Member Author

@tustvold Here is another potential object store bug that I found

@tustvold
Contributor

tustvold commented Sep 28, 2022

This is sadly intentional; see https://docs.rs/object_store/latest/object_store/path/struct.Path.html#path-safety. I'm not really sure what can be done about this...

@andygrove
Member Author

We should at least add a check and throw an error? The S3 error response is somewhat obscure.

It will be a shame if DataFusion and Ballista cannot support querying certain public data sets like nyc-tlc because they have spaces in paths. Other query engines support it. One example: https://coiled.io/blog/nyc-taxi-parquet-csv-index-error/

I will plan on digging into this more soon to understand the issue better.
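
A minimal sketch of the kind of up-front check suggested above, written in Python for brevity. `validate_location` is a hypothetical helper for illustration only; it is not part of DataFusion, Ballista, or object_store, and the error wording is an assumption.

```python
def validate_location(location: str) -> str:
    """Return the location unchanged, or raise a clear error if it contains whitespace.

    Hypothetical helper: illustrates failing early with a readable message
    instead of surfacing an obscure NotFound response from the object store.
    """
    for index, char in enumerate(location):
        if char.isspace():
            raise ValueError(
                f"location {location!r} contains whitespace at offset {index} "
                "and is not a valid URL"
            )
    return location


# The underscore variant from the report passes through untouched.
ok = validate_location(
    "s3://andygrove-benchmark-data/trip_data/yellow_tripdata_2022-06.parquet"
)

# The variant with a space is rejected before any request is made.
try:
    validate_location(
        "s3://andygrove-benchmark-data/trip data/yellow_tripdata_2022-06.parquet"
    )
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Rejecting the location is only one option; the later comments in this thread suggest that paths with spaces can in fact be supported.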

@tustvold
Contributor

tustvold commented Sep 28, 2022

> We should at least add a check and throw an error?

Agreed, I would have expected DataFusion to refuse that query as the parquet location is not a valid URL. Possibly ListingTableUrl is escaping rather than parsing.

> It will be a shame if DataFusion and Ballista cannot support querying certain public data sets

I welcome alternative suggestions for how to handle path escaping
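
The "escaping rather than parsing" hypothesis above can be illustrated with a small Python sketch using only the standard library. This is not DataFusion's or object_store's code; `split_location` and the in-memory `stored_keys` set are assumptions made up for the example. It only shows how percent-encoding a raw path, and then using the encoded form verbatim as the object key, yields the `trip%20data/...` path seen in the error.

```python
from urllib.parse import quote, unquote


def split_location(location: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key) without any decoding.

    Hypothetical helper for this example only.
    """
    _scheme, _, rest = location.partition("://")
    bucket, _, key = rest.partition("/")
    return bucket, key


raw = "s3://andygrove-benchmark-data/trip data/yellow_tripdata_2022-06.parquet"
bucket, key = split_location(raw)

# Escaping: percent-encode the key so the whole location is a valid URL.
# The space becomes %20, matching the path in the NotFound error.
escaped_key = quote(key)

# Stand-in for the bucket contents: the object is stored under the literal key.
stored_keys = {"trip data/yellow_tripdata_2022-06.parquet"}

# Using the escaped form verbatim misses the object ...
found_without_decoding = escaped_key in stored_keys

# ... while decoding it back to the literal key finds it.
found_after_decoding = unquote(escaped_key) in stored_keys
```

In this sketch the lookup succeeds only when the encoded and literal forms are kept distinct, which is consistent with the error path containing `%20` while the stored object name contains a space.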

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface and removed bug labels Sep 28, 2022
@andygrove
Member Author

Perhaps some kind of user-provided config to override the default behavior? I will take a look this weekend.

@tustvold
Contributor

Actually looking into this, we don't disallow spaces in paths, I think this might be a DataFusion bug 🤔

@tustvold
Contributor

There is a somewhat related issue I ran into when playing with this, but you most definitely can create, read, etc... objects with paths containing spaces - #2800

@tustvold
Contributor

I think this was fixed by #2801
