[Bug]: Nessie GC fails to handle Iceberg column names with quotes in S3 #8328
Comments
Hi @clintf1982 , Could you be more specific in the "How to reproduce it" section about how the invalid URI happened to be introduced into the Iceberg files? Were all URIs correct before running Nessie GC? What was the original URI that got quotes after GC? Thx! |
@dimas-b, sorry and thanks Dima, I rephrased and added all the details, hope it is enough. Let me know if anything is missing. |
Example S3 URI: Corresponding AWS download URL (truncated): |
@clintf1982: Do you really need quotes in the column name, or was it perhaps a coincidence? This still looks like a bug in Nessie GC, but I wonder if it is a blocker for you :) |
I was able to reproduce with a simpler script (manually): Spark SQL:
GC:
Exception:
|
Note: Nessie GC appears to work fine if/when the column name does not have quotes, e.g.:
|
@dimas-b
Thank you for your time, it is appreciated. I will follow this issue to see how we should move forward. |
@dimas-b It is a blocker for us: we have many tables with quoted columns and we can't GC our data. |
Iceberg metadata ( |
... because some S3 URIs are not parseable by java.net.URI Fixes projectnessie#8328
* Use custom class for handling storage URIs ... because some S3 URIs are not parseable by java.net.URI Fixes #8328
@clintf1982 : This should be fixed in v. 0.81.0 |
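For readers wondering what "a custom class for handling storage URIs" could look like, here is a minimal sketch. It is not the class Nessie actually ships: the name `StorageLocation`, its fields, and its methods are hypothetical and exist only for illustration. The idea is to split the location on the first `://` and the first `/` after the bucket, so characters that `java.net.URI` rejects (such as double quotes) stay in the path untouched. When a real `URI` is needed, the multi-argument `java.net.URI` constructor percent-encodes illegal characters instead of throwing.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, NOT the class used by Nessie GC.
final class StorageLocation {
    final String scheme;
    final String bucket;
    final String path;

    private StorageLocation(String scheme, String bucket, String path) {
        this.scheme = scheme;
        this.bucket = bucket;
        this.path = path;
    }

    /** Splits "scheme://bucket/path" without validating the characters in the path. */
    static StorageLocation parse(String location) {
        int schemeEnd = location.indexOf("://");
        if (schemeEnd <= 0) {
            throw new IllegalArgumentException("Missing scheme: " + location);
        }
        int bucketStart = schemeEnd + 3;
        int pathStart = location.indexOf('/', bucketStart);
        if (pathStart < 0) {
            pathStart = location.length(); // no path component at all
        }
        return new StorageLocation(
            location.substring(0, schemeEnd),
            location.substring(bucketStart, pathStart),
            location.substring(pathStart));
    }

    /** Builds a java.net.URI; the multi-argument constructor percent-encodes illegal characters. */
    URI toEncodedUri() {
        try {
            return new URI(scheme, bucket, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        StorageLocation loc = parse("s3://bucket/folder/data/\"_id\"_trunc=0/");
        System.out.println(loc.path);           // raw path, quotes preserved
        System.out.println(loc.toEncodedUri()); // quotes encoded as %22
    }
}
```

Keeping the raw path separate from the encoded form matters here, because the object key stored in S3 contains the literal quote characters, not `%22`.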
@clintf1982 if you have |
The problem with |
@dimas-b Thanks a lot, I will check |
@dimas-b Currently we don't use # or ?, so PR10283 doesn't stop us. |
What happened
Using Spark and Iceberg, I created a table X with columns "_id", B, C, partitioned by truncate(2000, "_id").
When inserting data into the table, folders of this kind were created under the table's data folder in S3:
s3://bucket/folder/X_12c98633-5492-48b4-950f-ec23497db481/data/"_id"_trunc=0/
After running the GC tool, I received an exception:
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 63.
at java.base/java.net.URI.create(URI.java:906)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.dataFileUri(IcebergContentToFiles.java:262)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.lambda$allDataFiles$3(IcebergContentToFiles.java:158)
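The failure can be reproduced outside Nessie with plain `java.net.URI`, which suggests the problem is the quote character in the path rather than anything GC-specific. The sketch below uses the example folder from this report; the class and method names are made up for illustration. Index 63 is the position of the first double quote in that string.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only; the class name is hypothetical.
final class QuotedPathUriDemo {
    // Folder path copied from the report; the double quotes are part of the partition folder name.
    static final String LOCATION =
        "s3://bucket/folder/X_12c98633-5492-48b4-950f-ec23497db481/data/\"_id\"_trunc=0/";

    /** Returns the index of the character java.net.URI rejects, or -1 if the string parses. */
    static int illegalCharIndex(String location) {
        try {
            URI.create(location);
            return -1;
        } catch (IllegalArgumentException e) {
            // URI.create wraps the checked URISyntaxException as the cause.
            if (e.getCause() instanceof URISyntaxException) {
                return ((URISyntaxException) e.getCause()).getIndex();
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(illegalCharIndex(LOCATION));                   // 63
        System.out.println(illegalCharIndex(LOCATION.replace("\"", ""))); // -1
    }
}
```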
How to reproduce it
1)
CREATE TABLE X (
"_id" bigint NOT NULL COMMENT 'unique id',
A bigint,
B bigint)
partitioned by truncate(2000, "_id")
2)
insert into X values (100, 1, 1)
3)
drop table X
4)
Run the GC tool with the gc option set to keep only the last commit.
Nessie server type (docker/uber-jar/built from source) and version
Spark 3.4
Iceberg 1.3.0
Nessie 0.70.0
Client type (Ex: UI/Spark/pynessie ...) and version
Spark 3.4 (When creating the table and writing the data) with Iceberg tables.
Ran the GC tool from command line.
Additional information
No response