
[Bug]: Nessie GC fails to handle Iceberg column names with quotes in S3 #8328

Closed
clintf1982 opened this issue Apr 14, 2024 · 14 comments · Fixed by #8420
clintf1982 commented Apr 14, 2024

What happened

Using Spark and Iceberg, I created a table X with columns "_id", B, C,
partitioned by truncate(2000, "_id").
When inserting data into the table, under the data folder of the table in S3, folders of this kind were created:
s3://bucket/folder/X_12c98633-5492-48b4-950f-ec23497db481/data/"_id"_trunc=0/
After running the GC tool, I received an exception:
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 63.
at java.base/java.net.URI.create(URI.java:906)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.dataFileUri(IcebergContentToFiles.java:262)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.lambda$allDataFiles$3(IcebergContentToFiles.java:158)
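
The root cause can be demonstrated in isolation (the bucket and file names below are placeholders): java.net.URI rejects unencoded double quotes in a path, which is exactly the IllegalArgumentException the stack trace shows.

```java
import java.net.URI;

public class QuotedPathDemo {
    public static void main(String[] args) {
        // Hypothetical path mimicking the failing data-file location:
        // an unencoded '"' is an illegal URI character, so URI.create()
        // throws IllegalArgumentException (wrapping URISyntaxException).
        String location = "s3://bucket/folder/data/\"_id\"_trunc=0/file.parquet";
        try {
            URI.create(location);
            System.out.println("parsed");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```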

How to reproduce it

1)
CREATE TABLE X (
"_id" bigint NOT NULL COMMENT 'unique id',
A bigint,
B bigint)
partitioned by truncate(2000, "_id")
2)
insert into X values (100, 1, 1)
3)
drop table X
4)
Run the GC tool with the option to keep only the last commit.

Nessie server type (docker/uber-jar/built from source) and version

Spark 3.4
Iceberg 1.3.0
Nessie 0.70.0

Client type (Ex: UI/Spark/pynessie ...) and version

Spark 3.4 (When creating the table and writing the data) with Iceberg tables.
Ran the GC tool from command line.

Additional information

No response


dimas-b commented Apr 15, 2024

Hi @clintf1982 , Could you be more specific in the "How to reproduce it" section about how the invalid URI happened to be introduced into the Iceberg files? Were all URIs correct before running Nessie GC? What was the original URI that got quotes after GC? Thx!

@clintf1982

@dimas-b, sorry and thanks Dima, I rephrased and added all the details, hope it is enough. Let me know if anything is missing.

@dimas-b dimas-b self-assigned this Apr 16, 2024

dimas-b commented Apr 16, 2024

Example S3 URI: s3://EXAMPLE_BUCKET/t3_f1808b75-737b-418a-b808-68c00847a4ab/data/"_id"_trunc=0/00000-1-65449b51-b183-4cb5-ba16-9802425dab38-0-00001.parquet

Corresponding AWS download URL (truncated): https://EXAMPLE_BUCKET.s3.us-west-2.amazonaws.com/t3_f1808b75-737b-418a-b808-68c00847a4ab/data/%22_id%22_trunc%3D0/00000-1-65449b51-b183-4cb5-ba16-9802425dab38-0-00001.parquet?response-content-disposition=attachment
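
For reference, the quotes are precisely what makes the raw S3 URI unparseable; java.net.URI accepts the path once illegal characters are percent-encoded, e.g. via the multi-argument constructor, which quotes them automatically (a minimal demo; bucket and file names are placeholders):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EncodeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // The multi-argument URI constructor percent-encodes illegal
        // characters such as '"' in the path component, yielding %22,
        // the same encoding AWS uses in its download URLs.
        URI uri = new URI("s3", "EXAMPLE_BUCKET",
            "/data/\"_id\"_trunc=0/file.parquet", null);
        System.out.println(uri.toASCIIString());
    }
}
```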


dimas-b commented Apr 16, 2024

@clintf1982 : Do you really need quotes in the column name, or was it perhaps a coincidence?

This still looks like a bug in Nessie GC, but I wonder if it is a blocker for you :)


dimas-b commented Apr 16, 2024

I was able to reproduce with a simpler script (manually):

Spark SQL:

spark-sql ()> create table t3(`"_id"` bigint not null, a int) partitioned by (truncate(2000, `"_id"`));
Time taken: 0.641 seconds
spark-sql ()> insert into t3 values (1,1);
Time taken: 3.629 seconds
spark-sql ()> select * from t3;
1	1
Time taken: 1.637 seconds, Fetched 1 row(s)
spark-sql ()> describe t3;
"_id"               	bigint              	                    
a                   	int                 	                    
                    	                    	                    
# Partitioning      	                    	                    
Part 0              	truncate(2000, `"_id"`)	                    
Time taken: 0.05 seconds, Fetched 5 row(s)

GC:

$ docker run --network host \
  -e AWS_REGION=$AWS_REGION \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID  \
  quay.io/projectnessie/nessie-gc:0.79.0 gc --inmemory

Exception:

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Illegal character in path at index 68: s3://EXAMPLE_BUCKET/t3_f1808b75-737b-418a-b808-68c00847a4ab/data/"_id"_trunc=0/00000-1-65449b51-b183-4cb5-ba16-9802425dab38-0-00001.parquet
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:540)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
	at java.base/java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:653)
	at java.base/java.util.concurrent.ForkJoinPool.invoke(ForkJoinPool.java:2822)
	at org.projectnessie.gc.expire.local.DefaultLocalExpire.expire(DefaultLocalExpire.java:73)
[...]


dimas-b commented Apr 16, 2024

Note: Nessie GC appears to work fine if/when the column name does not have quotes, e.g.:

spark-sql ()> create table t2(_id bigint not null, a int) partitioned by (truncate(2000, `_id`));
Time taken: 3.806 seconds
spark-sql ()> insert into t2 values (1,1);
Time taken: 4.819 seconds
spark-sql ()> select * from t2;
1	1
Time taken: 1.989 seconds, Fetched 1 row(s)
spark-sql ()> describe t2;
_id                 	bigint              	                    
a                   	int                 	                    
                    	                    	                    
# Partitioning      	                    	                    
Part 0              	truncate(2000, _id) 	                    
Time taken: 0.223 seconds, Fetched 5 row(s)

@clintf1982

@dimas-b
Yeah, yours is simpler. First off, I'm glad everything is clear and that you succeeded in reproducing it.

  1. You are right: if we didn't have columns with quotes, there wouldn't be a problem. However, even if we changed the columns now and removed the quotes, the history would still contain them and GC would still fail (at least I think so), which leaves us unable to clean up the unused data/metadata at this point.
  2. Since Nessie/Iceberg/S3 had no trouble writing such files, GC shouldn't fail on them. I guess java.net.URI is not suitable for S3 paths, although from what I saw in the code, java.net.URI is used heavily.

Thank you for your time, it is appreciated. I will follow this issue to see how we should move forward.

@clintf1982

@dimas-b It is a blocker for us: we have many tables with quoted columns, and we can't GC our data.

@dimas-b dimas-b changed the title [Bug]: In GC - URI parser doesn't handle all cases for S3 paths [Bug]: NEssie GC fails to handle Iceberg column names with quotes in S3 Apr 16, 2024
@dimas-b dimas-b changed the title [Bug]: NEssie GC fails to handle Iceberg column names with quotes in S3 [Bug]: Nessie GC fails to handle Iceberg column names with quotes in S3 Apr 16, 2024

dimas-b commented Apr 17, 2024

Iceberg metadata (GenericDataFile) appears to store S3 URIs without escaping quotes 🤔

dimas-b added a commit to dimas-b/nessie that referenced this issue Apr 29, 2024
... because some S3 URIs are not parseable by java.net.URI

Fixes projectnessie#8328
dimas-b added a commit that referenced this issue May 2, 2024
* Use custom class for handling storage URIs

... because some S3 URIs are not parseable by java.net.URI

Fixes #8328
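
The merged fix replaces java.net.URI parsing with a custom class for storage URIs. A minimal sketch of that idea (class and method names here are hypothetical, not Nessie's actual code): treat locations as plain strings so characters like '"', '#', and '?' never go through RFC-3986 parsing.

```java
// Hypothetical sketch of a string-based storage location. Simple string
// operations replace java.net.URI, so quotes in paths are harmless.
public final class StorageUri {
    private final String location;

    private StorageUri(String location) {
        this.location = location;
    }

    static StorageUri of(String location) {
        return new StorageUri(location);
    }

    // Everything before "://" is the scheme, if present.
    String scheme() {
        int idx = location.indexOf("://");
        return idx < 0 ? null : location.substring(0, idx);
    }

    // Plain prefix check: "is this file under the table's base location?"
    boolean isSubUriOf(StorageUri base) {
        String prefix =
            base.location.endsWith("/") ? base.location : base.location + "/";
        return location.startsWith(prefix);
    }

    @Override
    public String toString() {
        return location;
    }

    public static void main(String[] args) {
        StorageUri base = StorageUri.of("s3://bucket/table");
        StorageUri file =
            StorageUri.of("s3://bucket/table/data/\"_id\"_trunc=0/f.parquet");
        // Quotes in the path are no problem for string handling.
        System.out.println(file.isSubUriOf(base));
        System.out.println(file.scheme());
    }
}
```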

dimas-b commented May 3, 2024

@clintf1982 : This should be fixed in v. 0.81.0


dimas-b commented May 3, 2024

@clintf1982 if you have # in column names, please do not use GC in 0.81.0... there seems to be another latent bug... we're investigating.


dimas-b commented May 6, 2024

The problem with # and ? chars appears to be an Iceberg-side concern: apache/iceberg#10279
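
The # and ? cases are subtler than the quote case: java.net.URI parses such a location without error, but interprets the characters as fragment and query delimiters, silently truncating the path. A short demo (the path is a made-up example):

```java
import java.net.URI;

public class HashInPathDemo {
    public static void main(String[] args) {
        // '#' is legal to URI.create(), but it marks the start of the
        // fragment, so everything after it is dropped from the path.
        URI uri = URI.create("s3://bucket/data/a#col_trunc=0/file.parquet");
        System.out.println("path     = " + uri.getPath());
        System.out.println("fragment = " + uri.getFragment());
    }
}
```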

@clintf1982

@dimas-b Thanks a lot, I will check

@clintf1982

@dimas-b Currently we don't use # or ?, so PR10283 doesn't block us.
The work done here in GC 0.81.0 is much appreciated; I expect to be able to verify it on our data soon.
