
[Bug]: Nessie GC fails to handle Iceberg column names with quotes in S3 #8328

Closed
clintf1982 opened this issue Apr 14, 2024 · 14 comments · Fixed by #8420
clintf1982 commented Apr 14, 2024

What happened

Using Spark and Iceberg, I created a table X with columns "_id", B, C,
partitioned by truncate(2000, "_id").
When inserting data into the table, under the data folder of the table in S3, folders of this kind were created:
s3://bucket/folder/X_12c98633-5492-48b4-950f-ec23497db481/data/"_id"_trunc=0/
After running the GC tool, I received an exception:
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 63.
at java.base/java.net.URI.create(URI.java:906)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.dataFileUri(IcebergContentToFiles.java:262)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.lambda$allDataFiles$3(IcebergContentToFiles.java:158)
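
The root cause can be demonstrated in isolation (the bucket and file names below are placeholders): java.net.URI rejects unencoded double quotes in a path, which is exactly the IllegalArgumentException the stack trace shows.

```java
import java.net.URI;

public class QuotedPathDemo {
    public static void main(String[] args) {
        // Hypothetical path mimicking the failing data-file location:
        // an unencoded '"' is an illegal URI character, so URI.create()
        // throws IllegalArgumentException (wrapping URISyntaxException).
        String location = "s3://bucket/folder/data/\"_id\"_trunc=0/file.parquet";
        try {
            URI.create(location);
            System.out.println("parsed");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```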

How to reproduce it

1)
CREATE TABLE X (
"_id" bigint NOT NULL COMMENT 'unique id',
A bigint,
B bigint)
partitioned by truncate(2000, "_id")
2)
insert into X values (100, 1, 1)
3)
drop table X
4)
Run the GC tool with the option to keep only the last commit.

Nessie server type (docker/uber-jar/built from source) and version

Spark 3.4
Iceberg 1.3.0
Nessie 0.70.0

Client type (Ex: UI/Spark/pynessie ...) and version

Spark 3.4 (When creating the table and writing the data) with Iceberg tables.
Ran the GC tool from command line.

Additional information

No response


dimas-b commented Apr 15, 2024

Hi @clintf1982 , Could you be more specific in the "How to reproduce it" section about how the invalid URI happened to be introduced into the Iceberg files? Were all URIs correct before running Nessie GC? What was the original URI that got quotes after GC? Thx!

@clintf1982

@dimas-b, sorry and thanks Dima, I rephrased and added all the details, hope it is enough. Let me know if anything is missing.

@dimas-b dimas-b self-assigned this Apr 16, 2024

dimas-b commented Apr 16, 2024

Example S3 URI: s3://EXAMPLE_BUCKET/t3_f1808b75-737b-418a-b808-68c00847a4ab/data/"_id"_trunc=0/00000-1-65449b51-b183-4cb5-ba16-9802425dab38-0-00001.parquet

Corresponding AWS download URL (truncated): https://EXAMPLE_BUCKET.s3.us-west-2.amazonaws.com/t3_f1808b75-737b-418a-b808-68c00847a4ab/data/%22_id%22_trunc%3D0/00000-1-65449b51-b183-4cb5-ba16-9802425dab38-0-00001.parquet?response-content-disposition=attachment
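
For reference, the quotes are precisely what makes the raw S3 URI unparseable; java.net.URI accepts the path once illegal characters are percent-encoded, e.g. via the multi-argument constructor, which quotes them automatically (a minimal demo; bucket and file names are placeholders):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EncodeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // The multi-argument URI constructor percent-encodes illegal
        // characters such as '"' in the path component, yielding %22,
        // the same encoding AWS uses in its download URLs.
        URI uri = new URI("s3", "EXAMPLE_BUCKET",
            "/data/\"_id\"_trunc=0/file.parquet", null);
        System.out.println(uri.toASCIIString());
    }
}
```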


dimas-b commented Apr 16, 2024

@clintf1982 : Do you really need quotes in the column name, or was it perhaps a coincidence?

This still looks like a bug in Nessie GC, but I wonder if it is a blocker for you :)


dimas-b commented Apr 16, 2024

I was able to reproduce with a simpler script (manually):

Spark SQL:

spark-sql ()> create table t3(`"_id"` bigint not null, a int) partitioned by (truncate(2000, `"_id"`));
Time taken: 0.641 seconds
spark-sql ()> insert into t3 values (1,1);
Time taken: 3.629 seconds
spark-sql ()> select * from t3;
1	1
Time taken: 1.637 seconds, Fetched 1 row(s)
spark-sql ()> describe t3;
"_id"               	bigint              	                    
a                   	int                 	                    
                    	                    	                    
# Partitioning      	                    	                    
Part 0              	truncate(2000, `"_id"`)	                    
Time taken: 0.05 seconds, Fetched 5 row(s)

GC:

$ docker run --network host \
  -e AWS_REGION=$AWS_REGION \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID  \
  quay.io/projectnessie/nessie-gc:0.79.0 gc --inmemory

Exception:

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Illegal character in path at index 68: s3://EXAMPLE_BUCKET/t3_f1808b75-737b-418a-b808-68c00847a4ab/data/"_id"_trunc=0/00000-1-65449b51-b183-4cb5-ba16-9802425dab38-0-00001.parquet
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:540)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
	at java.base/java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:653)
	at java.base/java.util.concurrent.ForkJoinPool.invoke(ForkJoinPool.java:2822)
	at org.projectnessie.gc.expire.local.DefaultLocalExpire.expire(DefaultLocalExpire.java:73)
[...]


dimas-b commented Apr 16, 2024

Note: Nessie GC appears to work fine if/when the column name does not have quotes, e.g.:

spark-sql ()> create table t2(_id bigint not null, a int) partitioned by (truncate(2000, `_id`));
Time taken: 3.806 seconds
spark-sql ()> insert into t2 values (1,1);
Time taken: 4.819 seconds
spark-sql ()> select * from t2;
1	1
Time taken: 1.989 seconds, Fetched 1 row(s)
spark-sql ()> describe t2;
_id                 	bigint              	                    
a                   	int                 	                    
                    	                    	                    
# Partitioning      	                    	                    
Part 0              	truncate(2000, _id) 	                    
Time taken: 0.223 seconds, Fetched 5 row(s)

@clintf1982

@dimas-b
Yeah, yours is simpler. First off, I'm glad everything is clear and that you succeeded in reproducing it.

  1. You are right: if we didn't have columns with quotes, there wouldn't be a problem. However, even if we changed the columns now and removed the quotes, the history would still contain them and GC would still fail (at least I think so), which leaves us unable to clean up the unused data/metadata at this point.
  2. Since Nessie/Iceberg/S3 had no trouble writing such files, GC shouldn't fail on them. I guess java.net.URI is not suitable for S3 paths, although from what I saw in the code, java.net.URI is used heavily.

Thank you for your time, it is appreciated. I will follow this issue to see how we should move forward.

@clintf1982

@dimas-b It is a blocker for us: we have many tables with quoted columns, and we can't GC our data.

@dimas-b dimas-b changed the title [Bug]: In GC - URI parser doesn't handle all cases for S3 paths [Bug]: NEssie GC fails to handle Iceberg column names with quotes in S3 Apr 16, 2024
@dimas-b dimas-b changed the title [Bug]: NEssie GC fails to handle Iceberg column names with quotes in S3 [Bug]: Nessie GC fails to handle Iceberg column names with quotes in S3 Apr 16, 2024

dimas-b commented Apr 17, 2024

Iceberg metadata (GenericDataFile) appears to store S3 URIs without escaping quotes 🤔

dimas-b added a commit to dimas-b/nessie that referenced this issue Apr 29, 2024
... because some S3 URIs are not parseable by java.net.URI

Fixes projectnessie#8328
dimas-b added a commit that referenced this issue May 2, 2024
* Use custom class for handling storage URIs

... because some S3 URIs are not parseable by java.net.URI

Fixes #8328
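
The merged fix replaces java.net.URI parsing with a custom class for storage URIs. A minimal sketch of that idea (class and method names here are hypothetical, not Nessie's actual code): treat locations as plain strings so characters like '"', '#', and '?' never go through RFC-3986 parsing.

```java
// Hypothetical sketch of a string-based storage location. Simple string
// operations replace java.net.URI, so quotes in paths are harmless.
public final class StorageUri {
    private final String location;

    private StorageUri(String location) {
        this.location = location;
    }

    static StorageUri of(String location) {
        return new StorageUri(location);
    }

    // Everything before "://" is the scheme, if present.
    String scheme() {
        int idx = location.indexOf("://");
        return idx < 0 ? null : location.substring(0, idx);
    }

    // Plain prefix check: "is this file under the table's base location?"
    boolean isSubUriOf(StorageUri base) {
        String prefix =
            base.location.endsWith("/") ? base.location : base.location + "/";
        return location.startsWith(prefix);
    }

    @Override
    public String toString() {
        return location;
    }

    public static void main(String[] args) {
        StorageUri base = StorageUri.of("s3://bucket/table");
        StorageUri file =
            StorageUri.of("s3://bucket/table/data/\"_id\"_trunc=0/f.parquet");
        // Quotes in the path are no problem for string handling.
        System.out.println(file.isSubUriOf(base));
        System.out.println(file.scheme());
    }
}
```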

dimas-b commented May 3, 2024

@clintf1982 : This should be fixed in v. 0.81.0


dimas-b commented May 3, 2024

@clintf1982 if you have # in column names, please do not use GC in 0.81.0... there seems to be another latent bug... we're investigating.


dimas-b commented May 6, 2024

The problem with # and ? chars appears to be an Iceberg-side concern: apache/iceberg#10279
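
The # and ? cases are subtler than the quote case: java.net.URI parses such a location without error, but interprets the characters as fragment and query delimiters, silently truncating the path. A short demo (the path is a made-up example):

```java
import java.net.URI;

public class HashInPathDemo {
    public static void main(String[] args) {
        // '#' is legal to URI.create(), but it marks the start of the
        // fragment, so everything after it is dropped from the path.
        URI uri = URI.create("s3://bucket/data/a#col_trunc=0/file.parquet");
        System.out.println("path     = " + uri.getPath());
        System.out.println("fragment = " + uri.getFragment());
    }
}
```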

@clintf1982

@dimas-b Thanks a lot, I will check

@clintf1982

@dimas-b Currently we don't use # or ?, so PR10283 doesn't block us.
The work done here in GC 0.81.0 is much appreciated; I expect to be able to verify it on our data soon.
