
Fix estargz compression loses the original tar metadata #2352

Merged · 1 commit · Sep 24, 2021

Conversation

ktock (Collaborator) commented Sep 8, 2021

Currently, eStargz compression doesn't preserve the original tar metadata (header bytes and their order).
This causes TestGetRemote to fail, because an uncompressed blob converted from a gzip blob provides a different digest than the one converted from the eStargz blob, even when their original tars (computed by the differ) are the same.

This commit solves the issue by fixing the eStargz compressor to preserve the original tar metadata that eStargz modifies. The metadata is saved in the content store and used for decompression when the eStargz is converted into another compression type. (EDIT: fixed not to use the content store)

@@ -91,26 +105,14 @@ func getConverter(desc ocispecs.Descriptor, compressionType compression.Type) (c

 type conversion struct {
 	target compression.Type
-	decompress func(io.Reader) (cdcompression.DecompressReadCloser, error)
+	decompress func(context.Context, ocispecs.Descriptor) (io.ReadCloser, error)
Member commented on the diff:

change it to ReaderAt or ReadSeeker instead. We should be able to define conversions for all supported compressions as a single array, and then unit test them with random (tarball) data, making sure that compression is lossless.
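A property test along those lines could look roughly like the sketch below. It is a minimal illustration under assumed names (compressFn/decompressFn stand in for whichever conversions are under test), not the PR's actual test code:

```go
package convert_test

import (
	"bytes"
	"io"
	"testing"

	digest "github.com/opencontainers/go-digest"
)

// testLossless round-trips a tar blob through a compressor/decompressor
// pair and requires that the digest of the result matches the original.
func testLossless(t *testing.T, tarBlob []byte,
	compressFn func(io.Writer) io.WriteCloser,
	decompressFn func(io.Reader) (io.ReadCloser, error)) {

	orig := digest.FromBytes(tarBlob)

	// Compress the original tar bytes.
	buf := new(bytes.Buffer)
	w := compressFn(buf)
	if _, err := w.Write(tarBlob); err != nil {
		t.Fatal(err)
	}
	if err := w.Close(); err != nil {
		t.Fatal(err)
	}

	// Decompress and compare digests.
	r, err := decompressFn(buf)
	if err != nil {
		t.Fatal(err)
	}
	defer r.Close()
	back, err := io.ReadAll(r)
	if err != nil {
		t.Fatal(err)
	}
	if got := digest.FromBytes(back); got != orig {
		t.Fatalf("lossy conversion: got digest %s, want %s", got, orig)
	}
}
```

Run against random tarballs (ideally ones produced by GNU tar as well as Go's archive/tar), this would catch any conversion that drops header bytes or padding.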

Member commented:

was there an issue with this?

ktock (Collaborator, Author) replied:

> change it to ReaderAt or ReadSeeker instead.

IIUC, cdcompression.DecompressReadCloser is not seekable. So if we want to make it seekable, we need to change the current decompress implementation to store the decompressed blob somewhere seekable (e.g. the content store) and then return the ReadSeeker. Does this change SGTY?
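For context, buffering to something seekable could look like the sketch below. The in-memory buffer is illustrative (a content-store ingest would play the same role); the function name and placement are assumptions, not the PR's code:

```go
package convert

import (
	"bytes"
	"io"

	cdcompression "github.com/containerd/containerd/archive/compression"
)

// decompressToSeeker drains a non-seekable decompressed stream into a
// buffer so that callers can get an io.ReadSeeker over the result.
func decompressToSeeker(r io.Reader) (io.ReadSeeker, error) {
	dc, err := cdcompression.DecompressStream(r)
	if err != nil {
		return nil, err
	}
	defer dc.Close()
	buf, err := io.ReadAll(dc)
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(buf), nil
}
```

The trade-off is that the whole decompressed blob has to land somewhere (memory or disk) before the first Seek; note that the diff above keeps a plain io.ReadCloser instead.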

cache/estargz.go (outdated review thread, resolved)
ktock (Collaborator, Author) commented Sep 8, 2021

Fixed to use lossless compressor/decompressor of estargz (ktock/stargz-snapshotter@f5999ba).

tonistiigi (Member) commented:

> Fixed to use lossless compressor/decompressor of estargz

You don't seem to be using the tar-split parser there for the encode phase, so are you sure it works? It might work fine if the initial tarball was created by the (same version of) Go, so that the encoding in estargz and the original tar match. But try it with a tarball from GNU tar, for example. Tar allows padding etc., so there are many ways the same files could end up encoded into a tarball.

ktock (Collaborator, Author) commented Sep 8, 2021

@tonistiigi

> You don't seem to be using the tar-split parser there for the encode phase, so are you sure it works?

It just streams the original tar into the destination writer using io.TeeReader, adding gzip headers and compressing as gzip for each file (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L762), so it doesn't re-encode the tar headers.

> It might work fine if the initial tarball was created by the (same version of) Go, so that the encoding in estargz and the original tar match. But try it with a tarball from GNU tar, for example. Tar allows padding etc., so there are many ways the same files could end up encoded into a tarball.

It's tested against three tar formats (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/testutil.go#L621), but those use the Go library, so I'll also try with tools outside of Go.

tonistiigi (Member) commented:

> It just streams the original tar into the destination writer using io.TeeReader, adding gzip headers and compressing as gzip for each file (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L762), so it doesn't re-encode the tar headers.

Interesting. It looks like the gzip writer is switched in the parser function (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L804), but the actual write happens in the tee. I don't think this is safe: the tar parser may read more than the header, writing it all to the gzip writer, then return only some of it, leaving the rest in its internal buffer. The intention should be to gzip every record precisely with an individual gzip writer, but at the moment I don't think the tar blocks and gzip blocks match.
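To make the concern concrete, here is a simplified sketch of the pattern being discussed (not the estargz code itself; the input filename is illustrative):

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"
)

// archive/tar is free to read ahead of the 512-byte header block that
// Next() describes; anything it buffers internally has already been
// copied to the current gzip writer by the tee. Switching gzip writers
// after Next() therefore does not align gzip member boundaries with tar
// record boundaries.
func main() {
	src, err := os.Open("layer.tar") // illustrative input
	if err != nil {
		panic(err)
	}
	defer src.Close()

	gz := gzip.NewWriter(os.Stdout) // current per-record gzip writer
	tee := io.TeeReader(src, gz)    // every byte the parser reads is tee'd
	tr := tar.NewReader(tee)

	// Next may consume (and therefore tee) more than the header bytes.
	if _, err := tr.Next(); err != nil {
		panic(err)
	}
	// A gzip-writer switch here would not land exactly on the tar
	// record boundary if the parser read ahead.
	gz.Close()
}
```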

ktock (Collaborator, Author) commented Sep 8, 2021

> I don't think this is safe: the tar parser may read more than the header, writing it all to the gzip writer, then return only some of it, leaving the rest in its internal buffer. The intention should be to gzip every record precisely with an individual gzip writer, but at the moment I don't think the tar blocks and gzip blocks match.

Fixed to use tar-split (https://github.com/ktock/stargz-snapshotter/blob/aac1cee508136d135d55d9f51629bf9b493e048f/estargz/estargz.go#L817).
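For reference, the tar-split round trip that makes this exact is sketched below, using tar-split's public asm/storage API (an illustration of the library, not the estargz integration; the input filename is made up). Disassembly records the raw header and padding bytes, so the original stream can later be reassembled bit for bit:

```go
package main

import (
	"bytes"
	"io"
	"os"

	"github.com/vbatts/tar-split/tar/asm"
	"github.com/vbatts/tar-split/tar/storage"
)

func main() {
	src, err := os.Open("layer.tar") // illustrative input
	if err != nil {
		panic(err)
	}
	defer src.Close()

	// Disassemble: raw headers/padding go into JSON metadata, file
	// payloads into an in-memory putter.
	meta := new(bytes.Buffer)
	fp := storage.NewBufferFileGetPutter()
	its, err := asm.NewInputTarStream(src, storage.NewJSONPacker(meta), fp)
	if err != nil {
		panic(err)
	}
	// Draining its drives the disassembly; the stream is byte-identical
	// to the original tar (this is where estargz would compress records).
	if _, err := io.Copy(io.Discard, its); err != nil {
		panic(err)
	}

	// Reassemble the original tar, bit for bit, from metadata + payloads.
	ots := asm.NewOutputTarStream(fp, storage.NewJSONUnpacker(meta))
	defer ots.Close()
	if _, err := io.Copy(io.Discard, ots); err != nil {
		panic(err)
	}
}
```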

ktock marked this pull request as ready for review, September 10, 2021 11:29
cache/estargz.go (outdated):

```go
		return false
	}
	defer r.Close()
	_, _, err = estargz.OpenFooter(io.NewSectionReader(r, 0, r.Size()))
```
Member commented:

Could we leave a label on the blob after this has been determined the first time, and then avoid opening the blob after that?
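One way to do that is sketched below, using containerd content-store labels. The label key and helper are hypothetical (not what the PR ships); the point is only that the footer check runs once per blob:

```go
package cache

import (
	"context"

	"github.com/containerd/containerd/content"
	digest "github.com/opencontainers/go-digest"
)

// labelESGZ is a hypothetical label key for caching the check result.
const labelESGZ = "buildkit.io/compression/estargz"

// isESGZCached consults the blob's labels before falling back to the
// expensive footer check, then records the result on the blob.
func isESGZCached(ctx context.Context, cs content.Store, dgst digest.Digest,
	check func() (bool, error)) (bool, error) {

	info, err := cs.Info(ctx, dgst)
	if err != nil {
		return false, err
	}
	if v, ok := info.Labels[labelESGZ]; ok {
		return v == "true", nil
	}
	ok, err := check() // e.g. estargz.OpenFooter over the blob
	if err != nil {
		return false, err
	}
	if info.Labels == nil {
		info.Labels = map[string]string{}
	}
	info.Labels[labelESGZ] = "false"
	if ok {
		info.Labels[labelESGZ] = "true"
	}
	if _, err := cs.Update(ctx, info, "labels."+labelESGZ); err != nil {
		return false, err
	}
	return ok, nil
}
```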

```go
	require.NoError(t, err, testName)
}

// Check the uncompresed digest is the same as the original
```
Member commented:

uncompressed

```go
// Note that we don't support eStargz compression for a tar that contains a file named
// `stargz.index.json`, because we cannot create eStargz in a lossless way for such a blob
// (we would have to overwrite the stargz.index.json file).
if err := w.AppendTarLossLess(pr); err != nil {
```
Member commented:

Bit worried that other tools can still create unsupported tars. Can that other API be removed? What is the behavior when such tarballs are pulled? Maybe they should fail the isEstargz check?

ktock (Collaborator, Author) replied:

The cases where an unsupported tar (i.e. one that contains stargz.index.json) reaches here are the following:

  • invalid eStargz: the source blob is a broken (or malformed) eStargz that contains stargz.index.json but has an invalid footer. In this case, the blob is recognized as a normal gzip blob and decompressEStargz is not applied.
  • name conflict: the source blob is a non-eStargz blob (tar/gzip/zstd) that contains a stargz.index.json which is not a TOC.

(If the source blob is an eStargz, the decompressor (decompressEStargz) removes the TOC entry before appending, so AppendTarLossLess won't return the error.)

I think these cases are rare, so it's reasonable to return an error here.

Member replied:

Yes, I wasn't thinking of the error case, just that AppendTarLossLess is an additional API and AppendTar still remains, so other tools can create an eStargz that does not decompress properly. Can we make this a requirement for the stargz library? And what is the behavior when we pull such a blob that was not created with a lossless encoder?

ktock (Collaborator, Author) replied:

@tonistiigi

> other tools can create an eStargz that does not decompress properly.

If the client converts a plain-gzip image A into an eStargz B on their side with tools other than BuildKit, B doesn't hold the original bits, so it's treated as a different image than A when pulled into BuildKit. When BuildKit converts B into a tar image, the produced tar is different from the one contained in A.

> what is the behavior when we pull such a blob that was not created with a lossless encoder?

Could you elaborate on the problem with conversions happening outside of BuildKit? Tools other than BuildKit mostly need to change the tar bits to enable optimization, which sorts tar entries in priority order and adds some landmark dummy files. I think the BuildKit cache should treat such blobs as different from the original.

Member replied:

Is there any reason other tools couldn't also just use the lossless method? Why can't it be the default?

Is it still correct for isEstargz to detect a lossy eStargz as eStargz, or should it treat it as regular gzip and try to recompress it with the lossless method?

go.mod (outdated):

```
	gotest.tools/v3 v3.0.3 // indirect
)

require github.com/vbatts/tar-split v0.11.2 // indirect
```
Member commented:

this looks to be outside the require block

go.mod (outdated):

```
github.com/golang/protobuf v1.5.2
// snappy: updated for go1.17 support
```
Member commented:

Is this deliberate?

ktock (Collaborator, Author) replied:

`make vendor` automatically does this. Maybe we need a replace directive if we want to lock the snappy version.
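For illustration, such a pin would be a one-line replace directive in go.mod (the version shown is only an example, not a recommendation):

```
// Hypothetical pin; the exact version would be settled together with #2348.
replace github.com/golang/snappy => github.com/golang/snappy v0.0.3
```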

Member replied:

hmm. What's the difference?

ktock (Collaborator, Author) replied:

I believe this will be fixed by #2348.


ktock force-pushed the esgzcvt-preserve-tar branch 2 times, most recently from 971e364 to cb61cda, September 14, 2021
tonistiigi (Member) left a review:

Overall LGTM, but I think we need to wait for #2348 to sort out the go.mod issues, and rebase then.

tonistiigi (Member) commented:

@ktock needs rebase

Commit message:

Currently, eStargz compression doesn't preserve the original tar metadata
(header bytes and their order). This causes failure of `TestGetRemote` because
an uncompressed blob converted from a gzip blob provides a different digest
than the one converted from the eStargz blob, even if their original tars (computed
by the differ) are the same.
This commit solves the issue by fixing eStargz to preserve the original tar
metadata that eStargz modifies.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
tonistiigi (Member) commented:

@AkihiroSuda sgty?

crazy-max added this to the v0.10.0 milestone, Feb 4, 2022