Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tarball: Streaming image parser PoC #1429

Closed
wants to merge 10 commits into from
Closed

Conversation

abitrolly
Copy link
Contributor

@abitrolly abitrolly commented Aug 14, 2022

TarBuffered scans stream (io.Reader) once for filename and saves
unused sections in memory for later access. This should speedup
parsing a bit, because right now tarball is scanned several times,
and should save resources and speed for parsing well-formed images
from network.

Solves #1339.

@codecov-commenter
Copy link

codecov-commenter commented Aug 14, 2022

Codecov Report

Merging #1429 (7f2ecfc) into main (7196cf3) will increase coverage by 0.00%.
The diff coverage is 83.63%.

@@           Coverage Diff           @@
##             main    #1429   +/-   ##
=======================================
  Coverage   73.40%   73.40%           
=======================================
  Files         115      115           
  Lines        8757     8804   +47     
=======================================
+ Hits         6428     6463   +35     
- Misses       1688     1696    +8     
- Partials      641      645    +4     
Impacted Files Coverage Δ
pkg/v1/tarball/image.go 76.47% <83.33%> (-0.49%) ⬇️
pkg/v1/mutate/mutate.go 72.16% <100.00%> (+0.09%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -26,8 +26,10 @@ import (
"path"
"path/filepath"
"sync"
"unsafe"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a lib to to track the size of all data that is referenced by this struct? unsafe.Sizeof returns the value 40 regardless of how many megabytes are already stored in content and headers map.

type TarBuffered struct {
	tf      *tar.Reader
	pos     int
	headers map[int]*tar.Header
	content map[int][]byte
	EOF     bool
}

@abitrolly
Copy link
Contributor Author

abitrolly commented Aug 15, 2022

The code structure with partial.go, returning innards of Image without actual reference to Image makes it rather hard to debug the flow when export fails with file already closed errors.

I don't see there could be an efficient way to avoid high memory usage without restructuring the image format itself opencontainers/image-spec#936

@abitrolly
Copy link
Contributor Author

This doesn't pass export from stdin stream on my machine. Need to write that test.

@abitrolly
Copy link
Contributor Author

Added #1436 that tests stdin and stdout export is not broken.

pkg/crane/export.go Fixed Show fixed Hide fixed
TarBuffered scans stream (`io.Reader`) once for filename and saves
unused sections in memory for later access. This should speedup
parsing a bit, because right now tarball is scanned several times,
and should save resources and speed for parsing well-formed images
from network.

See google#1339.
Go subclassing with partials is rather hardcore here
@@ -220,6 +221,7 @@
// If a caller doesn't read the full contents, they should Close it to free up
// resources used during extraction.
func Extract(img v1.Image) io.ReadCloser {
logs.Debug.Printf("mutate: extract %T", img)

Check failure

Code scanning / CodeQL

Clear-text logging of sensitive information

Sensitive data returned by [an access to password](1) is logged here. Sensitive data returned by [an access to Password](2) is logged here. Sensitive data returned by [an access to Password](3) is logged here.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how dumping type of v1.Image gives access to AuthConfig.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how that could happen either, but I can tell you this debug logging will probably be very noisy and likely not very useful. If it's helpful while working on this change that's fine, but we should remove it before reviewing and merging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any tools to replace Debug prints while figuring out how Go code works? I am doing "pen and paper" reconstruction of the calls, which is not very effective.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I typically just rely on logging and pen and paper debugging, but I've heard good things about delve for interactive debugging, which is also integrated into things like VSCode.

@imjasonh
Copy link
Collaborator

The code structure with partial.go, returning innards of Image without actual reference to Image makes it rather hard to debug the flow when export fails with file already closed errors.

I don't see there could be an efficient way to avoid high memory usage without restructuring the image format itself opencontainers/image-spec#936

Yeah this may just be a limitation of the format. That format is unfortunately unlikely to change any time soon. Attempts have been made, the most successful of which is estargz which abuses the tar format to allow for random access reads. This wouldn't be helpful in your case though since you can't guarantee it will be in the estargz format.

Would it be reasonable to enforce some soft memory limit and fail if asked to buffer more than that during extraction? This could be configured with an env var maybe.

@abitrolly
Copy link
Contributor Author

estargz places TOC at the end, so it won't be effective indeed. But I see a way to make a linter out of this code to measure image processing overhead that can be converted to some FinOps metrics, so that people have a motivation to produce well-formed images to place those savings into their OKR reports.

@github-actions
Copy link

This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Keep fresh with the 'lifecycle/frozen' label.

@abitrolly
Copy link
Contributor Author

The thought that I badly need money that come with job keeps me away from completing this PR. Somewhat hard to process the code in a stressed state of mind, but I will try to complete this to put into resume or something.

In the meanwhile I found another contender for static network sync file format https://github.com/zchunk/zchunk which unlike estargz places header at the start. Maybe estargz can be modified too to place TOC at start for streaming access?

@github-actions
Copy link

This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Keep fresh with the 'lifecycle/frozen' label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants