WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache #2321

Draft
wants to merge 8 commits into
base: main
from

Conversation

mtrmac
Collaborator

@mtrmac mtrmac commented Feb 29, 2024

A single DiffID may map to multiple TOC digest values. Record that in BlobInfoCache, and use it for layer reuse.

Also prefer reusing even TOC-matched layers by DiffID, when available.

@giuseppe I’d appreciate a preliminary review of the new logic; see individual commits.

Draft: The BlobInfoCache implementations don’t actually store/record any data yet — so this is obviously completely untested.
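For illustration only, the many-to-one relationship described above can be sketched as a small in-memory map. This is a minimal sketch, not the c/image implementation: the type `tocCache`, the method name `RecordTOCUncompressedPair`, and the use of plain strings instead of `digest.Digest` are assumptions made to keep the example self-contained; only `UncompressedDigestForTOC` is named in this PR.

```go
package main

import "fmt"

// tocCache is a hypothetical in-memory stand-in for the part of BlobInfoCache
// discussed here. It is not the c/image implementation.
type tocCache struct {
	// uncompressedByTOC maps a TOC digest to the DiffID (uncompressed digest)
	// of the layer it describes. Several keys may share one value.
	uncompressedByTOC map[string]string
}

func newTOCCache() *tocCache {
	return &tocCache{uncompressedByTOC: map[string]string{}}
}

// RecordTOCUncompressedPair (hypothetical name) remembers that tocDigest
// describes a layer whose DiffID is uncompressed.
func (c *tocCache) RecordTOCUncompressedPair(tocDigest, uncompressed string) {
	c.uncompressedByTOC[tocDigest] = uncompressed
}

// UncompressedDigestForTOC returns the DiffID recorded for tocDigest,
// or "" if it is unknown.
func (c *tocCache) UncompressedDigestForTOC(tocDigest string) string {
	return c.uncompressedByTOC[tocDigest]
}

func main() {
	c := newTOCCache()
	// Two pushes of the same layer at different compression levels yield
	// two TOC digests but a single DiffID (placeholder digest strings).
	c.RecordTOCUncompressedPair("sha256:toc-level1", "sha256:diffid-alpine")
	c.RecordTOCUncompressedPair("sha256:toc-level9", "sha256:diffid-alpine")

	same := c.UncompressedDigestForTOC("sha256:toc-level1") ==
		c.UncompressedDigestForTOC("sha256:toc-level9")
	fmt.Println("both TOCs resolve to one DiffID:", same)
	fmt.Println("unknown TOC is empty:", c.UncompressedDigestForTOC("sha256:unknown") == "")
}
```

The lookup goes only from TOC digest to DiffID, which is why a plain map is enough for this direction.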

// UncompressedDigestForTOC returns an uncompressed digest corresponding to tocDigest.
// Returns "" if the uncompressed digest is unknown.
// FIXME: Does this need to record TOC/compression type?
UncompressedDigestForTOC(tocDigest digest.Digest) digest.Digest
Member

The TOC digest is the checksum of the uncompressed JSON document, so I think the compression should not matter in this case

Collaborator Author

I agree we probably don’t need that right now (with GetTOCDigest refusing to work on manifests which contain multiple TOC digest annotations, and presumably with the zstd / estargz code being unable to decompress the other one).

This comment is looking a bit more into the future, for lookups in the other direction, where we will want to look up (UncompressedDigest → (compressed digest, TOC digest, algorithm)) and match that against “the user wants the destination to contain zstd:chunked” (i.e. reject estargz matches).

Collaborator Author

> for lookups in the other direction,

That will be done in a separate data structure (an extension of RecordDigestCompressorName): we need the full set of annotations for reuse of a TOC-compressed blob, so this simple mapping is not sufficient anyway. And the other structure does record the algorithm.
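As a rough sketch of what such a reverse lookup could look like: the `candidate` struct, the `reverseCache` type, and `candidatesFor` below are all hypothetical names invented for illustration, and the real structure would also need to carry the full annotation set mentioned above. The sketch only shows the filtering step, i.e. rejecting estargz matches when zstd:chunked is wanted.

```go
package main

import "fmt"

// candidate is a hypothetical record of one known compressed form of a layer.
type candidate struct {
	CompressedDigest string
	TOCDigest        string // "" if the blob has no TOC
	Algorithm        string // e.g. "zstd:chunked" or "estargz"
}

// reverseCache (hypothetical) maps an uncompressed digest to every
// compressed form recorded for it.
type reverseCache map[string][]candidate

// candidatesFor returns the recorded forms of uncompressed that use
// wantAlgorithm, preserving the order in which they were recorded.
func (rc reverseCache) candidatesFor(uncompressed, wantAlgorithm string) []candidate {
	var out []candidate
	for _, c := range rc[uncompressed] {
		if c.Algorithm == wantAlgorithm {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Placeholder digest strings, not real values.
	rc := reverseCache{
		"sha256:diffid": {
			{CompressedDigest: "sha256:c1", TOCDigest: "sha256:t1", Algorithm: "zstd:chunked"},
			{CompressedDigest: "sha256:c2", TOCDigest: "sha256:t2", Algorithm: "estargz"},
		},
	}
	for _, c := range rc.candidatesFor("sha256:diffid", "zstd:chunked") {
		fmt.Println(c.CompressedDigest, c.TOCDigest, c.Algorithm)
	}
}
```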

Collaborator Author

@mtrmac mtrmac left a comment

Note to self: This is code-complete but I want to test it in practice.

Comment on lines +88 to +89
// (and we assume the TOC digest also uniquely identifies the contents, i.e. there aren’t two
// different formats/ways to parse a single TOC).
Collaborator Author

All of the c/storage+c/image code has been built around this assumption, but it is currently false (containers/storage#1888) and I’m not sure whether we need to revisit the design. Let’s discuss that in the c/storage issue.

Member

yes, this assumption is correct

@mtrmac mtrmac force-pushed the chunked-bic branch 2 times, most recently from 2a542f7 to 9e3cace Compare April 24, 2024 18:26
Should not change behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
The new code is not called, so it should not change behavior
(apart from extending the BoltDB/SQLite schema).

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
…storage by DiffID

If we can, prefer identifying layers by DiffID, because multiple TOCs can map to the
same DiffID; and because it maximizes reuse with non-TOC layers.

For now, the new situation is unreachable.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
We will add one more instance of this, so share the code.

Should not change behavior (it does remove one unreachable code path).

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
… is known

- Multiple TOC values might correspond to a single DiffID (e.g. if different
  compression levels are used); try to share them all, identified by DiffID
  (so that we also reuse with non-TOC pulls).
  - LayersByTOCDigest only uses a single TOC digest per layer; BlobInfoCache
    allows multiple matches, matches layers which have been since deleted,
    and potentially matches TOC digests which we have created by pushing
    but haven't pulled yet.
- On reuse, we can now use DiffID-based layer identities even if the reuse
  was TOC-driven.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
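The lookup order described in this commit message might be sketched as follows. This is only an illustration under stated assumptions: the `layer` and `store` types, `byDiffID`, `byTOC`, and `findReusableLayer` are hypothetical stand-ins (the real code works against c/storage, where LayersByTOCDigest is the TOC-based lookup), and the BlobInfoCache is reduced to a plain map from TOC digest to DiffID.

```go
package main

import "fmt"

// layer is a hypothetical, simplified view of a stored layer.
type layer struct {
	ID        string
	DiffID    string // uncompressed digest, "" if unknown
	TOCDigest string // "" for layers not pulled via a TOC
}

// store is a hypothetical stand-in for the layer store.
type store struct{ layers []layer }

func (s *store) byDiffID(diffID string) *layer {
	for i := range s.layers {
		if s.layers[i].DiffID != "" && s.layers[i].DiffID == diffID {
			return &s.layers[i]
		}
	}
	return nil
}

func (s *store) byTOC(tocDigest string) *layer {
	for i := range s.layers {
		if s.layers[i].TOCDigest != "" && s.layers[i].TOCDigest == tocDigest {
			return &s.layers[i]
		}
	}
	return nil
}

// findReusableLayer prefers a DiffID match (resolved through the TOC-to-DiffID
// cache) and falls back to a direct TOC match. The second result says which
// kind of match was used: "diffid", "toc", or "" if nothing matched.
func findReusableLayer(s *store, tocToDiffID map[string]string, tocDigest string) (*layer, string) {
	if diffID := tocToDiffID[tocDigest]; diffID != "" {
		if l := s.byDiffID(diffID); l != nil {
			return l, "diffid"
		}
	}
	if l := s.byTOC(tocDigest); l != nil {
		return l, "toc"
	}
	return nil, ""
}

func main() {
	// One layer, pulled earlier with the "level1" TOC (placeholder digests).
	s := &store{layers: []layer{
		{ID: "layer-1", DiffID: "sha256:diffid-alpine", TOCDigest: "sha256:toc-level1"},
	}}
	cache := map[string]string{
		"sha256:toc-level1": "sha256:diffid-alpine",
		"sha256:toc-level9": "sha256:diffid-alpine",
	}
	// A TOC-only lookup for level9 would miss; the DiffID lookup finds layer-1.
	l, how := findReusableLayer(s, cache, "sha256:toc-level9")
	fmt.Println(l != nil, how)
}
```

Trying the DiffID first is what lets a layer pulled with one TOC (or with no TOC at all) satisfy a later pull that names a different TOC for the same contents.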
…hole layer

This is similar to what putBlobToPendingFile does.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
…yers

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Collaborator Author

mtrmac commented Apr 25, 2024

To test:

Before:

# podman rmi alpine level1 level9
# rm -f /var/lib/containers/cache/blob-info-cache-v1.sqlite 
# podman pull quay.io/libpod/alpine
# podman --log-level=debug push --compression-format zstd:chunked --compression-level 1 --force-compression quay.io/libpod/alpine localhost:50000/level1
## Even better would be to use two different destination registries, to be 100% certain the blobs are not reused
## (right now they are not reused, but we’ll fix that):
# podman --log-level=debug push --compression-format zstd:chunked --compression-level 9 --force-compression quay.io/libpod/alpine localhost:50000/level9
## Note the compressed digest, and TOC digest, values:
# skopeo inspect --raw docker://localhost:50000/level1 | jq .
# skopeo inspect --raw docker://localhost:50000/level9 | jq .
## No DigestTOCUncompressedPairs entries:
# sqlite3 /var/lib/containers/cache/blob-info-cache-v1.sqlite .dump 
# podman rmi alpine level1 level9
## Triggers a partial pull: "Applying differ in …":
# podman --log-level=debug pull localhost:50000/level1
## Triggers a partial pull: "Applying differ in …"
# podman --log-level=debug pull localhost:50000/level9 
## level1 and level9 have different image IDs:
# podman images 
## Contains two copies of the layer, with the same expected-layer-diffid
# jq . < /var/lib/containers/storage/overlay-layers/layers.json

After:

  • DigestTOCUncompressedPairs contains 2 records
  • Pull of level1 triggers a partial pull (creating a layer with known TOC digest and uncompressed digest)
  • Pull of level9 reuses the layer (by BIC compressed -> uncompressed mapping)
  • FIXME: the layer is shared, but the image is not yet; the hasLayerPulledByTOC code path is wrong

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Labels
kind/feature A request for, or a PR adding, new functionality
2 participants