Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zstd:chunked blocker: TarSplitChecksumKey not used in a layer ID #1888

Open
mtrmac opened this issue Apr 13, 2024 · 6 comments
Open

zstd:chunked blocker: TarSplitChecksumKey not used in a layer ID #1888

mtrmac opened this issue Apr 13, 2024 · 6 comments

Comments

@mtrmac
Copy link
Collaborator

mtrmac commented Apr 13, 2024

@giuseppe Zstd:chunked has uses a TarSplitChecksumKey annotation, separate from the TOC digest.

That’s fine as far as per-image individual trust goes, because the annotation is authenticated by the manifest digest.

But we deduplicate individual layers by the TOC digest alone; so it is possible for two images to use a layer with the same TOC digest, but different tar-split locations and contents (e.g. one with an added set-UID bit somewhere).

Locally, it seems by far easiest to move the tar-split digest into the TOC. That way the by-now-widespread assumption that the TOC digest is a sufficient identifier can be maintained.

For other purposes, like the BlobInfoCache and metadata-only reuse, the value returned from toc.GetTOCDigest is used both for reuse/match lookups, so whatever this function returns must be a complete identifier; it should be unambiguous, and we can never add more annotations that change the semantics of a layer with a matching GetTOCDigest value. Even if we replaced GetTOCDigest now with something that accounts for tar-split, we need this guarantee to be forward-compatible (consider pulling an image just before a new annotation type is added, upgrading, and then pulling another just after). I think that also argues for a long-term design of keeping everything anchored to a single digest value.


A bit more generally, I’ve been wondering whether the TOC digest is a good identifier long-term.

The way things are now, we basically can’t significantly change the format of the TOC, otherwise we risk the possibility of the same TOC blob being valid for both the old and new formats, allowing an attacker to trigger deduplication between layers with the same TOC digest but different TOC formats and differently-understood contents.

For example, we just can’t support a new partial-pull format with some other JSON-based TOC format if it could be constructed ambiguously with the zstd:chunked TOC format. I’m not 100% sure this is worth worrying about — but if it were ever to happen, we would need a fairly large restructuring, or at least renaming/re-documenting. So I’m raising it right now, in case such things were expected. (E.g. will this be required for composeFS?)

@mtrmac mtrmac changed the title zstd:chunked TarSplitChecksumKey not used in a layer ID zstd:chunked blocker: TarSplitChecksumKey not used in a layer ID Apr 13, 2024
@mtrmac
Copy link
Collaborator Author

mtrmac commented Apr 13, 2024

Maybe one way to think about this is that I think it would be clean if #1887 were extended further, so that

  • (some future variant of) toc.GetTOCDigest returned a “complete” format+identity identifier (of a kind which is long-term stable across new added feature and new format variants)
  • that complete identifier were passed to chunked.GetDiffer, which just “executed” it without making any decision what format/variant to consume
  • that complete identifier were also used in everywhere we currently use “TOC digest”, i.e. in c/storage metadata/lookups, c/image BlobInfoCache as in WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache image#2321, and so on.

Yes, I realize it’s very late to change all this. But if we knew that we would need to do it in the near future, I think there’s a strong argument to do it ASAP.

@giuseppe
Copy link
Member

The way things are now, we basically can’t significantly change the format of the TOC, otherwise we risk the possibility of the same TOC blob being valid for both the old and new formats, allowing an attacker to trigger deduplication between layers with the same TOC digest but different TOC formats and differently-understood contents.

how can this happen? If we modify the TOC in any way, isn't safe to just assume the digests are different and they are treated as two different layers? A layer could have two variants with different TOCs, even today just by sorting the entries in a different way but they are treated as two different entities. Isn't this enough?

@mtrmac
Copy link
Collaborator Author

mtrmac commented Apr 15, 2024

Consider things like

{
   "version": 1,
   "entries": [/*v1 entries */]
   "someOtherVendorVersion": "2.0",
    "someOtherVendorEntries": [/*…*/]
}

A v1 consumer will see a valid v1, a sameOtherVendor consumer will see a valid v2. But they are the same blob = same TOC digest = right now, same c/storage layer.

And which consumer is used probably depends on MIME type and annotations in the manifest.

Or, alternatively, consider what if “other” version uses a non-JSON format where it happens to be possible to make a single file valid in both formats.

(To be fair, the existing "version" field is a fairly strong countermeasure if everything uses JSON. Maybe we can make that a requirement.)

@mtrmac
Copy link
Collaborator Author

mtrmac commented Apr 15, 2024

  • (some future variant of) toc.GetTOCDigest returned a “complete” format+identity identifier (of a kind which is long-term stable across new added feature and new format variants)

An unambiguous hash of (layer MIME type, all annotations) might work. Assuming manifests have per-layer MIME types and annotations :)

(We could also add the layer digest.) Of course the more values we add into the identifier, the fewer opportunities for reuse.

@giuseppe
Copy link
Member

Would it be enough to just prepend the format used to the hashed data? e.g. the TOC Digest can be sha256sum("zstd:chunked:$version" + $tocData), that should be enough to protect from the same TOC being used by different formats. What do you think?

In practice the metadata file used by estargz and zstd:chunked is almost the same, so stuff like chunk offset is represented in the same way. There is not a way that some metadata is used by one of the two formats and ignored by the other, so in principle these two cannot conflict.

@mtrmac
Copy link
Collaborator Author

mtrmac commented Apr 18, 2024

To be a bit more explicit, the primary thing that I need solving here is that TarSplitChecksumKey is not participating in the TOC digest. That can be done by hashing the annotations, or by moving TarSplitChecksumKey from the annotations into the TOC.


I agree the “ambiguous TOC format” concern is, short term, not urgent, because there is no ambiguity between estargz and zstd:chunked.

Quite reasonably we can just let things stay as they are, and decide to deal with that if/when a possibly-ambiguous TOC format is ever added.

OTOH if we can solve both the TarSplitChecksumKey and format ambiguity cheaply in one place, I wouldn’t mind doing that now.

I’m bringing this up primarily to benefit from your expertise about ComposeFS etc. — is a new TOC format something you expect?


WRT the specifics, sha256sum("zstd:chunked:$version" + $tocData), it would be more efficient if the identifier could be computed just from the manifest data, without having to read the TOC itself (extra I/O and extra roundtrips, + the TOC can be somewhat large). So sha256sum(unambiguousTuple("zstd:chunked", $version, $tocDigest)) would be better than raw $tocData. All that is assuming there are no other security-relevant annotations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants