
content aware chunking #508

Open · Jorropo opened this issue Nov 6, 2023 · 1 comment
Labels
kind/enhancement A net-new feature or improvement to an existing feature P3 Low: Not priority right now

Comments

Jorropo (Contributor) commented Nov 6, 2023

One hard thing with deduplication over WORM merkle-tree data structures like unixfs is that we are extremely limited when finding backrefs. Usually deduplication code is free to search for backrefs and fix up both the old and new data structures so that they point to a shared piece of data.
Here we can't; consider this case:

# file 1
| chunk1       |
| aaaabbbbcccc | 
# file 2
| chunk1 |
| bbbb   | 

Assuming file 1 already exists, we can't do any deduplication when chunking file 2.
We would have needed to chunk file 1 this way originally:

# file 1
|  c1  |  c2  |  c3  |
| aaaa | bbbb | cccc | 

However, when we are chunking file 1, we don't yet know what file 2's structure will be.


It would be nice to have a chunker that uses application-level knowledge to do so.
We could try to match known file formats and change the chunking strategy based on the file.
We should apply this logic recursively based on contextual hints.
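
A minimal sketch of what the format matching could look like, assuming a hypothetical dispatch layer in front of the existing splitters (the magic-byte checks are real, the surrounding names are illustrative, not an existing boxo API):

```go
package chunkaware

import (
	"bufio"
	"bytes"
)

// sniffFormat peeks at the stream's magic bytes to pick a chunking
// strategy, falling back to the generic chunker when nothing matches.
// Peek does not consume input, so the chosen strategy still sees the
// full stream. (Hypothetical helper, not an existing boxo API.)
func sniffFormat(r *bufio.Reader) string {
	head, _ := r.Peek(512)
	switch {
	// POSIX tar stores the "ustar" magic at offset 257.
	case len(head) >= 262 && bytes.Equal(head[257:262], []byte("ustar")):
		return "tar"
	// gzip members start with 0x1f 0x8b.
	case bytes.HasPrefix(head, []byte{0x1f, 0x8b}):
		return "gzip"
	// zip local file headers start with "PK\x03\x04".
	case bytes.HasPrefix(head, []byte("PK\x03\x04")):
		return "zip"
	default:
		return "generic"
	}
}
```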

Various strategies I can think of (mainly container formats):

  • tar, cpio, iso, overlay2fs (used by docker): chunk each file recursively, embed headers in the root blocks (see the sketch after this list).
    Deduplicates files inside and outside the archive, as well as partial updates where only some files in the archive changed.
  • car: chunk each block using raw leaves, embed headers in the root blocks.
    Deduplicates blocks inside and outside the car, and supports partial updates where only some blocks in the archive changed. Pinning a car should be the same as pinning the content of that car plus some metadata pixie dust.
  • video containers: chunk streams and sections recursively.
    Deduplicates alternative tracks like audio or captions, and deduplicates across splices (if the container allows low-level concatenation).
    This might require a custom video encoder which takes care of reusing bytes as-is in as many cases as possible to be really effective.
  • video codecs: chunk around keyframes and deltas.
    Avoids read amplification when seeking.
  • gzip, zstd (and other low-level concatenable compression streams): chunk around sections.
    Formats like stargz compress each file within an archive individually and then concatenate the resulting streams. This works because, for these formats, decompress(compress(a) + compress(b)) is equal to decompress(compress(a + b)).
    Deduplicates compressed files inside and outside the archive, as well as partial updates where only some files changed.
  • zip, jar: chunk compressed sections individually.
    Deduplicates partial updates where only some files in the archive changed.
  • ... ideas welcome
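
For the tar case, a rough sketch of the recursive walk, assuming hypothetical emitHeader and chunkFile callbacks (a real implementation would also have to preserve the raw header and padding bytes so the archive reassembles byte-for-byte):

```go
package chunkaware

import (
	"archive/tar"
	"io"
)

// chunkTar walks a tar stream, keeping each header for the root block
// and recursively chunking each file body with the same chunker used
// for standalone files, so identical files share leaves (and a root
// CID) inside and outside the archive. Hypothetical sketch, not a
// boxo API.
func chunkTar(r io.Reader, emitHeader func(*tar.Header) error, chunkFile func(io.Reader) error) error {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil // end of archive
		}
		if err != nil {
			return err // malformed tar: caller falls back to generic chunking
		}
		if err := emitHeader(hdr); err != nil {
			return err
		}
		if err := chunkFile(tr); err != nil {
			return err
		}
	}
}
```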

The chunker would need to be resistant and tolerant: if a file is incorrectly classified or does not follow the underlying format correctly, the chunker must still produce a valid unixfs file which, once deserialized, gives a byte-for-byte identical copy of the original input.
In other words, if someone tries to add an invalid tar file, the chunker must create a unixfs file which encodes this invalid tar file.
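
A minimal sketch of that fallback rule, with hypothetical chunkAsTar and chunkGeneric helpers (same assumed package as the sketches above):

```go
// chunkTolerant tries the format-aware strategy first; any parse error
// means the classification was wrong (or the file is malformed), so it
// restarts with the generic chunker. Either path must produce a unixfs
// file that decodes to the exact input bytes.
func chunkTolerant(data []byte) ([][]byte, error) {
	if chunks, err := chunkAsTar(bytes.NewReader(data)); err == nil {
		return chunks, nil
	}
	return chunkGeneric(bytes.NewReader(data))
}
```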


This solution is simpler than allowing arbitrary backrefs to other blocks because it ensures files don't grow by adding more backrefs. We don't want to find ourselves in a situation where a file is bigger to download from scratch than a non-de-duped alternative because we assumed a bunch of blocks would already be cached.
It also does not require synchronizing matches; we get deduplication without ever comparing old and new versions.


In this case we are not only considering where the leaves are placed but also how they are linked together.
If I add a tar, it's not good enough that it uses the same leaves as the file outside of the tar; we want pinning the tar to be equivalent to also pinning the file outside of the tar, so the root CID of the file must be linked from the tar's unixfs DAG.
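
Roughly the intended shape (illustrative only, not a concrete block layout):

```
tar root (unixfs dag)
├── raw leaf: tar header bytes
├── file root  ← same root CID as the file added outside the tar
└── raw leaf: padding / next header / trailer bytes
```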


The resulting files would be compatible with existing unixfs decoders and not require out-of-band file format knowledge; this is purely an improvement in how we convert from non-unixfs data to unixfs.
The goal is not to add a webm or tar codec which everyone would need to update to understand.

Jorropo added the P3 Low: Not priority right now and kind/enhancement A net-new feature or improvement to an existing feature labels on Nov 6, 2023
hacdias (Member) commented Dec 14, 2023

Related to ipfs/kubo#3604
