Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define less ambiguous UnixFS spec in TS types #265

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Gozala
Copy link

@Gozala Gozala commented Jan 31, 2022

I would like to include less ambiguous UnixFS specification coded as TS interface definition so that:

  1. It can capture more invariant than possible via provided protobuf schema.
  2. It is clear which fields are required in which node types.
  3. It is more clear how file shards, chunks & roots are represented (in terms of necessary fields and link types)

You may notice several @TODO comments as despite my best efforts I still have gaps in my understanding. I would love if you could help me fill those by providing inline comments and/or suggestions.

Despite my best effort, I may still be misunderstanding and incorrectly encoding / describing things, if so I would really appreciate comments and/or suggestions to correct those errors.

I would like to iterate on this document and eventually land this for everyone who may try to get a better understanding of the UnixFS format.

@Gozala
Copy link
Author

Gozala commented Jan 31, 2022

Got no permissions to request review so instead I'm tagging people by names @aschmahmann @achingbrain @rvagg @alanshaw @ribasushi

@achingbrain
Copy link
Member

I'm not sure introducing typescript into the spec is a good idea? We already have prose and protobuf as definition languages, adding a third just increases the probability of contradictions between them all.

It seems like the problem this is aiming to solve is that lots of fields can be set, most of them are optional and it's not clear which ones should be set and when, and what to do when they are or are not - can we not just make the clarifications by updating the body of the text & adding comments to the protobuf parts?

unixfs.ts Outdated Show resolved Hide resolved
unixfs.ts Outdated Show resolved Hide resolved
unixfs.ts Outdated Show resolved Hide resolved
unixfs.ts Outdated Show resolved Hide resolved
unixfs.ts Outdated
Comment on lines 386 to 389
/**
* Directory link SHOULD specify size of the entry.
*/
Tsize: uint64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and so it should be optional?

Suggested change
/**
* Directory link SHOULD specify size of the entry.
*/
Tsize: uint64
/**
* Directory link SHOULD specify size of the entry.
*/
Tsize?: uint64

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it should not, should it from what I can tell all the Dir links seem to report sizes and as far as I can tell that is size of the block they point at not the size of the content.

unixfs.ts Outdated
Comment on lines 310 to 314
* TODO: Have not came across this one either, I'm not sure how it supposed
* to be represented. If not used in practice maybe it sholud be marked
* deprecated.
*/
Symlink = 4,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you IPFS add a directory with a symlink in it you should get one of these in your DAG.

Check out dave.txt in https://ipfs.io/ipfs/QmUGzmqt6pcUnghcu68UtkaR8i2r9BGuKQFUNbR1yVTsH8

Screenshot 2022-02-01 at 14 59 04

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alanshaw! After digging through code and tests a bit more I got an impression that Symlinks just encode path to the target in a data. I'm still not sure if blockSizes is just js-unixfs normalization artifact or if they are expected to be encoded in the dag. I also came across the following test

https://github.com/ipfs/js-ipfs-unixfs/blob/master/packages/ipfs-unixfs/test/unixfs-format.spec.js#L439-L446

Which refers to @daviddias's comment
ipfs-inactive/js-ipfs-unixfs-engine#3 (comment)

In the closed issue which points to dead links.

So does anyone know what fliesize in Symlink should report ?

unixfs.ts Outdated
Comment on lines 275 to 277
export type DirectoryNode =
| FlatDirectoryLayout
| AdvancedDirectoryLayout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the same as DirectoryLayout?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, thanks for the catch

unixfs.ts Outdated
Comment on lines 43 to 48
/**
* Raw nodes MUST not have any links, yet empty `Links` list is expected.
* At the type level it is expressed as `never[]` which guarantees that
* no instatiation other than empty will satisfy this constraint
*/
Links: never[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct here?

Copy link
Author

@Gozala Gozala Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct here?

Comment about types is accurate, however I'm not sure if the claim itself that it should be empty is accurate. According to dag-pb spec Links should be present even if empty as per:
https://ipld.io/specs/codecs/dag-pb/spec/#constraints

Links:

  • must be present, even if empty; the binary form makes no distinction between an empty array and an omitted value, in the Data Model we always instantiate an array.
  • elements must be sorted in ascending order by their Name values, which are compared by bytes rather than as strings.
    • Names must be unique or be omitted.

If leaf nodes can actually have links, would be great to point that out and also elaborate and provide an example when / what to do with those.

As far as I understood if there is a data it's leaf node and has no links if it does have links it has no data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for this is to do with the ambiguity of present and not-present in the translation between PB and the data model—you just can't distinguish; so the spec makes it just one way, always present, so there is only one data model form for any given PB form, and not multiple data model forms that can be turned into the same PB bytes.

optionality here doesn't save PB bytes btw, it's the same either way, this is just about 1:1 between in-memory representation and over-the-wire bytes.

unixfs.ts Outdated
/**
* Logical representation of a file chunk.
*
* TODO: Clarify when this represenation is used as opposed to `FileChunk`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the leafType option of the importer and I believe unused unless you specify it. If you specify raw for leafType you'll get these type of nodes at the leaves of the DAG.

https://github.com/ipfs/js-ipfs-unixfs/tree/master/packages/ipfs-unixfs-importer#api

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aware of this, although I don't see corresponding option in ipfs CLI and generally don't understand pros/cons between choosing file vs raw for chunks.

Is one of them legacy and mostly for backwards compat ? Or is there legitimate reason to have both and one should / would choose one over the other ?

@Gozala
Copy link
Author

Gozala commented Feb 1, 2022

I'm not sure introducing typescript into the spec is a good idea? We already have prose and protobuf as definition languages, adding a third just increases the probability of contradictions between them all.

I think it's a lot easier to resolve these contradictions canonically here than hope that those will not arise in the implementations. If there was a language neutral way to specify this I would definitely prefer that over TS, but I don't believe one exists and I would much rather have something that:

  1. I can run program against to verify.
  2. Translate to my language.

My motivation for writing this is precisely protobufs inability to disambiguate, even after going through the both JS and Go implementations I'm not certain I've captured it all accurately. I don't think it's lack of my abilities at fault, I think things are just too vaguely specified.

It seems like the problem this is aiming to solve is that lots of fields can be set, most of them are optional and it's not clear which ones should be set and when, and what to do when they are or are not - can we not just make the clarifications by updating the body of the text & adding comments to the protobuf parts?

I don't think it's one or the other, I would like to do both. It's just in type notation you have a well defined structure and you can be a lot more precise. If it is decided this file should not live here, that's ok with me too, but I would still love to make sure it is (and therefor my interpretation of spec is) accurate. I have a full intention to incorporate this as prose as well.

Gozala and others added 2 commits February 1, 2022 11:18
unixfs.ts Show resolved Hide resolved
unixfs.ts Outdated Show resolved Hide resolved
unixfs.ts Outdated
/**
* Number of bytes in Data field
*/
filesize: uint64
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can symlinks have mtime / mode ? They also seem to have blocksizes

Are they supposed to or is that artifact of normalization no js-ipfs-unixfs side ?

image

unixfs.ts Outdated
/**
* @TODO
*/
export type MetadataNode = never
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can anyone please help clarify what these nodes should contain and what supposed to link to them ? Spec seems to talk about all the things that were considered and ultimately rejected, but it seems to fails to mention what the consensus is.

Is Metadata deprecated legacy in favor of mtime, mode or ?

@rvagg
Copy link
Member

rvagg commented Feb 2, 2022

As an FYI, @Jorropo identified an interesting quirk in go-ipfs's (more likely go-unixfs's) current leaf building which will leave an empty Name where it's not necessary: ipfs/kubo#8691

It's more of a hash consistency concern than anything, we have plenty of data in the wild with blank Name and we actively test all the dag-pb variations identified in the dag-pb spec (see the fixtures starting with "dagpb" @ https://github.com/ipld/codec-fixtures/tree/master/fixtures). It would be nice to bias toward not doing this where possible, so not locking in such an expectation in any of our specs.

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems mostly fine to me, I'm not a big fan of locking in Tsize or Name in the dag-pb forms, but I can see what you're trying to do is build types for cases where they really should exist, leaving the base PBNode and PBLink matching the dag-pb spec.
I'm not as well versed on the UnixFS portions, although I'd be willing to bet there's some surprises wrt optionality in particular if you went digging through existing data and historical code! But sometimes the best way to do that is lock down a spec and see who/what yelps (that's certainly been my experience!).

unixfs.ts Outdated Show resolved Hide resolved
unixfs.ts Outdated Show resolved Hide resolved
unixfs.ts Outdated Show resolved Hide resolved
@Gozala
Copy link
Author

Gozala commented Feb 4, 2022

Few things I'd discovered in the realm of unspecified dags

ipfs ls /ipfs/Qme1Cyu7ujqn3dRkRGmeTLpHJgbGHFjKmud48XK5W8qA6h
Error: only murmur3 supported as hash function


ipfs dag get /ipfs/Qme1Cyu7ujqn3dRkRGmeTLpHJgbGHFjKmud48XK5W8qA6h 
{"Data":{"/":{"bytes":"CAU"}},"Links":[]}

# this is what went into that { Type: Data.DataType.HAMTShard }

ipfs ls /ipfs/QmS73FCYEH2NaSypVDE8EzQsjXoGTa8Wxq1bC22LfYnbiR                                                                ✱
Error: hamt size should be a power of two

# This is what went into that { Type: Data.DataType.HAMTShard, hashType: 0x22 }

ipfs ls /ipfs/Qma5kEnM5fEKTXrFC5zXYRy5QG3hcMWopoFS7ijhxx19qc
{"Data":{"/":{"bytes":"CAUoIjCAAg"}},"Links":[]}

# This is what went into that { Type: Data.DataType.HAMTShard, hashType: 0x22, fanout: 256 }
# looks like leaving out bitfield is fine but not other two fields

For what it's worth js-unixfs will encode all of the above just fine.

@Gozala
Copy link
Author

Gozala commented Feb 4, 2022

Here is another fun example, illustrating that filesize can't be trusted and probably should not be even encoded as it could be derived from Data + blocksizes.

echo '{ "Data": "CAISC2hlbGxvIHdvcmxkGAU=", "Links": [] }' | ipfs object put --datafieldenc base6
ipfs files stat /ipfs/QmPivGwShRuUC9gB17RakWKhixsTU5PMvxBZjXfXxo4KcU                                                        ✱
QmPivGwShRuUC9gB17RakWKhixsTU5PMvxBZjXfXxo4KcU
Size: 5
CumulativeSize: 19
ChildBlocks: 0
Type: file
ipfs cat /ipfs/QmPivGwShRuUC9gB17RakWKhixsTU5PMvxBZjXfXxo4KcU | wc -c
11
ipfs cat /ipfs/QmPivGwShRuUC9gB17RakWKhixsTU5PMvxBZjXfXxo4KcU                                                               ✱
hello world

@rvagg
Copy link
Member

rvagg commented Feb 7, 2022

Maybe we should make this a TODO in GitHub rather than hiding it in code: https://github.com/ipfs/go-unixfs/blob/5e0e095d575b5b5fbf2fea386ea11ab737e4395e/unixfs.go#L260-L265

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants