Publish UnixFS specifications at specs.ipfs.tech #331

Jorropo · 2022-10-10T15:05:35Z

This needs some touch-up (table of contents, fixtures, ...) but I would like to gather feedback first.

lidel

Thank you @Jorropo! I know how time-consuming spelunking code and writing specs is, this effort is very appreciated ❤️

Did a first pass read and dropped some comments – mostly questions about things I did not know and trying to confirm I understood them right + flagging sections we should research or reorder / rephrase.

UNIXFSv1.md

lidel · 2022-12-02T19:09:40Z

UNIXFSv1.md

+
+- Node, Block
+  A node is a word from graph theory, this is the smallest unit present in the graph.
+  Due to how unixfs work, there is a 1 to 1 mapping between nodes and blocks.


HAMT node will be backed by more than one block, maybe rephrase?

Isn't a HAMT a concatenation of directories ?

I don't see why you couldn't reuse lower parts of the HAMTs (*assuming you find colliding hashes).

lidel · 2022-12-02T22:19:04Z

UNIXFSv1.md

+####### Link ordering
+
+The cannonical sorting order is lexicographical over the names.
+
+In theory there is no reason an encoder couldn't use an other ordering, however this lose some of it's meaning when mapped into most file systems today (most file systems consider directories are unordered-key-value objects).
+
+A decoder SHOULD if it can, preserve the order of the original files in however it consume thoses names.
+
+However when some implementation decode, modify then reencode some, the orignal links order fully lose it's meaning. (given that there is no way to indicate which sorting was used originally)


nit: rephrase this as clear MUST (always creating sorted data) and SHOULD (try parsing data in original order, if possible) for implementers.

I don't understand what you mean. @rvagg fixed the Go implementations after the linux2ipfs incident.

They support reading "unsorted" data, but they will re-sort the links lexicographically if you modify it.
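The behaviour described in this reply (accept links in whatever order they arrive, re-sort when the node is modified and re-encoded) could look roughly like the sketch below. `Link` and `sortLinks` are hypothetical names, not taken from any Go implementation, and comparing names byte by byte is an assumption about what "lexicographical" means here.

```go
package main

import (
	"fmt"
	"sort"
)

// Link is a hypothetical, simplified stand-in for a dag-pb link.
type Link struct {
	Name string
}

// sortLinks returns a copy of links ordered lexicographically by Name.
// Go compares strings byte by byte, which is the assumption made here.
func sortLinks(links []Link) []Link {
	out := make([]Link, len(links))
	copy(out, links)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	// Order as found on the wire; a decoder would keep it as-is.
	decoded := []Link{{"b.txt"}, {"a.txt"}, {"C.txt"}}
	// An encoder that modified the node would emit the sorted form.
	fmt.Println(sortLinks(decoded))
}
```

Returning a copy keeps the decoded order available to callers that want to preserve it.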

lidel · 2022-12-02T22:22:01Z

UNIXFSv1.md

+
+####### Path Resolution
+
+Pop the left most component of the path, and try to match it to one of the Name in Links.


Suggested change

Pop the left most component of the path, and try to match it to one of the Name in Links.

Pop the left most component of the path after the current root, and try to match it to one of the Name in Links.

In my internal representation the root has already been popped when downloading the root.

lidel · 2022-12-02T22:29:22Z

UNIXFSv1.md

+
+Pop the left most component of the path, and try to match it to one of the Name in Links.
+
+<!--TODO: check Kubo does this-->If you find a match you can then remember the CID. You MUST continue your search, however if you find a match again you MUST error.


Is search strategy left to implementers? I feel we should suggest something other than iterating over Links list that is "kinda expected to be sorted lexicographically, but may not be".

Knowing what Kubo and JS do will be useful.
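For discussion, here is a minimal sketch of the rule as the draft words it: scan all links, remember the first match, and fail on a second one. It follows the draft text only; whether Kubo or the JS implementations behave this way is exactly the open TODO. `Link`, `resolveComponent` and the error values are hypothetical names, and the CID is kept as a plain string for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Link is a hypothetical, simplified dag-pb link.
type Link struct {
	Name string
	Hash string
}

var (
	errNotFound  = errors.New("no link with that name")
	errDuplicate = errors.New("more than one link with that name")
)

// resolveComponent scans every link, remembers the first match and
// errors if the same name shows up a second time.
func resolveComponent(links []Link, name string) (string, error) {
	found := false
	var hash string
	for _, l := range links {
		if l.Name != name {
			continue
		}
		if found {
			return "", errDuplicate
		}
		found, hash = true, l.Hash
	}
	if !found {
		return "", errNotFound
	}
	return hash, nil
}

func main() {
	links := []Link{{"a", "cid-1"}, {"b", "cid-2"}, {"a", "cid-3"}}
	fmt.Println(resolveComponent(links, "b")) // unique match
	fmt.Println(resolveComponent(links, "a")) // duplicate name, error
}
```

A full scan costs the same whether or not the list is sorted, which is why it works on data that is "kinda expected to be sorted, but may not be".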

lidel · 2022-12-02T22:37:45Z

UNIXFSv1.md

+  optional string Name = 2;
+
+  // cumulative size of target object
+  optional uint64 Tsize = 3; // also known as dagsize


Is dagsize a thing, or just a more friendly label created for these specs?
Perhaps replacing dagsize and DagSize with "Tsize (DAG size)" would be clearer?

UNIXFSv1.md

marten-seemann · 2022-12-03T04:23:30Z

UNIXFSv1.md

+```
+
+The two different schemas plays together and it is important to understand their different effect,
+- `dag-pb` also named `PBNode` is the "outside" protobuf message, it is the first one you decode. It contain the list of links and some "opaque user data".


Suggested change

- `dag-pb` also named `PBNode` is the "outside" protobuf message, it is the first one you decode. It contain the list of links and some "opaque user data".

- `dag-pb` also named `PBNode` is the "outside" protobuf message, it is the first one you decode. It contains the list of links and some "opaque user data".

Piggybacking off of this, first bullet point suggestion:

- The `dag-pb` protobuf is the "outside" protobuf message; in other words, it is the first message decoded. This protobuf contains the list of links and some "opaque user data".

Also, as a noob reader, I wouldn't be clear what you mean by "opaque user data". Might be good to clarify this

Also, I'd suggest moving this callout (`dag-pb` also named `PBNode`) up to line 79, since that's the first time dag-pb is mentioned.

marten-seemann · 2022-12-03T04:25:21Z

UNIXFSv1.md

+- Symlink
+  This represent a POSIX Symlink.<!--TODO: Add link to POSIX spec.-->
+
+### Paths


Should this paragraph explain how this maps onto the protobufs?

It doesn't. (I attempted to convey this because of the level 2 ## Paths)

This attempts to explain how you should read `this/is/a/path` as `[]string{"this", "is", "a", "path"}`.
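A minimal sketch of that reading, assuming `/` is the only separator. `splitPath` is a hypothetical helper, and dropping empty components (leading, trailing or doubled `/`) is a choice of the sketch rather than something stated in the thread.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPath turns "this/is/a/path" into its components.
// Empty components are skipped; that is an assumption of this sketch.
func splitPath(p string) []string {
	var out []string
	for _, c := range strings.Split(p, "/") {
		if c != "" {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitPath("this/is/a/path"))
}
```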

marten-seemann · 2022-12-03T04:26:06Z

UNIXFSv1.md

+  // binary CID (with no multibase prefix) of the target object
+  optional bytes Hash = 1;
+
+  // UTF-8 string name


iiuc

Suggested change

// UTF-8 string name

// UTF-8 string name, used for pathing

Thx, but is it really UTF-8? Some current mainstream implementations do not prevent users from using arbitrary bytes in their file names (as long as they don't contain 0x2f).

(see utf8 war in dag-cbor ...)

Also, this is copied straight from the dag-pb spec and is not an authoritative section, so I don't think I should update this.

marten-seemann · 2022-12-03T04:26:36Z

UNIXFSv1.md

+}
+
+message PBNode {
+  // refs to other objects


Can you explain what links are used for in UnixFS?

It's kinda what the whole dag-pb section below is dedicated to.
This is just some protobuf definition, not meant to be self-documenting code.

UNIXFSv1.md

marten-seemann · 2022-12-03T04:36:58Z

UNIXFSv1.md

+`node.Data.Data` is some bitfield, ones indicates weather or not the links are part of this HAMT or leaves of the HAMT.
+The usage of this field is unknown given you can deduce the same information from the links names.
+
+###### Path resolution on HAMTs


With my very limited knowledge of HAMTs, I'm having trouble understanding the problem and how it is solved.

marten-seemann · 2022-12-03T04:38:53Z

UNIXFSv1.md

+  - Implementations encoding or decoding wire-representations must observe the following:
+    - An `mtime` structure with `FractionalNanoseconds` outside of the on-wire range `[1, 999999999]` is **not** valid. This includes a fractional value of `0`. Implementations encountering such values should consider the entire enclosing metadata block malformed and abort processing the corresponding DAG.
+    - The `mtime` structure is optional - its absence implies `unspecified`, rather than `0`
+    - For ergonomic reasons a surface API of an encoder must allow fractional 0 as input, while at the same time must ensure it is stripped from the final structure before encoding, satisfying the above constraints.


Why are you specifying the API here? This sounds like it makes sense in Go, but this might not apply to other languages.

I don't; I copied the metadata spec over, someone else already specified that.
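To make the quoted rules concrete, here is a sketch of how the wire constraint and the "ergonomic" encoder rule could fit together. `Mtime`, `validateDecoded` and `newMtime` are hypothetical names, and the pointer field is just one way to model an optional value; other languages would express this differently, which is the point of the question above.

```go
package main

import (
	"errors"
	"fmt"
)

// Mtime is a hypothetical in-memory form of the mtime structure.
// A nil FractionalNanoseconds means the field is absent on the wire.
type Mtime struct {
	Seconds               int64
	FractionalNanoseconds *uint32
}

var errBadFraction = errors.New("FractionalNanoseconds outside [1, 999999999]")

// validateDecoded checks a structure as read from the wire.
func validateDecoded(m *Mtime) error {
	if m == nil || m.FractionalNanoseconds == nil {
		return nil // an absent mtime or an absent fraction are both fine
	}
	if n := *m.FractionalNanoseconds; n < 1 || n > 999999999 {
		return errBadFraction
	}
	return nil
}

// newMtime is the encoder-facing constructor: it accepts a fraction of 0
// and strips it so the encoded structure never carries it.
func newMtime(seconds int64, nanos uint32) (*Mtime, error) {
	if nanos > 999999999 {
		return nil, errBadFraction
	}
	m := &Mtime{Seconds: seconds}
	if nanos != 0 {
		m.FractionalNanoseconds = &nanos
	}
	return m, nil
}

func main() {
	m, _ := newMtime(1670000000, 0)
	fmt.Println(m.FractionalNanoseconds == nil) // the zero fraction was stripped

	zero := uint32(0)
	fmt.Println(validateDecoded(&Mtime{Seconds: 1, FractionalNanoseconds: &zero}))
}
```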

aschmahmann

Glad this is moving along 🎉.

Spec documents where things unrelated to the spec (implementation details, alternatives considered, etc.) are intertwined with the spec such that they're hard to distinguish are painful to read. The UnixFS spec is confusing enough without these extra distractions, many of which came from the previous version of the spec, so let's drop them. If people want to keep them around then moving them somewhere separate (e.g. to an appendix) would be great.

UNIXFSv1.md

aschmahmann · 2022-12-04T03:10:57Z

UNIXFSv1.md

+
+A so called "block limit" is in place, we do not allow any single block to be bigger than 2MiB.
+
+Implementation SHOULD try to not emit 1MiB bigger blocks, but MUST decode blocks <= 2MiB.


This is probably not the right place to discuss this

Agreed, as mentioned above I don't think the UnixFS spec is the place to discuss the block limit. We can add an IPLD/block limit spec though and discuss there.

but 1 MB seems really small.

I have no preference for what the magic number is. However, every number is too small or too big for someone.

I've seen a crew of people who believe having a 100MB (or a non-existent) block limit would make data transfer in IPFS magically fast and that it's a major problem in IPFS data transfer today. This opinion is only valid if you're willing to concede that BitTorrent-v2 is slow (16kib chunks) and standard BitTorrent-v1 settings (256kib) are also slow. I have yet to meet someone who holds both views, but maybe I haven't talked to enough people 🤷.

Can talk about this in another issue. Interested parties may also want to read this thread https://discuss.ipfs.tech/t/supporting-large-ipld-blocks/15093.

cc @aschmahmann do you know where this comes from 🙃

@Stebalien would probably know more, but I think the TLDR is people like round numbers and are bad at math.

Longer version: kubo (formerly go-ipfs) had a 1MiB max size for UnixFS chunking.... but you could use 1MiB chunks with the extraneous original protobuf wrapping which bumps it over the limit so you need a new limit. People like round numbers so 2MiB.

Generally speaking people will be bad at math (off by one errors, forgetting about protobuf wrapping, miscounting link sizes/names for directories, ....) if you tell people to go for 1MiB and they mess up and go over a bit things will be fine. If you tell people 2MiB and they go over the limit things get tricky. People will say things like "Bitswap enforce slightly bellow 4MiB" (https://github.com/ipfs/specs/pull/331/files#r1038743285) which might convince someone their 2.1MiB block is fine... but that's only go-bitswap today, but some other services (e.g. web3.storage) hard limit you at 2MiB blocks and go-bitswap could reasonably make that change as well.

For those who insist that the block limit is a critical performance issue, see above.

That being said if people feel the SHOULD is unnecessary and just want the MUST, sure 🤷.

SHOULD in https://www.ietf.org/rfc/rfc2119.txt

This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

That sounds close to accurate to me, given that if you choose to start using the full 2MiB limit without being careful to not exceed the limit you'll run into interoperability problems.
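A sketch of how an encoder might apply the two thresholds from the draft text: treat 1 MiB as a soft target and 2 MiB as a hard stop. The constant names and `checkEncodedBlock` are hypothetical, and whether these numbers belong in the UnixFS spec at all is what this thread is debating.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	softLimit = 1 << 20 // 1 MiB: encoders SHOULD stay at or below this
	hardLimit = 2 << 20 // 2 MiB: decoders MUST accept blocks up to this size
)

var errTooBig = errors.New("block exceeds the 2 MiB limit")

// checkEncodedBlock returns an error for blocks over the hard limit and a
// warning flag for blocks that are legal but above the recommended size.
func checkEncodedBlock(block []byte) (warn bool, err error) {
	switch {
	case len(block) > hardLimit:
		return false, errTooBig
	case len(block) > softLimit:
		return true, nil
	}
	return false, nil
}

func main() {
	// A 1 MiB chunk plus a few bytes of protobuf wrapping lands just over
	// the soft limit but stays far below the hard one.
	warn, err := checkEncodedBlock(make([]byte, softLimit+14))
	fmt.Println(warn, err)
}
```

The gap between the two limits is what absorbs the off-by-a-bit mistakes described above.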

aschmahmann · 2022-12-04T03:12:17Z

UNIXFSv1.md

+	optional uint64 filesize = 3;
+	repeated uint64 blocksizes = 4;
+	optional uint64 hashType = 5;
+	optional uint64 fanout = 6;


I don't see the word fanout mentioned in the rest of this document

It's not used in kubo (kubo does emit it, but IMO that is a chunker implementation detail). I think we should remove this field and add a comment `// field 6 is reserved for backward compatibility and SHOULD NOT be emitted by implementations`:

$ rgrep -I fanout | grep "\.go:" | grep -v libp2p
vendor/github.com/ipfs/go-unixfsnode/data/unmarshal.go: fanout, n := protowire.ConsumeVarint(remaining)
vendor/github.com/ipfs/go-unixfsnode/data/unmarshal.go: qp.MapEntry(ma, Field__Fanout, qp.Int(int64(fanout)))
vendor/github.com/ipfs/go-unixfsnode/hamt/errors.go: // ErrNoFanoutField indicates the HAMT node's UnixFS structure lacked a fanout field, which is required
vendor/github.com/ipfs/go-unixfs/unixfs.go:func HAMTShardData(data []byte, fanout uint64, hashType uint64) ([]byte, error) {
vendor/github.com/ipfs/go-unixfs/unixfs.go: pbdata.Fanout = proto.Uint64(fanout)
vendor/github.com/ipfs/go-unixfs/unixfs.go:// Fanout gets fanout of format
vendor/github.com/ipfs/go-unixfs/pb/unixfs.pb.go: Fanout *uint64 `protobuf:"varint,6,opt,name=fanout" json:"fanout,omitempty"`

I know TSize is also never read, yet I still talk about it: it's tricky and easy to get wrong, and it is important to mention that it MUST NOT be used in offset computation.

It's not used in kubo

You sure? https://github.com/ipfs/go-unixfs/blob/707110f05dac4309bdcf581450881fb00f5bc578/hamt/hamt.go#L147-L149

Also https://github.com/ipfs/go-unixfsnode/blob/475ed658c35e67af7793da5a9dc86d57bde24fe3/hamt/util.go#L73-L74

You have to be more careful in the areas of the spec you assume are unused. This one is used in existing unixfs implementations and could be discovered with some cursory looking around in those repos. Your grepping even identified some lines to look at so I'm not sure how you reached the conclusion that the lines were unused and should not be emitted.

@aschmahmann ah mb, I should have used -iI, not -I. I'll check it out. I guess everything that I assumed to be 256 wide in the HAMT is actually variable-sized... Great, the HAMT is more complex than I thought!

I know TSize is also never read

I think you need to do more poking around at these areas of the spec where people are asking for clarification. This is incorrect as well. A bit of looking using the GitHub UI shows at the very least:

https://github.com/ipfs/kubo/blob/4d4841f41cdc2d797e87f7b62c230ee957513f94/core/commands/files.go

There may be more examples as well, but saying it's never read is incorrect.

@aschmahmann I don't see TSize nor DagSize nor Fanout in the last file you linked; it's using Filesize.

@aschmahmann I don't see TSize nor DagSize

@Jorropo https://github.com/ipfs/kubo/blob/4d4841f41cdc2d797e87f7b62c230ee957513f94/core/commands/files.go#L249

If you'd like to see it in action (commands using pwsh, but translate to whatever shell you want):

❯ ipfs name resolve /ipns/ipfs.io | %{ipfs files stat $_}
QmegA7HiEvLmyJgVcBxgZ2hjEp5YZ4aVxcjBdHcKvD2f73
Size: 0
CumulativeSize: 10776742
ChildBlocks: 16
Type: directory
❯ ipfs name resolve /ipns/ipfs.io | %{ipfs files stat $_/index.html}
Qmf5nTcgHNZ4jB29d4XN7JhPykZMpCGNQysEcXjtRYguwW
Size: 190713
CumulativeSize: 190727
ChildBlocks: 0
Type: file

You can take a look in go-merkledag (where dagpb nodes are defined) if you're trying to follow the code paths without using any tooling like an IDE or tracing execution paths.

Thx, I can't use my IDE effectively because everything is an interface.

aschmahmann · 2022-12-04T03:52:16Z

UNIXFSv1.md

+
+### SHOULD NOT names
+
+Thoses names SHOULD NOT<!--MUST NOT ? in future revisions--> be used:


Perhaps a bad idea/overkill, but it seems like a lot of the characters that are unfriendly could be marked as SHOULD NOT even if some implementations will let many of them through.

There's a whole slew of bad characters/path components mentioned in ipfs/kubo#1710, but basically we might be doing people some favors by gathering that information so that implementers don't have to go figure it out themselves until they get pressed to by their users. For example, most implementations should probably just error on path components with newline characters until their users start asking for support with some non-troll dataset.

UNIXFSv1.md

willscott · 2022-12-05T01:58:33Z

UNIXFSv1.md

+
+A directory node is a named collection of file.
+
+The minimum valid `PBNode.Data` field for a directory is (pseudo-json): `{"Type":"Directory"}`, other values are covered in Metadata.


In practice, directories seem generally to have some sizing information in their PBNode.Data - maybe worth a SHOULD section of recommended Data for these

Do you have an example CID where that's the case?
AFAICT Kubo always just sets PBNode.Data to unixfsEncode({"Type": "Directory"})

willscott · 2022-12-05T02:00:03Z

UNIXFSv1.md

+
+Currently symlinks are not followable, that mean implementations needs to return symlinks objects and fail if a consumer tries to follow it through.
+
+This is a SHOULD level, you probably wont break much things if you start following them.


It's unclear to me how the unixfs format, rather than the consumer, would be able to follow symlinks, since the path doesn't provide a CID destination and the destination context will not be reliably clear at the point of link decoding (e.g. across mount, etc.)

UNIXFSv1.md

thibmeu · 2022-12-05T15:09:17Z

UNIXFSv1.md

+
+### SHOULD NOT names
+
+Thoses names SHOULD NOT<!--MUST NOT ? in future revisions--> be used:


another consideration is to avoid using / in node names, as some tooling (gateway, fs) considers this as a directory separator.

So I was about to say that it's actually fine because we could use \/ or %2F, but \/ actually doesn't work on tmpfs and ext4 on Linux, so let's not allow that one. Nice catch!

Actually this is already covered with a MUST NOT before:

Components MUST NOT contain / unicode codepoints because else it would break the path into two components.
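As a sketch of how the MUST NOT and a softer SHOULD NOT could sit side by side in a validator: `checkComponent` is a hypothetical helper, and flagging newlines is only the example floated earlier in this review, not an agreed rule or the draft's actual list of discouraged names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errSlash = errors.New(`component contains "/"`)

// checkComponent enforces the MUST NOT on "/" and reports, as a soft
// warning, a newline in the name. Treating "\n" as discouraged is only
// an illustration taken from the review discussion.
func checkComponent(name string) (discouraged bool, err error) {
	if strings.Contains(name, "/") {
		return false, errSlash
	}
	return strings.ContainsAny(name, "\n"), nil
}

func main() {
	fmt.Println(checkComponent("index.html"))
	fmt.Println(checkComponent("a/b"))
	fmt.Println(checkComponent("line\nbreak"))
}
```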

UNIXFSv1.md

thibmeu · 2022-12-05T15:14:45Z

UNIXFSv1.md

+
+They never have any childs, and thus are also known as single block files.
+
+Their size (both `dagsize` and `blocksize`) is the length of the block body.


this was previously referred to as TSize

UNIXFSv1.md

thibmeu · 2022-12-05T15:16:40Z

UNIXFSv1.md

+```
+
+3. Profit
+Assuming we stored this block in some implementation of our choice which makes it accessible to our client, we can try to decode it:


some implementation of our choice of what?
you might want to refer to kubo directly, as an IPFS implementation with a datastore that persists data locally, and implements UnixFS.

I didn't want to include poking into flatfs or creating a CAR file as part of the example, so I used "Assuming we stored this block in some implementation of our choice" as a magic exercise-left-to-the-reader sentence.

I can add `echo -n "test" | ipfs block put` if you want.

ElPaisano

Made some initial writing suggestions here, please lmk if it's helpful @Jorropo. I didn't get to everything. I can make another pass shortly. I know this is a work-in-progress.

UNIXFSv1.md

ElPaisano · 2022-12-06T16:34:04Z

UNIXFSv1.md

+
+### IPLD `dag-pb`
+
+A very important other spec for unixfs is the [`dag-pb`](https://ipld.io/specs/codecs/dag-pb/spec/) IPLD spec:


Phrasing suggestion:

The IPLD [`dag-pb`](https://ipld.io/specs/codecs/dag-pb/spec/) spec (also known as `PBNode`) is also used by unixfs, and is represented by the following protobuf:

I suggested moving the callout that dag-pb is also called PBNode up here, since it's the first time the reader encounters the term dag-pb in this doc.

ElPaisano · 2022-12-06T16:37:52Z

UNIXFSv1.md

+}
+```
+
+The two different schemas plays together and it is important to understand their different effect,


Phrasing suggestion:

Each protobuf schema plays a different role in unixfs. These differences are described below:

ElPaisano · 2022-12-06T16:52:31Z

UNIXFSv1.md

+```
+
+The two different schemas plays together and it is important to understand their different effect,
+- `dag-pb` also named `PBNode` is the "outside" protobuf message, it is the first one you decode. It contain the list of links and some "opaque user data".


Piggybacking off of this, first bullet point suggestion:

- The `dag-pb` protobuf is the "outside" protobuf message; in other words, it is the first message decoded. This protobuf contains the list of links and some "opaque user data".

Also, as a noob reader, I wouldn't be clear what you mean by "opaque user data". Might be good to clarify this

ElPaisano · 2022-12-06T16:55:17Z

UNIXFSv1.md

+```
+
+The two different schemas plays together and it is important to understand their different effect,
+- `dag-pb` also named `PBNode` is the "outside" protobuf message, it is the first one you decode. It contain the list of links and some "opaque user data".


Also, I'd suggest moving this callout (`dag-pb` also named `PBNode`) up to line 79, since that's the first time dag-pb is mentioned.

ElPaisano · 2022-12-06T20:21:22Z

UNIXFSv1.md

+They are always of type file.
+
+They can be recognised because their CIDs have `Raw` codec.
+
+The file content is purely the block body.
+
+They never have any childs, and thus are also known as single block files.
+
+Their size (both `dagsize` and `blocksize`) is the length of the block body.


Suggestion: use bullet points, phrasing and grammar

- They are always of type `file`.
- Their CIDs have a `Raw` codec.
- The file content is the block body.
- They never have any children nodes, and thus are also known as single block files.
- Both the `dagsize` and `blocksize` fields specify the length of the block body.

Their CIDs have a Raw codec.

I don't agree with this sentence, here I understand Raw codecs in CIDs as a property of Raw nodes while it's an implication.

Both the dagsize and blocksize fields specify the length of the block body.

I'm not a native English speaker, so I need a check on this: when I read this I understand it the wrong way around. I understand that dagsize and blocksize → length of the block body.
While in reality it is dagsize and blocksize ← length of the block body.

UNIXFSv1.md

ElPaisano · 2022-12-06T20:28:21Z

UNIXFSv1.md

+
+####### The sister-lists `PBNode.Links` and `decodeMessage(PBNode.Data).blocksizes`
+
+The sister-lists are the key point of why `dag-pb` is important for files.


A few thoughts here:

format sister-list using italics since it's a term i.e.

_sister-list_

Include the example from your comment https://github.com/ipfs/specs/pull/331/files#r1038744725 below this sentence so the reader knows what a sister-list is

Suggestion:

_Sister-lists_ are a key reason why `dag-pb` is important for files. In the following example, the `PBNode.Links` and `PBNode.Data.blocksizes` slices are sisters. This means that they must have the same length and each map to the same entity at the same index. In other words, instead of having two lists of properties, we have a single list of properties where some of the properties are stored in `PBNode.Links[n]`, and other properties of the same object are stored in `PBNode.Data.blocksizes[n]`.

type PBNode struct {
	Links []struct{
		tsize uint64
		hash cid.Cid
	}
	Data struct{
		blocksizes []uint64
	}
}

ElPaisano · 2022-12-06T20:36:31Z

UNIXFSv1.md

+This allows us to concatenate smaller files together.
+
+Linked files would be loaded recursively with the same process following a DFS (Depth-First-Search) order.
+


Suggestion:

_Sister-lists_ allow us to concatenate smaller files together. Otherwise, linked files would be loaded recursively during concatenation following Depth-First-Search order.

I might not be understanding what you're trying to convey here, lmk

The sister-lists are relevant to ranging. DFS is the default mode when you don't do ranging (such as when fetching the complete file, or when your range covers multiple blocks in the file).
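To illustrate the ranging case, here is a sketch that uses the two sister-lists to find which child holds a given byte offset without fetching any child. `FileNode`, `Link` and `childForOffset` are hypothetical names; `Blocksizes` stands for the `blocksizes` list decoded from `PBNode.Data`. Note that `Tsize` plays no part in the computation.

```go
package main

import (
	"errors"
	"fmt"
)

// Link and FileNode are hypothetical, simplified shapes: Links mirrors
// PBNode.Links and Blocksizes mirrors the blocksizes list in PBNode.Data.
type Link struct {
	Hash string
}

type FileNode struct {
	Links      []Link
	Blocksizes []uint64
}

var (
	errMismatch   = errors.New("Links and blocksizes differ in length")
	errOutOfRange = errors.New("offset is past the end of the file")
)

// childForOffset returns the index of the child holding byte `offset`
// and the offset relative to the start of that child.
func childForOffset(n FileNode, offset uint64) (int, uint64, error) {
	if len(n.Links) != len(n.Blocksizes) {
		return 0, 0, errMismatch
	}
	for i, size := range n.Blocksizes {
		if offset < size {
			return i, offset, nil
		}
		offset -= size
	}
	return 0, 0, errOutOfRange
}

func main() {
	n := FileNode{
		Links:      []Link{{"cid-a"}, {"cid-b"}, {"cid-c"}},
		Blocksizes: []uint64{100, 50, 25},
	}
	i, rel, err := childForOffset(n, 120)
	fmt.Println(i, rel, err) // 1 20 <nil>
}
```

A reader that wants the whole file can ignore `Blocksizes` and simply walk `Links` in order, which is the DFS mode mentioned above.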

2color · 2023-01-12T18:16:32Z

UNIXFS.md

+- JavaScript
+  - Data Formats - [unixfs](https://github.com/ipfs/js-ipfs-unixfs)
+  - Importer - [unixfs-importer](https://github.com/ipfs/js-ipfs-unixfs-importer)
+  - Exporter - [unixfs-exporter](https://github.com/ipfs/js-ipfs-unixfs-exporter)


Should we also add
https://github.com/ipld/js-unixfs

What about https://github.com/web3-storage/fast-unixfs-exporter

I just learned of these in the last couple of days and from what I understand they are actively maintained and used by DAGHouse

https://github.com/ipld/js-unixfs ✅ - yes we use this for encoding UnixFS DAGs now
https://github.com/web3-storage/fast-unixfs-exporter ❌ - temporary fork now deprecated

BigLep · 2023-01-21T05:55:28Z

@ElPaisano: are you able to review, particularly with regard to grammar/organization?

John-LittleBearLabs · 2023-02-23T18:18:58Z

UNIXFS.md

+It MUST be murmur3-x64-64 (multihash `0x22`).
+- `node.Data.Data` is some bitfield, ones indicates whether or not the links are part of this HAMT or leaves of the HAMT.
+The usage of this field is unknown given you can deduce the same information from the links names.
+- `node.Data.fanout` MUST be a power of two. This encode the number of hash permutations that will be used on each resolution step.


Is there any reason someone would choose a power of 2 that's not a power of 4? Just thinking about encoding into hex, where there's 4 bits to a digit.
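One way to look at the question: the sketch below checks the power-of-two rule and shows how many bits and hex digits a given fanout would need. `prefixWidth` is a hypothetical helper, and rounding up to whole hex digits (four bits each) is an assumption of the sketch, not a statement about what existing implementations do.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
)

var errFanout = errors.New("fanout must be a non-zero power of two")

// prefixWidth returns log2(fanout) and the number of hex digits needed to
// write a value of that many bits. The digit count is this sketch's own
// rounding, four bits per digit.
func prefixWidth(fanout uint64) (bitsWide, hexDigits int, err error) {
	if fanout == 0 || fanout&(fanout-1) != 0 {
		return 0, 0, errFanout
	}
	bitsWide = bits.TrailingZeros64(fanout)
	hexDigits = (bitsWide + 3) / 4
	return bitsWide, hexDigits, nil
}

func main() {
	fmt.Println(prefixWidth(256)) // 8 bits fill two hex digits exactly
	fmt.Println(prefixWidth(32))  // 5 bits still need two hex digits
	fmt.Println(prefixWidth(100)) // not a power of two
}
```

Under that rounding, a fanout whose bit width is not a multiple of four leaves part of the last hex digit unused, which may be the practical cost the question is getting at.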

John-LittleBearLabs · 2023-02-23T18:22:32Z

UNIXFS.md

+Thoses nodes are also sometimes called sharded directories, they allow to split directories into many blocks when they are so big that they don't fit into one single block anymore.
+
+- `node.Data.hashType` indicates a multihash function to use to digest path components used for sharding.
+It MUST be murmur3-x64-64 (multihash `0x22`).


Do we want to have backward compatibility? It would seem the most prominent implementations are only taking the first 64 bits of the digest into account.
Maybe replace "lowest" in step 2 down on line 259 with a detailed description of how to pull out the correct bits from the middle of the 128-bit digest, in which case you could be backward-compatible with existing data and still allow the longer digest going forward (for large fanout and/or many, many, many entries).

src/architecture/unixfs.md

ElPaisano

Minor cosmetic and writing things from my read through.

src/architecture/unixfs.md

lidel

@hacdias this is a quick pass with some pending notes I had (some need to be verified)

lidel · 2023-10-04T12:12:44Z

src/architecture/unixfs.md

TODOs/asks from my old notes:

Add a section about inlining. A "rule of thumb" ("SHOULD") around inlining small blocks should be part of the spec, including a maximum block size that makes sense to inline (e.g. 32)

Add section with "Test Vectors".

List CIDs of empty directories and zero-length files (with and without raw leaves, with and without inlining)

include/mention vectors from https://github.com/ipld/codec-fixtures/

include/mention vectors from http://ipld.io/specs/transport/trustless-pathing/fixtures/unixfs_20m_variety/

include/mention vectors from https://ipld.io/specs/codecs/dag-pb/fixtures/cross-codec/

lidel · 2023-10-04T12:19:04Z

UNIXFS.md

-[CID]: https://docs.ipfs.io/guides/concepts/cid/
-[Bitswap]: https://github.com/ipfs/specs/blob/master/BITSWAP.md
-[MFS]: https://docs.ipfs.io/guides/concepts/mfs/
+Moved to https://specs.ipfs.tech/architecture/unixfs/


I think we should move this to top level, to avoid breaking links in the future.

We can still adjust tags in front matter to move it between categories, but the permalink is more future-proof this way:

Suggested change

Moved to https://specs.ipfs.tech/architecture/unixfs/

Moved to https://specs.ipfs.tech/unixfs-data-format/

lidel · 2023-10-04T12:24:00Z

src/architecture/unixfs.md

+##### `decode(PBNode.Data).filesize`
+
+If present, this field MUST be equal to the `Blocksize` computation above.
+Otherwise, this file is invalid.


TODO: incorporate comment by @ribasushi from https://www.notion.so/HTTP-Gateway-Requests-for-Graphs-as-CARs-001d2a9f5a35418bb0fb7d9d182d24ec

.Data.Filesize ( field 3 ) is mandatory for types 0 and 2. It is marked as “optional” in the PB is because of the other types ( dir, etc ). When this is unspecified, it defaults to 0, and file becomes a zero-length, thus no bytes.
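A sketch of a check that combines the draft rule with the quoted note. `FileData`, `computedSize` and `checkFilesize` are hypothetical names. Two assumptions are baked in and should be verified before this informs the spec text: that the computed size is the inline `Data` length plus the sum of `blocksizes`, and that an absent `filesize` on a node with non-zero content should be reported as a mismatch rather than silently read as an empty file.

```go
package main

import (
	"errors"
	"fmt"
)

// FileData is a hypothetical decoded form of PBNode.Data for a file node.
// Filesize is a pointer so that "absent" and "0" can be told apart.
type FileData struct {
	Data       []byte
	Filesize   *uint64
	Blocksizes []uint64
}

var errFilesize = errors.New("filesize does not match the computed size")

// computedSize adds the inline bytes to the sizes of all children.
// Counting the inline Data bytes is an assumption of this sketch.
func computedSize(d FileData) uint64 {
	total := uint64(len(d.Data))
	for _, s := range d.Blocksizes {
		total += s
	}
	return total
}

// checkFilesize treats an absent filesize as 0, following the quoted note,
// and reports any disagreement with the computed size.
func checkFilesize(d FileData) error {
	var declared uint64
	if d.Filesize != nil {
		declared = *d.Filesize
	}
	if declared != computedSize(d) {
		return errFilesize
	}
	return nil
}

func main() {
	size := uint64(150)
	ok := FileData{Filesize: &size, Blocksizes: []uint64{100, 50}}
	fmt.Println(checkFilesize(ok))

	missing := FileData{Blocksizes: []uint64{100, 50}}
	fmt.Println(checkFilesize(missing))
}
```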

lidel · 2023-10-04T12:36:16Z

src/architecture/unixfs.md

+<!--TODO: check that this is true-->
+There is no failure mode known for this field, so your implementation should be
+able to decode nodes where this field is wrong (not the value you expect), or 
+partially or completely missing. This also allows smarter encoders to give a
+more accurate picture (Don't count duplicate blocks, etc.).


let's make things even more clear: paid pinning services should not trust the Tsize in client's DAGs is correct:

Suggested change



There is no failure mode known for this field, so your implementation should be able to decode nodes where this field is wrong (not the value you expect), or partially or completely missing. This also allows smarter encoders to give a more accurate picture (Don't count duplicate blocks, etc.).

:::warning

An implementation SHOULD NOT assume the `Tsize` values are correct. The value is only a hint that provides a performance optimization for better UX.

Following the [Robustness Principle](https://specs.ipfs.tech/architecture/principles/#robustness), an implementation SHOULD be able to decode nodes where the `Tsize` field is wrong (not matching the sizes of sub-DAGs), or partially or completely missing.

When total data size is needed for important purposes such as accounting, billing, and cost estimation, the `Tsize` SHOULD NOT be used, and instead a full DAG walk SHOULD be performed.

:::
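To show why the hint cannot be trusted for accounting, here is a sketch of a full walk over a tiny in-memory DAG whose `Tsize` values are deliberately wrong. `Store`, `Node`, `Link` and `walkSize` are hypothetical; CIDs are plain strings, and counting a repeated block once is a choice of the sketch (a billing system might decide otherwise).

```go
package main

import "fmt"

// Link and Node are hypothetical in-memory shapes; Store maps a CID
// (kept as a plain string here) to its node.
type Link struct {
	Cid   string
	Tsize uint64 // hint supplied by whoever built the DAG
}

type Node struct {
	Block []byte // encoded block bytes
	Links []Link
}

type Store map[string]Node

// walkSize sums the byte length of every distinct block reachable from
// root. Counting a repeated block once is a choice made by this sketch.
func walkSize(s Store, root string) uint64 {
	seen := map[string]bool{}
	var visit func(cid string) uint64
	visit = func(cid string) uint64 {
		if seen[cid] {
			return 0
		}
		seen[cid] = true
		n := s[cid]
		total := uint64(len(n.Block))
		for _, l := range n.Links {
			total += visit(l.Cid)
		}
		return total
	}
	return visit(root)
}

func main() {
	s := Store{
		"root": {Block: make([]byte, 10), Links: []Link{{"leaf", 1}, {"leaf", 1}}},
		"leaf": {Block: make([]byte, 1000)},
	}
	fmt.Println(walkSize(s, "root")) // 1010, although the Tsize hints add up to 2
}
```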

lidel · 2023-10-04T12:39:12Z

src/architecture/unixfs.md

+An example of where this could be useful is as a hint to smart download clients.
+For example, if you are downloading a file concurrently from two sources that have
+radically different speeds, it would probably be more efficient to download bigger
+links from the fastest source, and smaller ones from the slowest source.


Suggested change

An example of where this could be useful is as a hint to smart download clients. For example, if you are downloading a file concurrently from two sources that have radically different speeds, it would probably be more efficient to download bigger links from the fastest source, and smaller ones from the slowest source.

:::note

Examples of where `Tsize` is useful:

- User interfaces, where total size of a DAG needs to be displayed immediately, without having to do the full DAG walk.
- Smart download clients, downloading a file concurrently from two sources that have radically different speeds. It is more efficient to download bigger links from the fastest source, and smaller ones from the slowest source.

:::

lidel · 2023-10-04T12:41:04Z

src/architecture/unixfs.md

+
+- The file content is purely the block body.
+- They never have any children nodes, and thus are also known as single block files.
+- Their size (both `dagsize` and `blocksize`) is the length of the block body.


Do we need to invent a new thing and name it dagsize here? iiuc it is Tsize, right?

Suggested change

- Their size (both `dagsize` and `blocksize`) is the length of the block body.

- Their size is the length of the block body (`Tsize` in parent is equal to `blocksize`).

lidel · 2023-10-04T12:53:03Z

src/architecture/unixfs.md

+### `TSize` / `DagSize`
+
+This is an optional field in `PBNode.Links[]`. It **does not** represent any
+meaningful information of the underlying structure, and there is no known
+usage of it to this day, although some implementations omit these.
+
+To compute the `DagSize` of a node, which is stored in the parents, sum the length of the `dag-pb` outside message binary length and the `blocksizes` of all child files.


TODO: this needs cleanup, people get Tsize wrong all the time, and the text here is not very clear:

the DagSize (name, field) does not exist and was invented only for this spec, right @Jorropo ?

"no known usage" feels wrong: Tsize is used all over the place, in places like MFS and in every GUI that exists for showing sizes of UnixFS files without traversing entire dag each time, including directory listing at all HTTP Gateways.

Would below be more clear?

Suggested change

### `TSize` / `DagSize`

This is an optional field in `PBNode.Links[]`. It **does not** represent any meaningful information of the underlying structure, and there is no known usage of it to this day, although some implementations omit these.

To compute the `DagSize` of a node, which is stored in the parents, sum the length of the `dag-pb` outside message binary length and the `blocksizes` of all child files.

### `Tsize` (child DAG size hint)

`Tsize` is an optional field in `PBNode.Links[]` which represents the precomputed size of the specific child DAG. It provides a performance optimization: a hint about the total size of the child DAG can be read without having to fetch any child nodes.

To compute the `Tsize` of a child DAG, sum the binary length of the outer `dag-pb` message and the `blocksizes` of all nodes in the child DAG.
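One possible reading of the proposed wording, sketched bottom-up: the value stored in a link is the encoded length of the child block plus the cumulative size of everything below it. `Node` and `cumulativeSize` are hypothetical, and this reading should be checked against what the Go and JS encoders actually emit before it goes into the spec.

```go
package main

import "fmt"

// Node is a hypothetical in-memory DAG node: Block holds the encoded
// bytes of this node and Children its already-built child nodes.
type Node struct {
	Block    []byte
	Children []*Node
}

// cumulativeSize returns the length of this block plus the cumulative
// size of every child. Under the reading sketched here, a parent would
// store this value in the Tsize of the link that points at n.
func cumulativeSize(n *Node) uint64 {
	total := uint64(len(n.Block))
	for _, c := range n.Children {
		total += cumulativeSize(c)
	}
	return total
}

func main() {
	leafA := &Node{Block: make([]byte, 100)}
	leafB := &Node{Block: make([]byte, 50)}
	parent := &Node{Block: make([]byte, 20), Children: []*Node{leafA, leafB}}
	fmt.Println(cumulativeSize(parent)) // 170
}
```

The `ipfs files stat` output quoted earlier in this thread appears consistent with this reading for a single-block file: `CumulativeSize` (190727) exceeds `Size` (190713) by a few bytes of wrapping.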

lidel · 2023-10-04T13:00:42Z

src/architecture/unixfs.md

+      name: Protocol Labs
+      url: https://protocol.ai/
+
+tags: ['architecture']


TODO: I think we may want to move this to formats category before this PR is merged.

willscott · 2023-10-12T17:19:04Z

src/architecture/unixfs.md

+
+In UnixFS, a node can be encoded using two different multicodecs, listed below. More details are provided in the following sections:
+
+- `raw` (`0x55`), which are single block :ref[Files].


the partial ranges of data in multi-block files are also encoded as raw nodes, right?

willscott · 2023-10-12T17:22:22Z

src/architecture/unixfs.md

+
+### Metadata
+
+UnixFS currently supports two optional metadata fields.


what is 'supports' - what expectations are there around generation / parsing of these? - we have implementations that don't encode mode fully

BigLep · 2023-10-17T17:07:44Z

2023-10-17 maintainer conversation: this needs someone to comb through the comments. This is a nice-to-have before Istanbul and a must-have before end of year nucleation.

bumblefudge · 2024-01-17T02:58:29Z

2023-10-17 maintainer conversation: this needs someone to comb through the comments. This is a nice-to-have before Istanbul and a must-have before end of year nucleation.

Invite me to the next maintainer meeting and I can maybe scrum it a tiny bit? Feels like a PR that's blocking other also-important PRs.

lidel · 2024-02-15T23:48:20Z

src/architecture/unixfs.md

+The field `Name` of an element of `PBNode.Links` for a HAMT starts with an
+uppercase hex-encoded prefix, which is `log2(fanout)` bits wide.
+
+##### Path Resolution


TODO: cross-check HAMT specifics with Go and JS:

https://github.com/ipfs/helia/blob/31cdfa8b8990acc9c99a55dd2c078d0c415055ea/packages/unixfs/src/commands/utils/dir-sharded.ts#L213C1-L225C1

https://github.com/ipfs/helia/blob/2eff2c2c4d3eedf83d3b6cd6fce928f29aa60a5a/packages/unixfs/src/commands/utils/find-shard-cid.ts#L9-L17

https://github.com/ipfs/js-ipfs-unixfs/blob/4749d9a7c1eddd86b8fc42c3fa47f88c7b1b75ae/packages/ipfs-unixfs-importer/src/dir-sharded.ts#L169-L177
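As a reference point for that cross-check, here is a minimal sketch of the link-name prefix as the quoted text describes it. The function names are hypothetical, and two points are assumptions rather than confirmed behaviour of either implementation: that `fanout` is a power of two, and that the `log2(fanout)`-bit index is rounded up to whole hex digits and zero-padded.

```python
def prefix_hex_digits(fanout: int) -> int:
    # Assumption: fanout is a power of two, so log2(fanout) is an integer.
    if fanout < 2 or fanout & (fanout - 1):
        raise ValueError("fanout must be a power of two >= 2")
    bits = fanout.bit_length() - 1  # log2(fanout)
    return (bits + 3) // 4  # round up to whole hex digits


def hamt_link_name(fanout: int, bucket_index: int, entry_name: str) -> str:
    # Uppercase, zero-padded hex prefix followed by the entry name.
    if not 0 <= bucket_index < fanout:
        raise ValueError("bucket index out of range")
    width = prefix_hex_digits(fanout)
    return format(bucket_index, "0{}X".format(width)) + entry_name


example = hamt_link_name(256, 10, "file.txt")
```

With a fanout of 256 the prefix is two hex digits under these assumptions; whether Go and JS agree on padding for other fanouts is exactly what the TODO should confirm.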

Jorropo self-assigned this Oct 10, 2022

lidel mentioned this pull request Oct 13, 2022

(DRAFT) Proposed updates to UnixFS section ipfs/ipfs-docs#1297

Closed

BigLep mentioned this pull request Nov 11, 2022

Update UnixFS specification #316

Open

Jorropo force-pushed the unixfs branch from c960bbf to 9ae1573 Compare December 2, 2022 17:56

Jorropo marked this pull request as ready for review December 2, 2022 17:57

Jorropo force-pushed the unixfs branch 2 times, most recently from 53dbc07 to 64c86d3 Compare December 2, 2022 17:59

lidel reviewed Dec 2, 2022

marten-seemann reviewed Dec 3, 2022

Jorropo commented Dec 3, 2022

aschmahmann requested changes Dec 4, 2022

willscott reviewed Dec 5, 2022

willscott reviewed Dec 5, 2022

thibmeu reviewed Dec 5, 2022

ElPaisano reviewed Dec 6, 2022

lidel mentioned this pull request Dec 19, 2022

raw block response fails hash verification n0-computer/beetle#98

Closed

Jorropo force-pushed the unixfs branch from 64c86d3 to 6c250e6 Compare January 3, 2023 11:07

Jorropo requested review from lidel, aschmahmann and ElPaisano and removed request for ElPaisano January 3, 2023 11:07

Jorropo mentioned this pull request Jan 5, 2023

fix: update soft block limit to 2MiB ipfs/kubo#8968

Open

ajnavarro requested review from ajnavarro and removed request for ElPaisano January 9, 2023 16:10

2color reviewed Jan 12, 2023

lidel mentioned this pull request Jan 12, 2023

Create IPIP: UnixFS and Gateway support for explicit MIME/Content-Type #364

Open

7 tasks

John-LittleBearLabs reviewed Feb 23, 2023

lidel mentioned this pull request Mar 30, 2023

Gateway/UnixFS: specify/unify symlink handling #368

Open

Jorropo commented Sep 21, 2023

hacdias force-pushed the unixfs branch from cc0658d to a9a65f3 Compare October 3, 2023 11:18

hacdias self-assigned this Oct 3, 2023

hacdias force-pushed the unixfs branch 2 times, most recently from f3a406d to c9d0296 Compare October 3, 2023 11:44

hacdias changed the title ~~Unixfs Reboot~~ Publish UnixFS specifications at specs.ipfs.tech Oct 3, 2023

ElPaisano reviewed Oct 3, 2023

hacdias force-pushed the unixfs branch from 3db1968 to 5eb6c3e Compare October 4, 2023 07:40

lidel requested changes Oct 4, 2023

willscott reviewed Oct 12, 2023

BigLep assigned lidel and hacdias and unassigned hacdias and Jorropo Oct 17, 2023

hacdias and others added 5 commits October 30, 2023 11:01

chore: move UNIXFS.md (preserve history)

1fff418

chore: add UNIXFS.md to link to new website

86b93cf

docs: Write UNIXFSv1 spec

97abffc

chore: editorial fixes

c4e812a

chore: further editorial changes

d2d9f67

hacdias force-pushed the unixfs branch from 5eb6c3e to d2d9f67 Compare October 30, 2023 10:01

chore: apply @ElPaisano suggestions

e2cf0af

BigLep mentioned this pull request Nov 9, 2023

Release 0.24 ipfs/kubo#10043

Closed

11 tasks

lidel mentioned this pull request Nov 27, 2023

Notes and Recommendations for Browser Implementers ipfs/in-web-browsers#213

Open

This was referenced Jan 11, 2024

[wip] Web Pathing Specification: initial outline with TODOs #453

Draft

gateway: run Unicode Normalisation Forms on path gateway inputs #457

Open

lidel reviewed Feb 15, 2024

mishmosh mentioned this pull request Apr 3, 2024

Improve data onboarding speed: ipfs add and ipfs dag import|export ipfs/kubo#10383

Open

3 tasks

lidel mentioned this pull request May 17, 2024

Support storing UnixFS 1.5 Mode and ModTime ipfs/kubo#7754

Open

19 tasks


		####### Path Resolution

		Pop the left most component of the path, and try to match it to one of the Name in Links.

	Pop the left most component of the path, and try to match it to one of the Name in Links.
	Pop the left most component of the path after the current root, and try to match it to one of the Name in Links.


		Pop the left most component of the path, and try to match it to one of the Name in Links.

		<!--TODO: check Kubo does this-->If you find a match you can then remember the CID. You MUST continue your search, however if you find a match again you MUST error.
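The matching rule quoted above can be sketched as follows. This is only an illustration of the draft wording (keep scanning after the first match, and error on a second one); `resolve_component` and the `(name, cid)` tuple shape are hypothetical, and, as the TODO notes, it is not confirmed that Kubo behaves this way.

```python
from typing import List, Tuple


def resolve_component(links: List[Tuple[str, str]], component: str) -> str:
    # links is a hypothetical list of (Name, CID) pairs taken from PBNode.Links.
    match = None
    for name, cid in links:
        if name != component:
            continue
        if match is not None:
            # A second link with the same Name makes the lookup ambiguous.
            raise ValueError("duplicate link name: " + component)
        match = cid  # remember the CID, but keep scanning
    if match is None:
        raise KeyError(component)
    return match


found = resolve_component([("a", "cid-a"), ("b", "cid-b")], "b")
```

Scanning the whole list costs one pass per path component but means a directory with duplicate names is never resolved silently.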

	- `dag-pb` also named `PBNode` is the "outside" protobuf message, it is the first one you decode. It contain the list of links and some "opaque user data".
	- `dag-pb` also named `PBNode` is the "outside" protobuf message, it is the first one you decode. It contains the list of links and some "opaque user data".


		A so called "block limit" is in place, we do not allow any single block to be bigger than 2MiB.

		Implementation SHOULD try to not emit 1MiB bigger blocks, but MUST decode blocks <= 2MiB.
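A minimal sketch of the two thresholds in the quoted text, assuming the limits are measured on the full encoded block. The constant and function names are hypothetical.

```python
SOFT_BLOCK_LIMIT = 1 * 1024 * 1024  # 1 MiB: encoders SHOULD stay at or below this
HARD_BLOCK_LIMIT = 2 * 1024 * 1024  # 2 MiB: decoders MUST handle blocks up to this


def should_emit(block: bytes) -> bool:
    # Encoder side: avoid producing blocks bigger than the soft limit.
    return len(block) <= SOFT_BLOCK_LIMIT


def must_decode(block: bytes) -> bool:
    # Decoder side: any block within the hard limit has to be accepted.
    return len(block) <= HARD_BLOCK_LIMIT


oversized_for_encoder = bytes(SOFT_BLOCK_LIMIT + 1)
```

The gap between the two limits leaves room for blocks written by older or differently configured encoders.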


		### SHOULD NOT names

		Thoses names SHOULD NOT<!--MUST NOT ? in future revisions--> be used:


		A directory node is a named collection of file.

		The minimum valid `PBNode.Data` field for a directory is (pseudo-json): `{"Type":"Directory"}`, other values are covered in Metadata.


		Currently symlinks are not followable, that mean implementations needs to return symlinks objects and fail if a consumer tries to follow it through.

		This is a SHOULD level, you probably wont break much things if you start following them.


		They never have any childs, and thus are also known as single block files.

		Their size (both `dagsize` and `blocksize`) is the length of the block body.


		### IPLD `dag-pb`

		A very important other spec for unixfs is the [`dag-pb`](https://ipld.io/specs/codecs/dag-pb/spec/) IPLD spec:


		####### The sister-lists `PBNode.Links` and `decodeMessage(PBNode.Data).blocksizes`

		The sister-lists are the key point of why `dag-pb` is important for files.

		This allows us to concatenate smaller files together.

		Linked files would be loaded recursively with the same process following a DFS (Depth-First-Search) order.
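The depth-first concatenation described above can be sketched like this. `FileNode` and `read_file` are hypothetical names, and the sketch assumes that any inline bytes of a node come before the bytes of its children and that each `blocksizes` entry is the byte length of the matching child's content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileNode:
    # Hypothetical decoded file node: inline bytes plus the two sister lists.
    data: bytes = b""
    links: List["FileNode"] = field(default_factory=list)
    blocksizes: List[int] = field(default_factory=list)


def read_file(node: FileNode) -> bytes:
    # The sister lists must line up one to one.
    if len(node.links) != len(node.blocksizes):
        raise ValueError("Links and blocksizes differ in length")
    out = bytearray(node.data)
    # Depth-first: fully read each child, in link order, before the next.
    for child, expected in zip(node.links, node.blocksizes):
        chunk = read_file(child)
        if len(chunk) != expected:
            raise ValueError("blocksize does not match child content")
        out += chunk
    return bytes(out)


root = FileNode(links=[FileNode(b"hello "), FileNode(b"world")], blocksizes=[6, 5])
content = read_file(root)
```

Because each `blocksizes` entry records a child's length, a reader that only needs a byte range can skip whole children without fetching them.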

	Moved to https://specs.ipfs.tech/architecture/unixfs/
	Moved to https://specs.ipfs.tech/unixfs-data-format/
