Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

relaxed reification #27

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

relaxed reification #27

wants to merge 2 commits into from

Conversation

willscott
Copy link
Collaborator

If a node is not an immediate pbnode, but can be built as such, then reifiy appropriately.

This allows a dag-pb node to have been serialized to cbor or json, and still be interpreted as unixfs data.

If a node is not an immediate pbnode, but can be built as such, then reifiy appropriately.

This allows a dag-pb node to have been serialized to cbor or json, and still be interpreted as unixfs data.
Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whacky, but neat

what's the use-case for this, and the car convert method? I'm intrigued!

@willscott
Copy link
Collaborator Author

I needed both of these for ipld/ipld#184 - if i want to represent blocks of data in dag-json format for specs, but still demonstrate a partial ADL'd traversal over them i needed car convert to re-link the data in the new format, and this one for the unixfs adl interpretAs to not be a no-op

Comment on lines +20 to +22
// see if the node has the right structure anyway
pbb := dagpb.Type.PBNode.NewBuilder()
if err := pbb.AssignNode(maybePBNodeRoot); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willscott this will this change the behavior of Reify such that it's no longer spec compliant? i.e. UnixFSv1 is only defined over DagPB nodes, will this allow me to create DagCBOR objects that will render as long as they are convertable into DagPB (i.e. they contain the fields in https://ipld.io/specs/codecs/dag-pb/spec/#serial-format)?

If so, you may want to define some LooseReify or something that can be used for testing and some conversion work but is called out as non spec-compliant.

I may have misinterpreted how this works though and if so please correct me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used in a selector where the requester explicitly signals interpretAs: unixfs. If the data at the point they signal fits the data structure of unixfs (e.g. a data field in a struct that can itself be unmarshaled with dagpb) that seems like a pretty reasonable thing to do, rather than block on the concrete implementation.

Note that this isn't DagPB the serialization, but rather DagPB the concrete schema types. I could have valid dagpb data, but if it's been re-built into a basicnode wrapper, e.g. due to a WalkTransforming or other ipld internal process, you would also have this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so this effectively is a UnixFS spec change. It means that any applications that rely on UnixFS behavior, and this package, (e.g. ipfs cat, ipfs.io/ipfs/bafyfoobar, ...) will have their behavior changed to be more relaxed.

I don't think modifying a spec implicitly for the purposes of testing is a particularly good idea. IIUC we ran into this problem last year where originally /ipfs/bafydagcbor/myunixfsfield was not supposed to resolve but did due to lax implementation which as a result of users building structures like this has effectively forced us to be stuck with this and cause problems for our UnixFS upgrade path.

Would it be problematic to just define a new ADL "UnixFSLoose" and we can use that without requiring a UnixFS spec change (which you could propose, but might take some time to accumulate and integrate feedback).


It seems like maintaining the UnixFS spec within the go-ipld-prime context and ADLs might start to get more complicated since UnixFS is only defined over DagPB.

Should we be considering a UnixFS spec change that redefines them in terms of either ADL "interfaces" or data model structures going forward here? Mostly I'd like us to not make spec changes in code without ecosystem involvement of the various people that depend on the behavior of UnixFS (i.e. a huge portion of the IPFS ecosystem).

I was thinking this might be a bit complicated and so a new ADL would allow us to side-step this.

cc @mvdan @rvagg @ribasushi in case you have any thoughts or advice here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the spec you link does not specify that the outer structure that the unixfs protobuf is encoded within (e.g a struct with a 'data' and 'links' field) needs to be serialized using the dag pb codec. If i took that same data and happened to transfer it over the wire in a cbor codec, would it stop being unixfs data?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willscott it's a little obscure but the spec does mandate pb-in-pb:
https://github.com/ipfs/specs/blame/master/UNIXFS.md#L81

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even that can be interpreted with some flexibility: it specifies that the outer container of the data follows the merkeldag protocol buffer structure of 'Data' and 'Links', but does not express that it must be encoded using the 'dagpb' codec vs other codecs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This data is serialized and placed inside the 'Data' field of the outer merkledag protobuf, which also contains the actual links to the child nodes of this object.

"Placed inside of the outer merkledag protobuf" seems pretty clear that it's a protobuf, not to mention all of the existing tooling that insists it's a protobuf.

Basically, I keep seeing perceived ambiguity in our specs cause implementation issues (dag-cbor in /ipfs, Bitswap protobuf ambiguity, /api/v0 on gateways, etc.). I am strongly opposed to changing specs without changing the specs, especially when a written spec already exists.

If we want to change the specs then let's open a spec PR and figure it out (changing specs is fine, but doing so in some Go repo is IMO not), otherwise I've been told that multiple ADLs are supposed to be allowed to exist so why not just make a new one?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

➕ on above - no spec changes without spec changes please

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can continue in ipfs/specs#271

Copy link
Collaborator

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like @aschmahmann has requested changes.

My first response is somewhat similar -- while this may not be be breaking the DagPB spec cause it's not a serialization, I still feel like it's a change to UnixFSv1 itself to accept non-protobuf formats. https://github.com/ipfs/specs/blob/master/UNIXFS.md

I can't decide whether I care, though if as this comment says, it's all for the spec tests, well, it seems like a not great reason to introduce potentially unexpected behavior, and further evidence that the spec tests are non-ideal to work with.

Anyway, net net, I think I'm fine either way?

@willscott
Copy link
Collaborator Author

@aschmahmann is my response sufficient to address your concerns?

Copy link
Member

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very enthusiastic about this, and it seems directionally correct to me.

There's no doubt in my mind that if we re-did unixfsv1, today, with everything we know now, it would be fully codec agnostic -- like this code is now becoming. The discussion about the history and what the unixfs specs say is true and also valid, but I'm definitely in favor of expanding those definitions. As this code proves, it is both possible and well defined, and also in fact easy to do.

It may or may not be worth adding explicit conditional branches in downstream tools that want to be strict about dag-pb codecs for some reason. (I don't have a very strong opinion about this. I don't immediately think of any reasons I would want to do that, as a downstream tool maintainer, but I certainly haven't thought about it long enough to rule anything out, either.) This seems to me like something worth discussing in this repo's changelog, but then leaving to downstream choice.

The utility of this for sheer testing and fixtures and communicatability when used with e.g. dag-json, as already discussed in earlier comments, is just impeccable. 🤩

@ribasushi
Copy link

if we re-did unixfsv1, today, with everything we know now, it would be fully codec agnostic

@warpfork please see ipfs/specs#271 (comment)

@rvagg
Copy link
Member

rvagg commented Apr 19, 2022

Given the unwillingness to change unixfs at a spec level, are there alternative approaches we can take here to support the kind of testing that this is attempting to enable?

@aschmahmann
Copy link
Contributor

are there alternative approaches we can take here to support the kind of testing that this is attempting to enable?

A couple I can think of at the moment are:

  • Have a UnixFS-loose ADL that does this and enables testing (and if it turns out to be more widely useful it'll be easier to motivate a spec change)
  • See if there's somewhere else to hang the hook for describing what codecs can be used (e.g. maybe the block loader or the LinkContext although they might be too global).
    • Note: this is just another example of the signaling problem, so we could just rely on one of the mechanisms we have which is the ADL name and however we are using the UnixFS selector currently.
    • It's possible that UnixFS won't be the only ADL that wants to be able to be tied to a given codec, so finding ways to describe this type of "codec-required" behavior better than doing a type cast (e.g. what if the DagPB implementation is of a different type) might be useful too (@Stebalien IIRC you alluded to these types of ADLs in a recent discussion)

@rvagg
Copy link
Member

rvagg commented Apr 20, 2022

Can I poke a bit more at the notion that this is an implicit spec change, because I'm not really sure it is:

  1. This is for the read interface, so it's kind of like our sloppy codecs that will decode your junk even if it wasn't encoded quite properly (the dag-pb example might be that Links were not ordered properly). On write we can stick to spitting out dag-pb blocks, but maybe that's not even a direct responsibility of this library.
  2. ADLs like this are just a lens through which to view something at the data model; do strictness requirements like "protobuf on protobuf" really belong here or in the consuming library. Wouldn't that be a concern for go-ipfs or some other layer above this library to deal with—just because my optometrist gives me glasses doesn't mean the only way I can possibly use them is directly in front of my eyes, even though a spec for glasses might say that this is the only sensible use. Perhaps I want to start a fire with them?
  3. For selectors, invoking Reify is an opt-in operation isn't it? You have to signal that you want the unixfs reifier. In practice that could mean that I could make a selector and interpretAs: "unixfs" and do it over dag-json data that I have stored, but that's an opt-in from the user. There's no magic path to accidentally doing this, is there?

Further to that last point and re other paths to Reify:

It means that any applications that rely on UnixFS behavior, and this package, (e.g. ipfs cat, ipfs.io/ipfs/bafyfoobar, ...) will have their behavior changed to be more relaxed.

I think the only place this matters is here: https://github.com/ipfs/go-ipfs/blob/67fdb6efcdd1d4c0c0ee6fef7190b92c13184dbd/core/node/core.go#L102

Which is where we could be strict about protobuf-on-protobuf, it could just become:

	unixFSFetcher := ipldFetcher.WithReifier(func(lc linking.LinkContext, n ipld.Node, ls *linking.LinkSystem) (ipld.Node, error) {
		_, ok := n.(dagpb.PBNode)
		if ok {
			return unixfsnode.Reify(lc, n, ls)
		}
		return n, nil
	})

Then we'd have the best of both worlds - this library would remain a lens and not be tied to codecs and we get to do all of the novel things that lenses should be able to do, but we'd also get protobuf-on-protobuf where it matters.

This library remains our most mature ADL and it would be nice to push it forward a bit more because it's the best example we have and currently our best path to maturing the concept of ADLs .. even if UnixFS comes with considerable baggage.

@rvagg
Copy link
Member

rvagg commented Apr 20, 2022

Entirely separate thought that might trigger some folks ... we could make a bunch of UnixFS layering mess go away if we just reinterpret dag-pb to include UnixFS decoding where it exists. So dag-pb is still {Data:.., Links:[..]} but it also has additional properties that decode UnixFS data where it's possible within the given Data (perhaps we add FileData and Metadata properties, leaving what is currently in Data unmolested, just interpreted, I did something similar in the Bitcoin codec and dag-jose has some echoes of this kind of oddity). We wouldn't necessarily include the named UnixFS pathing shenanigans, that'd still be a thing for go-ipfs to handle, and we wouldn't touch sharding, but we could squash protobuf-in-protobuf into a single library, and we could probably do it in a way that doesn't introduce much breakage (i.e. it'd be additive to the dag-pb schema, not subtractive or modifying).

@ribasushi
Copy link

squash protobuf-in-protobuf into a single library

This was the exact reason I pushed back on the change originally. There is no standalone UnixFS spec, it is directly tied to the underlying/outer PB. Aside from the LinkName being significant in UnixFs-Type-1, this part of the inner PB refer to a very strictly (albeit implicitly) defined portion of the outer dag-pb: https://github.com/ipfs/go-unixfs/blob/master/pb/unixfs.proto#L18

I am 💯 supportive rewriting the UnixFS spec to reflect what is in fact in the wild.

@aschmahmann
Copy link
Contributor

This is for the read interface, so it's kind of like our sloppy codecs that will decode your junk even if it wasn't encoded quite properly (the dag-pb example might be that Links were not ordered properly).

In my experience with IPLD thus far, every time we say "popular system X is loose, but you do you" it inevitably comes up a little while later as "can we change the spec to reflect reality and say that X is allowed to be loose". For example: ipld/ipld#196.

Given that go-unixfs is strict and is the largest producer and consumer of this data it seems like we're intentionally setting ourselves up for a spec change by both advocating for the use of this library over go-unixfs and making it looser.

ADLs like this are just a lens through which to view something at the data model; do strictness requirements like "protobuf on protobuf" really belong here or in the consuming library.

As long as it's clear to people which they should use this seems fine. However, saying that go-ipfs follows the UnixFS spec but we recommend people use selectors and the UnixFS ADL which does not... seems bad.

Which is where we could be strict about protobuf-on-protobuf, it could just become:

I thought a bit about this and while it could be nice there are two issues I see:

  1. We'd want to give users helpers and instructions so they don't accidentally use the loose version
  2. This is unfortunately insufficient. It looks sufficient because this PR fixes a single instance of DAG-PB checking. However, as it is this library is looser than go-unixfs since it doesn't check for DAG-PB while loading in blocks inside of the ADL (which this solution would not be able to handle).

For example: go-unixfs has https://github.com/ipfs/go-unixfs/blob/5cd23312773df8e0503fffe8238a163a58168f71/file/unixfile.go#L155

whereas this repo has

if substrate.Kind() == ipld.Kind_Bytes {
and
dataField, err := substrate.LookupByString("Data")
which do not check the codecs at all and which we should probably resolve in the same way we resolve this issue.

@rvagg
Copy link
Member

rvagg commented Apr 22, 2022

Well, to get to the core of my main point - I'd really like this library to have a very limited scope of concerns, as an ADL, which, if possible would mean being agnostic to the codec layer. There's multiple points of difficulty with that, but the one that concerns us here is on the read side, where this library is essentially handling Nodes, pass it a Node, get a Node-interface thing back that does fancier things than the original Node did. Pushing codec concerns to an outer layer would be ideal to try and achieve that, some options:

  • go-ipfs and other users, as I've already mentioned - the downside of this is that we're hiding spec things in code pretty deeply and users can show up here without that context. Perhaps that's solvable by documentation here?
  • go-unixfs - it's already got slow go-ipld-prime creep going on, like go-merkledag has, perhaps we embrace that and push it further by ironing out the overlap and moving functionality here, so we don't just have this two-versions problem (like we do now for go-merkledag/pb and go-codec-dagpb)
  • introduce a sub-package here, go-unixfsnode/pb that does all that messy stuff strictly, which I guess comes back to this "lose vs strict" thing that @aschmahmann already raised. But instead of "lose" being secondary, it's primary because go-unixfsnode is just a pure ADL, but the /pb package could be that strict piece that enfoces the dag-pb pieces we care about.

rvagg added a commit to ipfs/specs that referenced this pull request Jun 27, 2022
The strict coupling poses some problems for IPLD implementations that:

 1. do not retain codec information between the serialization and UnixFS (ADL)
    reification; and/or
 2. do not have a mechanism to strictly mandate that UnixFS data be _only_
    encoded in a particular codec.

A mandate in the specification that strictly defines the layering on top of
the codec makes it difficult to implement it as an ADL, which also presents
difficulty for using tooling that builds on IPLD lens-style layering.

Ref: ipfs/go-unixfsnode#27
Ref: #271
Ref: ipld/go-car#304
@rvagg
Copy link
Member

rvagg commented Jun 27, 2022

I've pushed an update to this which adds tests and also a ReifyDagPB() method that retains the type check.

After considering ipld/go-car#304 and some of the discussion in ipfs/specs#271 I'm less convinced this is even helpful. It does retain a status-quo option though and we can switch go-ipfs to use ReifyDagPB here: https://github.com/ipfs/go-ipfs/blob/862ce6bb8f6ce24b91ed3dc59e7406faed34f583/core/node/core.go#L102

I think we either need to merge ipfs/specs#294 to slightly loosen the spec a bit (my preferred option at the moment), or we're going to need to add some awkward wiring into go-ipld-prime that can tell us what the codec was when decoding the data or build some other adjacent mechanism that can retain this (which will still likely require some go-ipld-prime changes to deal with an external conditional that intercepts the reification).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants