New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
complete reads of lazy nodes in walks #303
Conversation
@@ -259,7 +261,17 @@ func traverse(ctx context.Context, ls *ipld.LinkSystem, root cid.Cid, s ipld.Nod | |||
if err != nil { | |||
return fmt.Errorf("root blk load failed: %s", err) | |||
} | |||
err = progress.WalkMatching(rootNode, sel, func(_ traversal.Progress, _ ipld.Node) error { | |||
err = progress.WalkMatching(rootNode, sel, func(_ traversal.Progress, node ipld.Node) error { | |||
if lbn, ok := node.(datamodel.LargeBytesNode); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we worried here about maps and lists as well? e.g. if I match on a list or map do I need to iterate through the whole thing so I get all the blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't currently have any ADLs that exercise that functionality, so i think we figure out how to better re-factor and not need consumers of traversals to all replicate increasingly complicated logic on how to manage lazy loading once we have an example, like hamts, that we try to actually register as an available reifier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about UnixFS directories, aren't they map ADLs https://github.com/ipfs/go-unixfsnode/blob/650a6150508a73accb8d8014aa39400ad0657433/hamt/shardeddir.go#L161?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay if we file an issue for it? i don't think adding iteration of lists/maps here is going to do anything other than adding another potential form of inconsistency at this point, so i'd like to figure out what we actually need to do in ipld-prime-traversal level for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if I match on a list or map do I need to iterate through the whole thing so I get all the blocks
I would think we're safe in that you're going to touch all of the blocks involved in getting to your leaves even through map & list ADLs, not necessarily all the blocks, but the right ones according to the selector. If you want the whole map, then your selector should probably be adjusted, and maybe you shouldn't InterpretAs
. I think even if you're doing an ADL walk and matching-all then you should get all of the list or map contents that are relevant to the ADL so that'd still be covered I think (again, don't do it as an ADL if you want to be sure it's exhaustive on the whole DAG).
What about Slice
though? That seems like a relevant matcher to take into account here doesn't it? If I only want a subset of bytes, I can craft a selector for it and I would expect only the relevant blocks to be included. Maybe that's a niche use-case, but it's still a use-case? (If you think this is relevant then that could just be shunted to TODO in the code and maybe an issue as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@willscott Wont this discard the actual leaf bytes that we are looking for as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aarshkshah1992 the important part of this is that we load the blocks, that's where the list of blocks to include is being catalogued. So we can discard the bytes, but in the process we're forcing it to load all of the blocks involved so they're going to be included.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think even if you're doing an ADL walk and matching-all then you should get all of the list or map contents that are relevant to the ADL so that'd still be covered I think
Interesting idea. Does this mean it doesn't really make sense to "match" on a directory/list rather than the elements anyhow? Or does it only not make sense within block-transfer contexts (CAR files, downloading data over Bitswap/GraphSync, etc.).?
What about Slice though? That seems like a relevant matcher to take into account here doesn't it? If I only want a subset of bytes, I can craft a selector for it and I would expect only the relevant blocks to be included.
IIUC this case with Slices is the main point of this PR. Will this not cover that? Perhaps once the tests are added (combined with #305) it'll be easier to see. IIUC Slice
covers String, Bytes (and LargeBytes). Do we have to add code to cover String + Bytes as well as LargeBytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re matching - technically yeah, currently with the usecases we have, you really should only be concerned about the nodes that get you to the leaves that you care about. Perhaps there will be cases (in the future) where you're using an ADL that has hidden metadata that's not touched by a selector but really should be used; so maybe in the "I want all of this" cases, you should avoid doing it at the ADL level and just run an exhaustive selector from the root on the untyped node.
Re Slice
- you're right - this uses Matcher_Subset
which uses Slice
internally which is doing the LargeBytesNode
@ https://github.com/ipld/go-ipld-prime/blob/651b1cd5be900ae04b703bdff39b7727c26f36a6/traversal/selector/matcher.go#L52
While nice, there's only fairly marginal gains to add tests for anything other than LargeBytesNode
since we're pretty focused on which blocks are being loaded and we currently don't have an interface like that for ADLs for Strings (i.e. without an equivalent of LargeBytesNode
you're going to get the whole string--all nodes--even if it's an ADL and you want only a subset). This PR (& #305) are focused on block loading and limiting that to what's needed.
sgtm, but tests |
going to prefer combining (and writing the test) in #305 - as that has the rest of what's needed for a traversal using a partial matcher to work as expected |
Closing now that #305 is merged. |
This change is to update the traversals in go-car's v2 selective writer use the same traversal behavior as we have introduced in graphsync for unixfs adl lazy-loading node:
https://github.com/ipfs/go-graphsync/blob/49f40a571071f2c871f5548bd52c08ebc115f841/requestmanager/server.go#L140-L148
It means that when a node implements the
LargeBytesNode
interface, we want to read the associated io.reader such that children accessed by the ADL are triggered and we can see the affected substrate blocks that will be needed by subsequent usage.