complete reads of lazy nodes in walks #303

Closed · wants to merge 1 commit

Conversation

willscott (Member)

This change updates the traversals in go-car's v2 selective writer to use the same traversal behavior as we introduced in graphsync for the unixfs ADL's lazy-loading nodes:
https://github.com/ipfs/go-graphsync/blob/49f40a571071f2c871f5548bd52c08ebc115f841/requestmanager/server.go#L140-L148

This means that when a node implements the LargeBytesNode interface, we read the associated io.Reader so that the children accessed by the ADL are triggered and we can see the substrate blocks that will be needed by subsequent usage.

@@ -259,7 +261,17 @@ func traverse(ctx context.Context, ls *ipld.LinkSystem, root cid.Cid, s ipld.Nod
 	if err != nil {
 		return fmt.Errorf("root blk load failed: %s", err)
 	}
-	err = progress.WalkMatching(rootNode, sel, func(_ traversal.Progress, _ ipld.Node) error {
+	err = progress.WalkMatching(rootNode, sel, func(_ traversal.Progress, node ipld.Node) error {
+		if lbn, ok := node.(datamodel.LargeBytesNode); ok {
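The hunk above is truncated at the new LargeBytesNode check. As a rough sketch (not this PR's literal code), the body of that branch could drain the reader along the lines of the graphsync code linked in the description; the drainLargeBytes helper name, the package name, and the io.Copy into io.Discard are illustrative assumptions:

package traverseutil

import (
	"io"

	"github.com/ipld/go-ipld-prime/datamodel"
)

// drainLargeBytes is a hedged sketch: it drains a LargeBytesNode so the ADL is
// forced to load every substrate block backing the bytes matched by the
// selector; the bytes themselves are thrown away.
func drainLargeBytes(node datamodel.Node) error {
	lbn, ok := node.(datamodel.LargeBytesNode)
	if !ok {
		return nil
	}
	rdr, err := lbn.AsLargeBytes()
	if err != nil {
		return err
	}
	// Reading to EOF is the side effect we want: the link system sees every
	// block the ADL had to load to serve these bytes, so those blocks get
	// catalogued for the CAR output.
	_, err = io.Copy(io.Discard, rdr)
	return err
}
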
Contributor:

Are we worried here about maps and lists as well? e.g. if I match on a list or map do I need to iterate through the whole thing so I get all the blocks?

willscott (Member, Author):


We don't currently have any ADLs that exercise that functionality. Once we have an example, like HAMTs, that we actually try to register as an available reifier, I think we can figure out how to refactor so that consumers of traversals don't all need to replicate increasingly complicated logic for managing lazy loading.

willscott (Member, Author):

Okay if we file an issue for it? I don't think adding iteration of lists/maps here is going to do anything other than add another potential form of inconsistency at this point, so I'd like to figure out what we actually need to do at the ipld-prime traversal level for that.

Member:

> if I match on a list or map do I need to iterate through the whole thing so I get all the blocks

I would think we're safe, in that you're going to touch all of the blocks involved in getting to your leaves even through map & list ADLs: not necessarily all the blocks, but the right ones according to the selector. If you want the whole map, then your selector should probably be adjusted, and maybe you shouldn't InterpretAs. Even if you're doing an ADL walk and matching-all, I think you should get all of the list or map contents that are relevant to the ADL, so that'd still be covered (again, don't do it as an ADL if you want to be sure it's exhaustive over the whole DAG).

What about Slice though? That seems like a relevant matcher to take into account here, doesn't it? If I only want a subset of bytes, I can craft a selector for it and I would expect only the relevant blocks to be included. Maybe that's a niche use-case, but it's still a use-case? (If you think this is relevant then that could just be shunted to a TODO in the code, and maybe an issue as well.)

Contributor:


@willscott Won't this discard the actual leaf bytes that we are looking for as well?

Member:


@aarshkshah1992 The important part of this is that we load the blocks; that's where the list of blocks to include is being catalogued. So we can discard the bytes, but in the process we're forcing it to load all of the blocks involved, so they're going to be included.

Contributor:


> Even if you're doing an ADL walk and matching-all, I think you should get all of the list or map contents that are relevant to the ADL, so that'd still be covered

Interesting idea. Does this mean it doesn't really make sense to "match" on a directory/list rather than the elements anyhow? Or does it only not make sense within block-transfer contexts (CAR files, downloading data over Bitswap/GraphSync, etc.)?

> What about Slice though? That seems like a relevant matcher to take into account here, doesn't it? If I only want a subset of bytes, I can craft a selector for it and I would expect only the relevant blocks to be included.

IIUC this case with Slices is the main point of this PR. Will this not cover that? Perhaps once the tests are added (combined with #305) it'll be easier to see. IIUC Slice covers String, Bytes (and LargeBytes). Do we have to add code to cover String + Bytes as well as LargeBytes?

Member:


Re matching: technically yeah, with the use cases we currently have, you really should only be concerned about the nodes that get you to the leaves that you care about. Perhaps there will be cases (in the future) where you're using an ADL that has hidden metadata that's not touched by a selector but really should be used; so maybe in the "I want all of this" cases you should avoid doing it at the ADL level and just run an exhaustive selector from the root on the untyped node.

Re Slice: you're right, this uses Matcher_Subset, which uses Slice internally, which does the LargeBytesNode read at https://github.com/ipld/go-ipld-prime/blob/651b1cd5be900ae04b703bdff39b7727c26f36a6/traversal/selector/matcher.go#L52

While it would be nice, there are only fairly marginal gains in adding tests for anything other than LargeBytesNode, since we're pretty focused on which blocks are being loaded and we currently don't have an interface like that for ADL Strings (i.e. without an equivalent of LargeBytesNode you're going to get the whole string, all nodes, even if it's an ADL and you only want a subset). This PR (& #305) are focused on block loading and limiting that to what's needed.
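
To make the Slice/subset case concrete, here is an illustrative sketch (not code from this PR) of a selector that matches only a byte range through the unixfs ADL; the MatcherSubset builder call and the "unixfs" reifier name are assumptions about the go-ipld-prime and go-unixfsnode APIs, not something taken from this repository:

package main

import (
	"fmt"

	"github.com/ipld/go-ipld-prime/datamodel"
	"github.com/ipld/go-ipld-prime/node/basicnode"
	"github.com/ipld/go-ipld-prime/traversal/selector/builder"
)

// subsetSelector builds a selector that reinterprets the root as a unixfs ADL
// and matches only bytes [from, to) of the reassembled file. Combined with the
// LargeBytesNode handling above, only the substrate blocks backing that range
// should end up being loaded (and therefore written out).
func subsetSelector(from, to int64) (datamodel.Node, error) {
	ssb := builder.NewSelectorSpecBuilder(basicnode.Prototype.Any)
	spec := ssb.ExploreInterpretAs("unixfs", ssb.MatcherSubset(from, to))
	if _, err := spec.Selector(); err != nil { // compile it to validate
		return nil, err
	}
	return spec.Node(), nil
}

func main() {
	sel, err := subsetSelector(0, 1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(sel.Kind())
}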

rvagg (Member) commented May 27, 2022

sgtm, but tests

willscott (Member, Author) commented May 27, 2022

Going to prefer combining this (and writing the test) in #305, as that has the rest of what's needed for a traversal using a partial matcher to work as expected.

willscott (Member, Author):

Closing now that #305 is merged.

willscott closed this on May 31, 2022