complete reads of lazy nodes in walks #303

Closed · wants to merge 1 commit

Conversation

willscott (Member)

This change updates the traversals in go-car's v2 selective writer to use the same traversal behavior as we introduced in graphsync for the unixfs ADL's lazy-loading nodes:
https://github.com/ipfs/go-graphsync/blob/49f40a571071f2c871f5548bd52c08ebc115f841/requestmanager/server.go#L140-L148

This means that when a node implements the LargeBytesNode interface, we read the associated io.Reader so that the children accessed by the ADL are triggered and we can see the substrate blocks that will be needed by subsequent usage.

@@ -259,7 +261,17 @@ func traverse(ctx context.Context, ls *ipld.LinkSystem, root cid.Cid, s ipld.Nod
 	if err != nil {
 		return fmt.Errorf("root blk load failed: %s", err)
 	}
-	err = progress.WalkMatching(rootNode, sel, func(_ traversal.Progress, _ ipld.Node) error {
+	err = progress.WalkMatching(rootNode, sel, func(_ traversal.Progress, node ipld.Node) error {
+		if lbn, ok := node.(datamodel.LargeBytesNode); ok {
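The hunk above is truncated at the new LargeBytesNode check. As a rough sketch (not this PR's literal code), the body of that branch could drain the reader along the lines of the graphsync code linked in the description; the drainLargeBytes helper name, the package name, and the io.Copy into io.Discard are illustrative assumptions:

package traverseutil

import (
	"io"

	"github.com/ipld/go-ipld-prime/datamodel"
)

// drainLargeBytes is a hedged sketch: it drains a LargeBytesNode so the ADL is
// forced to load every substrate block backing the bytes matched by the
// selector; the bytes themselves are thrown away.
func drainLargeBytes(node datamodel.Node) error {
	lbn, ok := node.(datamodel.LargeBytesNode)
	if !ok {
		return nil
	}
	rdr, err := lbn.AsLargeBytes()
	if err != nil {
		return err
	}
	// Reading to EOF is the side effect we want: the link system sees every
	// block the ADL had to load to serve these bytes, so those blocks get
	// catalogued for the CAR output.
	_, err = io.Copy(io.Discard, rdr)
	return err
}
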
Contributor:

Are we worried here about maps and lists as well? e.g. if I match on a list or map do I need to iterate through the whole thing so I get all the blocks?

willscott (Member, Author):


We don't currently have any ADLs that exercise that functionality. Once we have an example, like HAMTs, that we actually try to register as an available reifier, I think we can figure out how to refactor so that consumers of traversals don't all need to replicate increasingly complicated logic for managing lazy loading.

willscott (Member, Author):

Okay if we file an issue for it? I don't think adding iteration of lists/maps here is going to do anything other than add another potential form of inconsistency at this point, so I'd like to figure out what we actually need to do at the ipld-prime traversal level for that.

Member:

> if I match on a list or map do I need to iterate through the whole thing so I get all the blocks

I would think we're safe, in that you're going to touch all of the blocks involved in getting to your leaves even through map & list ADLs: not necessarily all the blocks, but the right ones according to the selector. If you want the whole map, then your selector should probably be adjusted, and maybe you shouldn't InterpretAs. Even if you're doing an ADL walk and matching-all, I think you should get all of the list or map contents that are relevant to the ADL, so that'd still be covered (again, don't do it as an ADL if you want to be sure it's exhaustive over the whole DAG).

What about Slice though? That seems like a relevant matcher to take into account here, doesn't it? If I only want a subset of bytes, I can craft a selector for it and I would expect only the relevant blocks to be included. Maybe that's a niche use-case, but it's still a use-case? (If you think this is relevant then that could just be shunted to a TODO in the code, and maybe an issue as well.)

Contributor:


@willscott Won't this discard the actual leaf bytes that we are looking for as well?

Member:


@aarshkshah1992 The important part of this is that we load the blocks; that's where the list of blocks to include is being catalogued. So we can discard the bytes, but in the process we're forcing it to load all of the blocks involved, so they're going to be included.

Contributor:


> Even if you're doing an ADL walk and matching-all, I think you should get all of the list or map contents that are relevant to the ADL, so that'd still be covered

Interesting idea. Does this mean it doesn't really make sense to "match" on a directory/list rather than the elements anyhow? Or does it only not make sense within block-transfer contexts (CAR files, downloading data over Bitswap/GraphSync, etc.)?

> What about Slice though? That seems like a relevant matcher to take into account here, doesn't it? If I only want a subset of bytes, I can craft a selector for it and I would expect only the relevant blocks to be included.

IIUC this case with Slices is the main point of this PR. Will this not cover that? Perhaps once the tests are added (combined with #305) it'll be easier to see. IIUC Slice covers String, Bytes (and LargeBytes). Do we have to add code to cover String + Bytes as well as LargeBytes?

Member:


Re matching: technically yeah, with the use cases we currently have, you really should only be concerned about the nodes that get you to the leaves that you care about. Perhaps there will be cases (in the future) where you're using an ADL that has hidden metadata that's not touched by a selector but really should be used; so maybe in the "I want all of this" cases you should avoid doing it at the ADL level and just run an exhaustive selector from the root on the untyped node.

Re Slice: you're right, this uses Matcher_Subset, which uses Slice internally, which does the LargeBytesNode read at https://github.com/ipld/go-ipld-prime/blob/651b1cd5be900ae04b703bdff39b7727c26f36a6/traversal/selector/matcher.go#L52

While it would be nice, there are only fairly marginal gains in adding tests for anything other than LargeBytesNode, since we're pretty focused on which blocks are being loaded and we currently don't have an interface like that for ADL Strings (i.e. without an equivalent of LargeBytesNode you're going to get the whole string, all nodes, even if it's an ADL and you only want a subset). This PR (& #305) are focused on block loading and limiting that to what's needed.
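
To make the Slice/subset case concrete, here is an illustrative sketch (not code from this PR) of a selector that matches only a byte range through the unixfs ADL; the MatcherSubset builder call and the "unixfs" reifier name are assumptions about the go-ipld-prime and go-unixfsnode APIs, not something taken from this repository:

package main

import (
	"fmt"

	"github.com/ipld/go-ipld-prime/datamodel"
	"github.com/ipld/go-ipld-prime/node/basicnode"
	"github.com/ipld/go-ipld-prime/traversal/selector/builder"
)

// subsetSelector builds a selector that reinterprets the root as a unixfs ADL
// and matches only bytes [from, to) of the reassembled file. Combined with the
// LargeBytesNode handling above, only the substrate blocks backing that range
// should end up being loaded (and therefore written out).
func subsetSelector(from, to int64) (datamodel.Node, error) {
	ssb := builder.NewSelectorSpecBuilder(basicnode.Prototype.Any)
	spec := ssb.ExploreInterpretAs("unixfs", ssb.MatcherSubset(from, to))
	if _, err := spec.Selector(); err != nil { // compile it to validate
		return nil, err
	}
	return spec.Node(), nil
}

func main() {
	sel, err := subsetSelector(0, 1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(sel.Kind())
}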

rvagg (Member) commented May 27, 2022

sgtm, but tests

willscott (Member, Author) commented May 27, 2022

Going to prefer combining this (and writing the test) in #305, as that has the rest of what's needed for a traversal using a partial matcher to work as expected.

willscott (Member, Author):

Closing now that #305 is merged.

willscott closed this on May 31, 2022