
Discussion: is one graphsync.ResponseProgress message per Node in a graph reasonable? #326

Open
rvagg opened this issue Jan 6, 2022 · 3 comments
Labels
need/triage Needs initial labeling and prioritization

Comments

@rvagg
Member

rvagg commented Jan 6, 2022

A Request() returns a channel that gives you all of the Nodes of a DAG traversal, rather than the Blocks of that traversal.

In datatransfer, I see that it just does a `for range req.responseChan {}`, so the messages aren't used; I don't know where else they might be consumed, though.
But for a DAG with very complex blocks (like the filecoin chain, for example), this means that if someone uses a matching selector, they could trigger the creation of a graphsync.ResponseProgress for every single point of data (Node) within each of the blocks.

Take, for example, the plain blockchain used in the test cases: 99 blocks, each of which is just `{Parents:[CID], Messages:[bytes,...]}`. Even with this very simple block format and a selector that limits itself to exploring Parents, each block generates two progress messages: one for visiting Parents and one for visiting the first element of the list. So a 99-block chain, with this super-simple structure and a selector limited to just Parents, generates 198 messages.
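To make that arithmetic concrete, here is a minimal, self-contained sketch; the constant names are illustrative, not real graphsync identifiers:

```go
package main

import "fmt"

// progressMessagesPerBlock counts the ResponseProgress-style messages a
// WalkAdv-based traversal emits for one block of the simple chain format
// {Parents:[CID], Messages:[bytes,...]} when the selector only explores
// Parents: one message for visiting the Parents list node and one for
// visiting its first (and only) element.
const progressMessagesPerBlock = 2

func main() {
	const chainLength = 99 // blocks in the test-case blockchain
	total := chainLength * progressMessagesPerBlock
	fmt.Println(total) // 198 messages just to follow the Parents links
}
```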

Since we use WalkAdv, it doesn't even matter whether it's a matching or an explore selector: it triggers this for every single data element within a block.

This goes to some of the concerns I've been raising recently about how we use selectors practically, and about the distinction between Nodes and Blocks getting in the way. What would the user of this API expect from a Request()? Is it reasonable that they get an exhaustive list of messages for each Node, or would they more logically expect one for each Block? Or maybe neither, since we're really just using it as a block store. And in terms of performance, what are the implications of having so many messages per block? Even for DAG-PB blocks, we're going to trigger over 100 messages for a single block in many cases (e.g. MFS or just dense directory blocks).

@rvagg rvagg added the need/triage Needs initial labeling and prioritization label Jan 6, 2022
@rvagg
Member Author

rvagg commented Jan 6, 2022

In terms of user API expectation, I could imagine one of these two scenarios:

  1. I want progress in terms of blocks that you've successfully traversed, so one message per block would be better.
  2. I want progress in terms of the selector I'm using, and if I don't use a matching selector (just an explore one), then maybe I don't expect any progress at all?
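Option 1 could be sketched as a trivial wrapper that collapses the node-level stream to one message per block, forwarding only when the source block's link changes. The types below are illustrative stand-ins, not the real graphsync/ipld-prime structs:

```go
package main

import "fmt"

// Illustrative stand-ins: the real graphsync.ResponseProgress records
// which block the visited node was loaded from.
type link string

type responseProgress struct {
	Path      string
	LastBlock link // link of the block containing this node
}

// onePerBlock collapses a node-level progress stream into one message
// per block by forwarding a message only when the block link changes.
// This sketches scenario 1 above; it is not the actual graphsync API.
func onePerBlock(in <-chan responseProgress) <-chan responseProgress {
	out := make(chan responseProgress)
	go func() {
		defer close(out)
		var last link
		first := true
		for p := range in {
			if first || p.LastBlock != last {
				out <- p
				last = p.LastBlock
				first = false
			}
		}
	}()
	return out
}

func main() {
	in := make(chan responseProgress)
	go func() {
		// two blocks, two node visits each, as in the Parents example
		in <- responseProgress{Path: "Parents", LastBlock: "blkA"}
		in <- responseProgress{Path: "Parents/0", LastBlock: "blkA"}
		in <- responseProgress{Path: "Parents", LastBlock: "blkB"}
		in <- responseProgress{Path: "Parents/0", LastBlock: "blkB"}
		close(in)
	}()
	count := 0
	for range onePerBlock(in) {
		count++
	}
	fmt.Println(count) // 2: one message per block instead of 4
}
```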

@hannahhoward
Collaborator

hannahhoward commented Jan 6, 2022

Yeah, in short, I agree... and I'm not sure where we should head with this to start.

The original thought was: Graphsync is a protocol that works with IPLD graphs, it should return IPLD data, in the new format that supports selectors, aka go-ipld-prime nodes.

The problem is that, under the hood, Graphsync is basically about moving blocks around, and it's not entirely clear how useful the "over the hood" part (the actual IPLD data) turns out to be.

Plus, the distinction between Explore/Matching wasn't well defined when we first created go-graphsync and, as you point out, we don't respect Matching when limiting returned results.

My initial inclination is just to support Matching properly and respect it when returning results, because it's pretty easy to do and doesn't change the API.

I feel like "match only on block boundary" is best done in the world of IPLD / Selectors? I'm not sure. Plus, it's trivial to write a filter, and we've done that in the past, so maybe that's the easiest.
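For what such a filter might look like: ipld-prime's traversal.Progress records both the node's own path and the path at which its containing block was loaded, so a node sits on a block boundary when the two are equal. A simplified stand-in (not the real structs, which carry ipld paths and links):

```go
package main

import "fmt"

// progress is a simplified stand-in for ipld-prime's traversal.Progress:
// Path is the node's path in the DAG, LastBlockPath is the path at which
// the node's containing block was loaded.
type progress struct {
	Path          string
	LastBlockPath string
}

// isBlockBoundary is the kind of trivial filter mentioned above: a node
// is the root of a freshly loaded block exactly when its own path equals
// the path at which that block was loaded.
func isBlockBoundary(p progress) bool {
	return p.Path == p.LastBlockPath
}

func main() {
	events := []progress{
		{Path: "", LastBlockPath: ""},                   // root block's root node
		{Path: "Parents", LastBlockPath: ""},            // interior node, same block
		{Path: "Parents/0", LastBlockPath: ""},          // interior node, same block
		{Path: "Parents/0", LastBlockPath: "Parents/0"}, // root node of the linked block
	}
	kept := 0
	for _, e := range events {
		if isBlockBoundary(e) {
			kept++
		}
	}
	fmt.Println(kept) // 2 of 4 events survive the block-boundary filter
}
```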

My main goal would be: don't change the return types. If we can make that work, we should be good.

@rvagg
Member Author

rvagg commented Jan 6, 2022

> I feel like "match only on block boundary" is best done in the world of IPLD / Selectors? I'm not sure. Plus, it's trivial to write a filter, and we've done that in the past, so maybe that's the easiest.

Yeah, this is the next step, I think. I've been suggesting that ipld-prime's traversal should have better affordances for blocks, since that's what the majority of current use cases need; for now we're forced to deal with the loader for most of it, which is a bit clunky in terms of API clarity.

marten-seemann pushed a commit that referenced this issue Mar 2, 2023
avoid an unnecessary serialization roundtrip when sending requests for selectors