
Discussion: is one graphsync.ResponseProgress message per Node in a graph reasonable? #326

Open
rvagg opened this issue Jan 6, 2022 · 3 comments
Labels
need/triage Needs initial labeling and prioritization

Comments

@rvagg
Member

rvagg commented Jan 6, 2022

A Request() returns a channel that gives you all of the Nodes of a DAG traversal, rather than the Blocks of that traversal.

In datatransfer, I see that it just does a `for range req.responseChan {}`, so the messages aren't used; I don't know where else they might be consumed, though.
But for a DAG with very complex blocks (like the filecoin chain, for example), this means that if someone uses a matching selector, they could trigger the creation of a graphsync.ResponseProgress for every single point of data (Node) within each of the blocks.

Take, for example, the plain blockchain used in the test cases: 99 blocks, each of which is just `{Parents:[CID], Messages:[bytes,...]}`. Even with this very simple block format and a selector that limits itself to exploring Parents, each block generates two progress messages: one for visiting Parents and one for visiting the first element of the list. So a 99-block chain, with this super-simple structure and a selector limited to just Parents, generates 198 messages.
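To make that arithmetic concrete, here is a minimal, self-contained sketch; the constant names are illustrative, not real graphsync identifiers:

```go
package main

import "fmt"

// progressMessagesPerBlock counts the ResponseProgress-style messages a
// WalkAdv-based traversal emits for one block of the simple chain format
// {Parents:[CID], Messages:[bytes,...]} when the selector only explores
// Parents: one message for visiting the Parents list node and one for
// visiting its first (and only) element.
const progressMessagesPerBlock = 2

func main() {
	const chainLength = 99 // blocks in the test-case blockchain
	total := chainLength * progressMessagesPerBlock
	fmt.Println(total) // 198 messages just to follow the Parents links
}
```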

Since we use WalkAdv, it doesn't even matter whether it's a matching or an explore selector: it triggers this for every single data element within a block.

This goes to some of the concerns I've been raising recently about how we use selectors practically, and about the distinction between Nodes and Blocks getting in the way. What would the user of this API expect from a Request()? Is it reasonable that they get an exhaustive list of messages for each Node, or would they more logically expect one for each Block? Or maybe neither, since we're really just using it as a block store. And in terms of performance, what are the implications of having so many messages per block? Even for DAG-PB blocks, we're going to trigger over 100 messages for a single block in many cases (e.g. MFS or just dense directory blocks).

@rvagg rvagg added the need/triage Needs initial labeling and prioritization label Jan 6, 2022
@rvagg
Member Author

rvagg commented Jan 6, 2022

In terms of user API expectation, I could imagine one of these two scenarios:

  1. I want progress in terms of blocks that you've successfully traversed, so one message per block would be better.
  2. I want progress in terms of the selector I'm using, and if I don't use a matching selector (just an explore one), then maybe I don't expect any progress at all?
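Option 1 could be sketched as a trivial wrapper that collapses the node-level stream to one message per block, forwarding only when the source block's link changes. The types below are illustrative stand-ins, not the real graphsync/ipld-prime structs:

```go
package main

import "fmt"

// Illustrative stand-ins: the real graphsync.ResponseProgress records
// which block the visited node was loaded from.
type link string

type responseProgress struct {
	Path      string
	LastBlock link // link of the block containing this node
}

// onePerBlock collapses a node-level progress stream into one message
// per block by forwarding a message only when the block link changes.
// This sketches scenario 1 above; it is not the actual graphsync API.
func onePerBlock(in <-chan responseProgress) <-chan responseProgress {
	out := make(chan responseProgress)
	go func() {
		defer close(out)
		var last link
		first := true
		for p := range in {
			if first || p.LastBlock != last {
				out <- p
				last = p.LastBlock
				first = false
			}
		}
	}()
	return out
}

func main() {
	in := make(chan responseProgress)
	go func() {
		// two blocks, two node visits each, as in the Parents example
		in <- responseProgress{Path: "Parents", LastBlock: "blkA"}
		in <- responseProgress{Path: "Parents/0", LastBlock: "blkA"}
		in <- responseProgress{Path: "Parents", LastBlock: "blkB"}
		in <- responseProgress{Path: "Parents/0", LastBlock: "blkB"}
		close(in)
	}()
	count := 0
	for range onePerBlock(in) {
		count++
	}
	fmt.Println(count) // 2: one message per block instead of 4
}
```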

@hannahhoward
Collaborator

hannahhoward commented Jan 6, 2022

Yeah, in short, I agree... and I'm not sure where we should head with this to start.

The original thought was: Graphsync is a protocol that works with IPLD graphs, it should return IPLD data, in the new format that supports selectors, aka go-ipld-prime nodes.

The problem is that, under the hood, Graphsync is basically about moving blocks around, and it's not entirely clear how useful the "over the hood" part (the actual IPLD data) turns out to be.

Plus, the distinction between Explore/Matching wasn't well defined when we first created go-graphsync and, as you point out, we don't respect Matching when limiting returned results.

My initial inclination is just to support Matching properly and respect it when returning results, because it's pretty easy to do and doesn't change the API.

I feel like "match only on block boundary" is best done in the world of IPLD / Selectors? I'm not sure. Plus, it's trivial to write a filter, and we've done that in the past, so maybe that's the easiest.
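For what such a filter might look like: ipld-prime's traversal.Progress records both the node's own path and the path at which its containing block was loaded, so a node sits on a block boundary when the two are equal. A simplified stand-in (not the real structs, which carry ipld paths and links):

```go
package main

import "fmt"

// progress is a simplified stand-in for ipld-prime's traversal.Progress:
// Path is the node's path in the DAG, LastBlockPath is the path at which
// the node's containing block was loaded.
type progress struct {
	Path          string
	LastBlockPath string
}

// isBlockBoundary is the kind of trivial filter mentioned above: a node
// is the root of a freshly loaded block exactly when its own path equals
// the path at which that block was loaded.
func isBlockBoundary(p progress) bool {
	return p.Path == p.LastBlockPath
}

func main() {
	events := []progress{
		{Path: "", LastBlockPath: ""},                   // root block's root node
		{Path: "Parents", LastBlockPath: ""},            // interior node, same block
		{Path: "Parents/0", LastBlockPath: ""},          // interior node, same block
		{Path: "Parents/0", LastBlockPath: "Parents/0"}, // root node of the linked block
	}
	kept := 0
	for _, e := range events {
		if isBlockBoundary(e) {
			kept++
		}
	}
	fmt.Println(kept) // 2 of 4 events survive the block-boundary filter
}
```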

My main goal would be: don't change the return types. If we can make that work, we should be good.

@rvagg
Member Author

rvagg commented Jan 6, 2022

> I feel like "match only on block boundary" is best done in the world of IPLD / Selectors? I'm not sure. Plus, it's trivial to write a filter, and we've done that in the past, so maybe that's the easiest.

Yeah, this is the next step, I think. I've been suggesting that ipld-prime's traversal should have better affordances for blocks, since that's what the majority of current use cases need; for now we're forced to deal with the loader for most of it, which is a bit clunky in terms of API clarity.

marten-seemann pushed a commit that referenced this issue Mar 2, 2023
avoid an unnecessary serialization roundtrip when sending requests for selectors