Improved resumption of requests #357

hannahhoward · 2022-02-12T01:33:20Z

We have 2 rounds of "how do I pick my requests after I left off?" extensions:

DoNotSendCIDs - by provide a list of CIDs not to send, I can specify exactly blocks what I want the requestor to provide. This provides an extension with predictable behavior with a remote that can be used to resume requests. The problem is that it's extremely cumbersome for large DAGs and we quickly hit situations where sending all the CIDs we don't want is just not feasible. It also requires the requestor to assemble a list of CIDs of what it has somehow.

DoNotSendFirstBlocks - this is our current extension. We simply pass a block index at which to start sending blocks, with the instruction to simply not send anything any blocks before that point in the traversal. This extension works in most cases and is very compact. However, does not ultimately provide a predictable resume. The responder might skip over data it was missing in the initial transfer. If it acquired that data in the time before the request was resumed, suddenly DoNotSendFirstBlocks could behave quite unpredictably, as now those first blocks include blocks that weren't there before. (I almost forgot this was the reason I pursued the DoNotSendCIDs initially) In practice, this isn't much of a problem in Filecoin, where SPs tend to have the same data they've always had (even if they are missing some of it). But it seems like an issue for IPFS or a Retrieval Market where people's data stores change all the time.

Both of these extensions have an additional problem: they require the server to reexecute the selector traversal up to the point of resumption, even if the provider is not intend to send the data.

It seems to me we need an extensions that:

can represent a resume point in the traversal compactly
can do it in a predictable way.
allows the provider to optimally skip as much traversal as is possible up to the point of resumption

I believe the simplest way to represent this is an IPLD path. Moreover, there's already a proposed addition to selector traversal to allow it to resume at a path -- ipld/go-ipld-prime#358

I'm not sure the above is the exact right solution, but I believe a ResumeAtPath extension is our best path forward.

willscott · 2022-02-13T18:16:29Z

Resume at path seems potentially good.

I'll note that go-ipld-prime#358 does not fix the 'optimally skip as much traversal as is possible' in a way that's particularly better than what should be possible with the current 'DoNotSendFirstBlocks' option.

In both cases, additional state than that sent on the wire is needed. It is neither a sub-path, nor a count of blocks that is sufficient to efficiently re-calculate a point in the traversal.

In both cases, the ability to efficiently resume relies on the fact that it is a resumption - that there has recently been a request for the same traversal that has partially completed and for which the server still has cached state.
It's the application / use of that cached state that can provide efficient resumption.

As long as the server has a clean and pro-active way to signal if can honor the request for resumption, the choice of number of blocks vs path doesn't matter too much.

rvagg · 2022-02-14T03:57:07Z

^ Agreed, but maybe this is just an incremental improvement over current state that solves primarily for the second of the criteria - predictability ("determinism"). Sending a path is less efficient than a block count, but much more efficient than a CID list. So this seems like a reasonable incremental improvement.

Perhaps there's another iteration in the future where state is saved in some kind of cache that can allow for actually efficient resumption, without exposing the protocol to a big DoS risk. One of the challenges with this might be that the resumption point that the client wants may be different to the cached state that the server has (because that's the nature of unexpected network disconnects). It might be that we have an efficient resumption from an agreed upon pause point, and inefficient resumption when we have a mismatch in shared state. But that could just be an internal concern for the sender - if it's asked to resume at a path that doesn't match the state that it's cached, then it has to restart the traversal using ipld/go-ipld-prime#358.

hannahhoward · 2022-02-17T01:04:51Z

I'll note that go-ipld-prime#358 does not fix the 'optimally skip as much traversal as is possible' in a way that's particularly better than what should be possible with the current 'DoNotSendFirstBlocks' option.

I do not believe this to be the case.

First to clarify:

In both cases, additional state than that sent on the wire is needed. It is neither a sub-path, nor a count of blocks that is sufficient to efficiently re-calculate a point in the traversal.
In both cases, the ability to efficiently resume relies on the fact that it is a resumption - that there has recently been a request for the same traversal that has partially completed and for which the server still has cached state.
It's the application / use of that cached state that can provide efficient resumption.

This is not the case. Resuming in its current form operates by cancelling a request and sending a new one. The reason we have these extensions at all is to allow NEW GraphSync requests to start a specific point.

DoNotSendFirstBlocks operates by starting a selector traversal on the responder from scratch. It loads every block in the traversal from disk, but simply avoids sending the blocks over the wire until it reaches a specific point in the traversal.

ipld/go-ipld-prime#358 similarly is designed to start a traversal from scratch. However, unlike a block based approach, starting from a path allows us to skip over entire portions of the traversal at each tree depth. At each depth level, you only load those links which come after the corresponding segment of the StartAtPath. This should allow you to skip not just sending blocks over the wire, but execution of entire segments of the selector traversal (i.e. not loading them from disk).

I guess we can prove it out in benchmarking, but for a sufficiently deep tree, or even a UnixFS file, it should be wildly more efficient.

hannahhoward · 2022-02-17T01:10:29Z

One more note: resume at path does have a seperate efficiency problem for dags that traverse a particular block many times -- since it doesn't traverse the DAG in its entiriety, there's no way to deduplicate sending blocks that are encountered both before and after the startup point in the traversal. So ultimately, it's a tradeoff -- for a deeply interlinked DAG, starting from a path absent additional state is inefficient. But for a DAG with minimal duplication -- i.e. most unixfs files -- start from a path is optimal even without extra information.

rvagg · 2022-02-17T04:37:08Z

At each depth level, you only load those links which come after the corresponding segment of the StartAtPath.

yeah ok, I buy that; you have to perform some, but not all of the existing selector in order to catch up.

hannahhoward added the need/triage Needs initial labeling and prioritization label Feb 12, 2022

hannahhoward assigned willscott and rvagg Feb 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved resumption of requests #357

Improved resumption of requests #357

hannahhoward commented Feb 12, 2022 •

edited

willscott commented Feb 13, 2022

rvagg commented Feb 14, 2022

hannahhoward commented Feb 17, 2022

hannahhoward commented Feb 17, 2022

rvagg commented Feb 17, 2022

Improved resumption of requests #357

Improved resumption of requests #357

Comments

hannahhoward commented Feb 12, 2022 • edited

willscott commented Feb 13, 2022

rvagg commented Feb 14, 2022

hannahhoward commented Feb 17, 2022

hannahhoward commented Feb 17, 2022

rvagg commented Feb 17, 2022

hannahhoward commented Feb 12, 2022 •

edited