Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved resumption of requests #357

Open
hannahhoward opened this issue Feb 12, 2022 · 5 comments
Open

Improved resumption of requests #357

hannahhoward opened this issue Feb 12, 2022 · 5 comments
Assignees
Labels
need/triage Needs initial labeling and prioritization

Comments

@hannahhoward
Copy link
Collaborator

hannahhoward commented Feb 12, 2022

We have 2 rounds of "how do I pick my requests after I left off?" extensions:

DoNotSendCIDs - by provide a list of CIDs not to send, I can specify exactly blocks what I want the requestor to provide. This provides an extension with predictable behavior with a remote that can be used to resume requests. The problem is that it's extremely cumbersome for large DAGs and we quickly hit situations where sending all the CIDs we don't want is just not feasible. It also requires the requestor to assemble a list of CIDs of what it has somehow.

DoNotSendFirstBlocks - this is our current extension. We simply pass a block index at which to start sending blocks, with the instruction to simply not send anything any blocks before that point in the traversal. This extension works in most cases and is very compact. However, does not ultimately provide a predictable resume. The responder might skip over data it was missing in the initial transfer. If it acquired that data in the time before the request was resumed, suddenly DoNotSendFirstBlocks could behave quite unpredictably, as now those first blocks include blocks that weren't there before. (I almost forgot this was the reason I pursued the DoNotSendCIDs initially) In practice, this isn't much of a problem in Filecoin, where SPs tend to have the same data they've always had (even if they are missing some of it). But it seems like an issue for IPFS or a Retrieval Market where people's data stores change all the time.

Both of these extensions have an additional problem: they require the server to reexecute the selector traversal up to the point of resumption, even if the provider is not intend to send the data.

It seems to me we need an extensions that:

  • can represent a resume point in the traversal compactly
  • can do it in a predictable way.
  • allows the provider to optimally skip as much traversal as is possible up to the point of resumption

I believe the simplest way to represent this is an IPLD path. Moreover, there's already a proposed addition to selector traversal to allow it to resume at a path -- ipld/go-ipld-prime#358

I'm not sure the above is the exact right solution, but I believe a ResumeAtPath extension is our best path forward.

@hannahhoward hannahhoward added the need/triage Needs initial labeling and prioritization label Feb 12, 2022
@willscott
Copy link
Collaborator

Resume at path seems potentially good.

I'll note that go-ipld-prime#358 does not fix the 'optimally skip as much traversal as is possible' in a way that's particularly better than what should be possible with the current 'DoNotSendFirstBlocks' option.

In both cases, additional state than that sent on the wire is needed. It is neither a sub-path, nor a count of blocks that is sufficient to efficiently re-calculate a point in the traversal.

In both cases, the ability to efficiently resume relies on the fact that it is a resumption - that there has recently been a request for the same traversal that has partially completed and for which the server still has cached state.
It's the application / use of that cached state that can provide efficient resumption.

As long as the server has a clean and pro-active way to signal if can honor the request for resumption, the choice of number of blocks vs path doesn't matter too much.

@rvagg
Copy link
Member

rvagg commented Feb 14, 2022

^ Agreed, but maybe this is just an incremental improvement over current state that solves primarily for the second of the criteria - predictability ("determinism"). Sending a path is less efficient than a block count, but much more efficient than a CID list. So this seems like a reasonable incremental improvement.

Perhaps there's another iteration in the future where state is saved in some kind of cache that can allow for actually efficient resumption, without exposing the protocol to a big DoS risk. One of the challenges with this might be that the resumption point that the client wants may be different to the cached state that the server has (because that's the nature of unexpected network disconnects). It might be that we have an efficient resumption from an agreed upon pause point, and inefficient resumption when we have a mismatch in shared state. But that could just be an internal concern for the sender - if it's asked to resume at a path that doesn't match the state that it's cached, then it has to restart the traversal using ipld/go-ipld-prime#358.

@hannahhoward
Copy link
Collaborator Author

I'll note that go-ipld-prime#358 does not fix the 'optimally skip as much traversal as is possible' in a way that's particularly better than what should be possible with the current 'DoNotSendFirstBlocks' option.

I do not believe this to be the case.

First to clarify:

In both cases, additional state than that sent on the wire is needed. It is neither a sub-path, nor a count of blocks that is sufficient to efficiently re-calculate a point in the traversal.
In both cases, the ability to efficiently resume relies on the fact that it is a resumption - that there has recently been a request for the same traversal that has partially completed and for which the server still has cached state.
It's the application / use of that cached state that can provide efficient resumption.

This is not the case. Resuming in its current form operates by cancelling a request and sending a new one. The reason we have these extensions at all is to allow NEW GraphSync requests to start a specific point.

DoNotSendFirstBlocks operates by starting a selector traversal on the responder from scratch. It loads every block in the traversal from disk, but simply avoids sending the blocks over the wire until it reaches a specific point in the traversal.

ipld/go-ipld-prime#358 similarly is designed to start a traversal from scratch. However, unlike a block based approach, starting from a path allows us to skip over entire portions of the traversal at each tree depth. At each depth level, you only load those links which come after the corresponding segment of the StartAtPath. This should allow you to skip not just sending blocks over the wire, but execution of entire segments of the selector traversal (i.e. not loading them from disk).

I guess we can prove it out in benchmarking, but for a sufficiently deep tree, or even a UnixFS file, it should be wildly more efficient.

@hannahhoward
Copy link
Collaborator Author

One more note: resume at path does have a seperate efficiency problem for dags that traverse a particular block many times -- since it doesn't traverse the DAG in its entiriety, there's no way to deduplicate sending blocks that are encountered both before and after the startup point in the traversal. So ultimately, it's a tradeoff -- for a deeply interlinked DAG, starting from a path absent additional state is inefficient. But for a DAG with minimal duplication -- i.e. most unixfs files -- start from a path is optimal even without extra information.

@rvagg
Copy link
Member

rvagg commented Feb 17, 2022

At each depth level, you only load those links which come after the corresponding segment of the StartAtPath.

yeah ok, I buy that; you have to perform some, but not all of the existing selector in order to catch up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/triage Needs initial labeling and prioritization
Projects
None yet
Development

No branches or pull requests

3 participants