
query: implement likelihood sampling based peer scheduling and add work stealing across peers #292

Open
Roasbeef opened this issue Dec 11, 2023 · 2 comments

Comments

@Roasbeef
Member

This issue is spun off from some findings we've seen with neutrino over time, either when we have a lot of active peers, or when some of the peers are connected over tor. This'll serve as a master tracking issue for a number of improvements we can make to this area of the code base. I also consider much of this to be preparatory work before we begin parallel block download in btcd in earnest. This was spun off from discussions in lightningnetwork/lnd#8250.

Sampling Based Peer Ranking

Today the default peer ranking algorithm just re-orders the peers, whose scores are kept in a map:

    // peerRanking is a struct that keeps history of peer's previous query success
    // rate, and uses that to prioritise which peers to give the next queries to.
    type peerRanking struct {
        // rank keeps track of the current set of peers and their score. A
        // lower score is better.
        rank map[string]uint64
    }

Each time we go to schedule something new, we use the ranking to sort a slice of the peers, then attempt to give the work to the next free peer: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L226-L256.

As peers start to fill up, we'll allocate a task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.

To get around this, we should look at using a sampling-based method. So we'll map the rankings into a probability that the peer will respond properly instead, and sample from that distribution when handing out work.
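As a rough sketch of that direction (all names here are illustrative, none of this is the existing neutrino API), the scores could be inverted into weights and a peer drawn proportionally:

    package sketch

    import "math/rand"

    // samplePeer draws one peer at random, weighted by the inverse of its
    // score (lower score = better = more likely to be picked).
    func samplePeer(rank map[string]uint64) string {
        var total float64
        for _, score := range rank {
            total += 1.0 / float64(score+1) // +1 guards against a zero score.
        }

        // Draw a point in [0, total) and walk the cumulative weights.
        r := rand.Float64() * total
        var last string
        for peer, score := range rank {
            w := 1.0 / float64(score+1)
            if r < w {
                return peer
            }
            r -= w
            last = peer
        }

        // Fallback for floating point rounding; "" only if rank is empty.
        return last
    }

Lower scores dominate the distribution, but slow peers still get the occasional query, so their scores can recover if their connectivity improves.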

Dynamic Ping Tuned Timeouts

Today we have a hard-coded 2 second timeout for all requests. This works in settings with very good network connections, but tor peers may be too slow to respond within that window.

Instead, we should use the ping latency to tune the timeout for a given peer, based on the extrapolated RTT. This can also be fed into the peer ranking itself. We should also refresh this metric in the background (as network jitter can change perceived latency, especially on tor).
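A sketch of one way this could work (constants and names are placeholders, not the actual neutrino API): keep an exponentially weighted moving average of each peer's ping RTT and derive the timeout as a clamped multiple of it:

    package sketch

    import "time"

    const (
        rttAlpha   = 0.2              // EWMA smoothing factor.
        timeoutMul = 3                // Timeout is a multiple of the smoothed RTT.
        minTimeout = 2 * time.Second  // Floor for well-connected peers.
        maxTimeout = 30 * time.Second // Ceiling for very slow (e.g. tor) peers.
    )

    type peerLatency struct {
        smoothedRTT time.Duration
    }

    // observePing folds a new ping sample into the moving average. Calling
    // this periodically in the background tracks jitter over time.
    func (p *peerLatency) observePing(rtt time.Duration) {
        if p.smoothedRTT == 0 {
            p.smoothedRTT = rtt
            return
        }
        p.smoothedRTT = time.Duration(
            rttAlpha*float64(rtt) + (1-rttAlpha)*float64(p.smoothedRTT),
        )
    }

    // queryTimeout extrapolates a per-peer timeout from the smoothed RTT,
    // clamped to sane bounds.
    func (p *peerLatency) queryTimeout() time.Duration {
        t := p.smoothedRTT * timeoutMul
        if t < minTimeout {
            return minTimeout
        }
        if t > maxTimeout {
            return maxTimeout
        }
        return t
    }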

Work Stealing Workers

We should add work stealing to ensure that fast peers can dynamically take on more load, increasing overall throughput. With work stealing, a peer would check at the top of its loop whether another peer has work that has yet to be claimed, then add that work to its own queue. The Go scheduler itself uses work stealing to ensure all processors are utilized: https://rakyll.org/scheduler/.
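To sketch the shape of it (all types and names here are hypothetical): each worker drains its own queue first, and only when it runs dry does it try to take an unclaimed task from a sibling's queue:

    package sketch

    import "sync"

    // task is a placeholder for a unit of query work.
    type task struct {
        id int
    }

    type worker struct {
        mu    sync.Mutex
        queue []task
    }

    // popLocal takes the next task from the worker's own queue, if any.
    func (w *worker) popLocal() (task, bool) {
        w.mu.Lock()
        defer w.mu.Unlock()
        if len(w.queue) == 0 {
            return task{}, false
        }
        t := w.queue[0]
        w.queue = w.queue[1:]
        return t, true
    }

    // steal takes a task from the tail of another worker's queue.
    func (w *worker) steal() (task, bool) {
        w.mu.Lock()
        defer w.mu.Unlock()
        n := len(w.queue)
        if n == 0 {
            return task{}, false
        }
        t := w.queue[n-1]
        w.queue = w.queue[:n-1]
        return t, true
    }

    // run is the top of a worker's loop: local work first, then stealing.
    func run(self *worker, others []*worker, exec func(task)) {
        for {
            t, ok := self.popLocal()
            if !ok {
                for _, o := range others {
                    if t, ok = o.steal(); ok {
                        break
                    }
                }
            }
            if !ok {
                return // Nothing to do anywhere; a real worker would block here.
            }
            exec(t)
        }
    }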

Along the way, we should modify the work queue to add better type safety via type params.

@Chinwendu20
Contributor

As peers start to fill up, we'll allocate a task to the next peer in line. In the wild, we've seen that this ends up allocating tasks to slow peers.

I do not think so. Currently, tasks are assigned to a peer one at a time, and if the peer takes too long, it times out and the task is reassigned to another worker (peer). In my understanding, with the current setup there is really "no work to steal".

I think the dynamic ping timeout would be useful, though, in determining the most "realistic" timeout given the RTT.

So we'll map the rankings into a probability that the peer will respond properly instead.

Currently, peers are awarded a score, based on their history of responsiveness, that reduces or increases their likelihood of being assigned a task. Do you think that would suffice?

Along the way, we should modify the work queue to add better type safety via type params.

Care to throw more light on this one?

@Roasbeef
Member Author

I do not think so. Currently, tasks are assigned to a peer one at a time, and if the peer takes too long, it times out and the task is reassigned to another worker (peer).

So this is about more active queue management. That timeout is handled by the scheduler, rather than by a peer stealing the work directly. What we have right now is a sorted list, and we assign directly from that, without being aware of any active load.

Notice how the main loop can block when sending work to a peer if the queue is full:
https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L237-L258

Instead, it should recognize that the queue is full and allocate to another peer. We should move to a thread-safe queue instead of the current channel based mechanism.
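As a sketch of that direction (queryJob is a stand-in here, and TryPush is a hypothetical method rather than the existing API), a bounded, mutex-guarded queue whose push reports fullness instead of blocking lets the main loop fall through to the next peer:

    package sketch

    import "sync"

    // queryJob stands in for the work manager's internal job type.
    type queryJob struct{}

    // boundedQueue is a mutex-guarded queue with a fixed capacity.
    type boundedQueue struct {
        mu    sync.Mutex
        jobs  []*queryJob
        limit int
    }

    // TryPush enqueues without blocking. A false return tells the scheduler
    // this peer is saturated, so it can try the next candidate instead of
    // stalling on a full channel.
    func (q *boundedQueue) TryPush(j *queryJob) bool {
        q.mu.Lock()
        defer q.mu.Unlock()
        if len(q.jobs) >= q.limit {
            return false
        }
        q.jobs = append(q.jobs, j)
        return true
    }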

Check out that link I posted at the end, re how work stealing works in the Go scheduler.

Currently, peers are awarded a score based on their history of responsiveness that reduces or increases their likelihood of being assigned a task, do you think that would suffice?

Re-reading this, I think it can be even simpler (see the sketch after this list):

  • Select two peers that are capable of taking on the work.
  • Pick one of them at random to assign the work to.
  • Have the peers work steal in the background to smooth things out.
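A toy version of those three steps, minus the background stealing (assignQuery and canTake are made-up names):

    package sketch

    import "math/rand"

    // assignQuery filters to peers that can take the work, picks two distinct
    // candidates at random, and flips a coin between them.
    func assignQuery(peers []string, canTake func(string) bool) (string, bool) {
        var eligible []string
        for _, p := range peers {
            if canTake(p) {
                eligible = append(eligible, p)
            }
        }
        switch len(eligible) {
        case 0:
            return "", false
        case 1:
            return eligible[0], true
        }

        // Pick two distinct indices uniformly at random.
        i := rand.Intn(len(eligible))
        j := rand.Intn(len(eligible) - 1)
        if j >= i {
            j++
        }

        // Assign to one of the two at random; work stealing then smooths
        // out any imbalance in the background.
        if rand.Intn(2) == 0 {
            return eligible[i], true
        }
        return eligible[j], true
    }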

Along the way, we should modify the work queue to add better type safety via type params.

Get rid of all the type assertions: https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L215.
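For illustration (workQueue is a made-up name, not the existing type), a queue that's generic over its job type hands back concrete values, so the assertions simply disappear:

    // workQueue is generic over the job type it holds.
    type workQueue[T any] struct {
        items []T
    }

    func (q *workQueue[T]) push(v T) {
        q.items = append(q.items, v)
    }

    // pop returns the next item as a T directly, with no interface{} round
    // trip and no result.(*queryJob)-style assertion at the call site.
    func (q *workQueue[T]) pop() (T, bool) {
        var zero T
        if len(q.items) == 0 {
            return zero, false
        }
        v := q.items[0]
        q.items = q.items[1:]
        return v, true
    }

Usage then reads as q := &workQueue[*queryJob]{}, and q.pop() yields a *queryJob with the compiler enforcing the type.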
