This repository has been archived by the owner on May 26, 2022. It is now read-only.

[DRAFT] Quality Transport #162

Closed
Jorropo opened this issue Feb 25, 2020 · 5 comments


Jorropo commented Feb 25, 2020

Some transports are better to use than others (QUIC > TCP > circuit), but currently swarm doesn't care: it just uses the first one to show up.
That's why we need a way to tell swarm which one is better.
Note: when I say a transport is better, I'm not talking about the real quality of the transport (how well it was written or how buggy it is) but about how much a node should prefer it over other transports.

TL;DR

Interface change

Create QTransport, the same as the current Transport except that Dial returns a QCapableConn, plus a Score function that may estimate the Quality of a future connection.
Create QCapableConn, which embeds CapableConn and adds a Quality() uint32 function.

Notation algorithm

Quality() returns a uint32, lower is better. A connection with a score twice as low is not twice as good, it's just better, like a place on a leaderboard.
Quality() must be easy and really fast to calculate, and its result must never change.
Loopback and local network preference must be moved here:
If the conn is on a private network, Quality() must divide the score by 2^8 (shift right by 8).
If the conn is on the loopback, Quality() must divide the score by 2^16 (shift right by 16).

Stream migration

Stream migration is not done by swarm but by the applications themselves.
To do that, it's possible to register an OnBetter(func(time.Time)) on a network.Conn object, or directly in NewStream to avoid a race (registering after a better conn was already found).
The time in the callback represents when the hard close is going to happen.

Async Close

When a new, better connection is found, the old one is put in asyncClose mode. In this mode, after a short grace time, new streams can no longer be opened (in either direction), but already open streams keep working. After a grace time of 1 to 2 minutes, the conn is shut down.

Complete:

Interface change

Some new interfaces would be created in go-libp2p-core/transport (QTransport is a temporary name; that's also true for the other names):
QTransport embeds Transport except for Dial and Listen, which are the same but return a QCapableConn (a new interface). It also adds a new Score(ma.Multiaddr, peer.ID) Score function, which returns a Score struct:

// Score is used by a transport to return its expectations about a connection.
type Score struct {
	// Score of the future connection; must not change once the connection is
	// actually created.
	Quality uint32
	// True if the quality was actually calculable (it may only be calculable
	// with an open connection).
	// Note: if IsQuality == true swarm will be able to cancel the dial if a
	// better transport is found.
	IsQuality bool
	// Number of file descriptors expected to be open when dialing.
	Fd uint8
}

Again, QCapableConn embeds CapableConn with Quality() uint32 added (EDIT: actually no, Go doesn't allow circular type matching, so a Base object contains the shared functions; see the implementation in libp2p/go-libp2p-core#121).

Since Go doesn't support shadowing the methods of an embedded interface, there will actually be a TransportBase interface with all functions except Dial, and both Transport and QTransport will embed TransportBase. For QCapableConn it's not a problem because nothing needs to be shadowed.

This is not added to Transport itself, so we can easily embed a plain Transport in go-libp2p.Transport here (try a cast to QTransport; if it fails, embed it).
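The interface split and the cast-then-embed fallback could look roughly like the sketch below. It is only an illustration: Multiaddr, PeerID and CapableConn are local stand-ins for the real libp2p types, and worstQuality, legacyConn, legacyAdapter, AsQTransport and toyLegacy are hypothetical names that are not part of the proposal or of go-libp2p-core.

```go
package main

import "context"

// Stand-ins for ma.Multiaddr and peer.ID so the sketch compiles on its own.
type Multiaddr string
type PeerID string

// CapableConn stands in for transport.CapableConn (most methods omitted).
type CapableConn interface{ Close() error }

// QCapableConn adds a fixed score to a connection; lower is better.
type QCapableConn interface {
	CapableConn
	Quality() uint32
}

// Score mirrors the struct from the proposal.
type Score struct {
	Quality   uint32
	IsQuality bool
	Fd        uint8
}

// TransportBase holds what both transport flavours share (only CanDial here).
type TransportBase interface{ CanDial(addr Multiaddr) bool }

type Transport interface {
	TransportBase
	Dial(ctx context.Context, raddr Multiaddr, p PeerID) (CapableConn, error)
}

type QTransport interface {
	TransportBase
	Dial(ctx context.Context, raddr Multiaddr, p PeerID) (QCapableConn, error)
	Score(raddr Multiaddr, p PeerID) Score
}

// worstQuality is a hypothetical fallback for transports that cannot score.
const worstQuality = ^uint32(0)

// legacyConn gives a plain CapableConn the fallback quality.
type legacyConn struct{ CapableConn }

func (legacyConn) Quality() uint32 { return worstQuality }

// legacyAdapter embeds nothing but wraps a plain Transport as a QTransport.
type legacyAdapter struct{ t Transport }

func (a legacyAdapter) CanDial(addr Multiaddr) bool { return a.t.CanDial(addr) }

func (a legacyAdapter) Dial(ctx context.Context, raddr Multiaddr, p PeerID) (QCapableConn, error) {
	c, err := a.t.Dial(ctx, raddr, p)
	if err != nil {
		return nil, err
	}
	return legacyConn{c}, nil
}

// Score reports IsQuality == false: nothing is known before the dial completes.
func (legacyAdapter) Score(Multiaddr, PeerID) Score {
	return Score{Quality: worstQuality}
}

// AsQTransport tries the QTransport assertion first and falls back to wrapping.
func AsQTransport(t interface{}) (QTransport, bool) {
	switch v := t.(type) {
	case QTransport:
		return v, true
	case Transport:
		return legacyAdapter{v}, true
	}
	return nil, false
}

// nopConn and toyLegacy form a do-nothing legacy transport for trying it out.
type nopConn struct{}

func (nopConn) Close() error { return nil }

type toyLegacy struct{}

func (toyLegacy) CanDial(Multiaddr) bool { return true }

func (toyLegacy) Dial(context.Context, Multiaddr, PeerID) (CapableConn, error) {
	return nopConn{}, nil
}
```

Calling AsQTransport(toyLegacy{}) yields an adapter whose Score has IsQuality == false, so swarm would never cancel that dial early. Declaring Dial with two different return types in interfaces that embed each other does not compile in Go, which is why both flavours embed TransportBase instead.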

Notation algorithm

Uint32, lower is better. It's unsigned because a transport with N hops should be able to do something like this:
baseScore + underlying.Quality() * hops (multiplication with negative values doesn't work well for that).
So where should a transport be placed on the scale?
Take a look at this scale (note: a proto with a Quality twice as big is not twice as slow or twice as bad, it is just worse; think of it as a place on a leaderboard):

  • TCP, 2^31, multiplexed, stateful, continuous protocol (streams have to fight for a place in the underlying protocol)
  • QUIC, 2^30, async, stateful, discontinuous protocol (streams don't have to fight, but it takes time to open the connection and the streams)
  • Circuit, s.Conn().Quality() (where s is the stream used to connect to the router) + the number of hops * 8 (this assumes all routers are equal; there is really nothing better than pinging/monitoring to know that) + 2^16 (base circuit value).

It's also not necessary to follow that closely; something can be added if a proto adds some overhead:

  • WS: TCP score + 50
  • WSS: WS score + 150

The Quality() result must never change for a single conn, so the swarm implementation can safely assume the score never changes.

If the conn is on a private network, Quality() must divide the score by 2^8 (shift right by 8).
If the conn is on the loopback, Quality() must divide the score by 2^16 (shift right by 16).
The idea is for some transports (such as Bluetooth BLE) to implement this manually, which avoids creating an exception in swarm each time.

Two transports can have the same score; in this case the first transport to return will be used.
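The scale and the locality shifts above can be sketched in a few lines. The constant and function names (QualityTCP, AdjustForLocality, CircuitQuality and so on) are made up for this example, and the additions wrap around on uint32 overflow, which a real implementation would probably want to guard against.

```go
package main

import "net"

// Base scores taken from the scale in the proposal; lower is better.
const (
	QualityTCP  uint32 = 1 << 31
	QualityQUIC uint32 = 1 << 30
	QualityWS   uint32 = QualityTCP + 50 // WS: TCP score + 50
	QualityWSS  uint32 = QualityWS + 150 // WSS: WS score + 150

	circuitBase   uint32 = 1 << 16 // base circuit value
	circuitPerHop uint32 = 8       // cost added for each hop
)

// AdjustForLocality applies the loopback and private-network preference:
// shift right by 16 on the loopback, by 8 on a private network.
// net.IP.IsPrivate needs Go 1.17 or newer.
func AdjustForLocality(base uint32, remote net.IP) uint32 {
	switch {
	case remote.IsLoopback():
		return base >> 16
	case remote.IsPrivate():
		return base >> 8
	default:
		return base
	}
}

// CircuitQuality scores a relayed conn from the quality of the conn to the
// router, the number of hops and the base circuit value.
func CircuitQuality(routerConn, hops uint32) uint32 {
	return routerConn + hops*circuitPerHop + circuitBase
}
```

For instance, AdjustForLocality(QualityTCP, net.ParseIP("192.168.1.5")) gives 2^23, and a TCP conn on the loopback gives 2^15. With these exact constants, a one-hop circuit through a router reached over QUIC scores 2^30 + 8 + 2^16, which is lower than a direct TCP conn at 2^31, so the base circuit value may need tuning to keep the QUIC > TCP > circuit order.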

Why not a ping-based one?

I don't think we should implement a tester/monitor for each connection's quality and really pick the best one, because that would require rewriting a lot of the transports themselves (it could be done elsewhere, but I don't think that's a good solution).
A monitor would also have higher requirements than the subprocess needing the connection, because:

  • Some protos take some time to settle at a steady speed (TCP) after a Dial, due to NAT opening/punching.
  • The Internet is not perfect: packets don't take the shortest route and often not the same route, so we would need to monitor a few packets to get a usable result (and still not a good one).
  • It would need to run continuously on cellular networks due to the high number of connection jumps.
  • It would need another negotiation protocol (or an extension of one), because if the two peers don't get the same result and each chooses to keep a different connection, no connection would be possible at all.

Stream migration

Supporting quality adds a problem: we may have a stream, but that doesn't mean this stream is a good one, and a better conn may arrive afterwards. This is going to happen a lot when QUIC is pushed as the default UDP transport: high-speed connections are far more sensitive to ping jitter (±3 ms over 7 ms matters far more than ±10 ms over 250 ms), so TCP may return first even if it is worse.

Actually migrating a stream from one transport to another would require a new, complex abstraction layer. Instead, applications and subsystems will be able to subscribe to an event, either on the event bus, through a callback OnBetter(cb func(time.Time)) on the network.Conn, or as cb ...func(time.Time) in host.NewStream (technically that's a breaking change to host from the compiler's point of view, but I don't think that's a problem because applications using the current libp2p will still work).
Also, the callback is not given a pointer to the new conn: that avoids creating a stream on a conn that is already async closed because multiple transports returned at the same time. Just use NewStream in the callback.
The time in the callback represents when the connection is going to be hard closed.

I prefer not using the event bus because there is no need: this is a case where the listeners and emitters of an event all share a common object, the listeners obviously through s.Conn() and the emitters through the swarm's map storing activeDial.
The event could then be implemented only in swarm (and in host, but only to forward to swarm).
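A callback registry for OnBetter could be as small as the sketch below. betterNotifier, markBetter and demo are hypothetical names, and firing late registrations immediately is one possible way to handle the registration race; the proposal itself suggests passing the callback to NewStream for that.

```go
package main

import (
	"sync"
	"time"
)

// betterNotifier is a hypothetical helper a swarm conn could embed.
type betterNotifier struct {
	mu        sync.Mutex
	fired     bool
	hardClose time.Time
	callbacks []func(time.Time)
}

// OnBetter registers cb. If a better conn was already found, cb runs right
// away with the known hard-close time, so a late registration is not lost.
func (n *betterNotifier) OnBetter(cb func(hardClose time.Time)) {
	n.mu.Lock()
	if !n.fired {
		n.callbacks = append(n.callbacks, cb)
		n.mu.Unlock()
		return
	}
	at := n.hardClose
	n.mu.Unlock()
	cb(at)
}

// markBetter is what swarm would call once, after swapping in the better conn.
func (n *betterNotifier) markBetter(hardClose time.Time) {
	n.mu.Lock()
	if n.fired {
		n.mu.Unlock()
		return
	}
	n.fired, n.hardClose = true, hardClose
	pending := n.callbacks
	n.callbacks = nil
	n.mu.Unlock()
	// Run callbacks outside the lock so they can call back into the conn.
	for _, cb := range pending {
		cb(hardClose)
	}
}

// demo registers one callback before and one after the swap, then reports how
// many ran and whether both saw the same hard-close time.
func demo() (calls int, sameDeadline bool) {
	var n betterNotifier
	var seen []time.Time
	record := func(t time.Time) { seen = append(seen, t) }
	n.OnBetter(record)
	n.markBetter(time.Now().Add(90 * time.Second))
	n.OnBetter(record)
	return len(seen), len(seen) == 2 && seen[0].Equal(seen[1])
}
```

In an application, the callback body would typically call NewStream again and move its traffic over before the hard-close time passes.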

Async close and grace time

When a better transport is found we should stop using the old one, but outdated nodes and maybe other implementations will keep using it.
So after swapping the 2 Conn objects, swarm should start an asyncClose(). Basically, it puts the connection in a state where no more streams can be created (in either direction) but already open streams continue to work. When a Stream closes, it should check the number of open Streams on its Conn, and if it was the last one it can really Close the conn.
Then 2 grace times apply. The first one ends when we really stop accepting new streams (the idea is that if a stream-opening packet was traveling while we executed asyncClose, it has time to arrive; 3 to 5 seconds should be enough). The second one ends when we kill the connection even if some streams remain (maybe we don't want that; I think it should be an option, true by default, so users get the right behavior by default if they don't want to update their application to be able to reopen streams). Also, if the other peer is incompatible, the connection may be kept open until a real close is issued (and keep accepting streams), but still be closed if by luck all streams have been closed.
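The two grace times can be pictured as a small state machine. In this sketch drainConn and its methods are hypothetical, the clock is passed in explicitly to keep the logic easy to follow, and the 4 second and 90 second values are just picks from the ranges mentioned in this proposal; the option to disable the hard close and the incompatible-peer case are left out.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

var (
	errDraining = errors.New("conn is async closing: no new streams")
	errClosed   = errors.New("conn is closed")
)

// drainConn tracks only what asyncClose needs: stream count and deadlines.
type drainConn struct {
	mu          sync.Mutex
	acceptGrace time.Duration // first grace time: in-flight opens still accepted
	hardGrace   time.Duration // second grace time: conn killed even with streams
	draining    bool
	drainStart  time.Time
	streams     int
	closed      bool
}

func newDrainConn() *drainConn {
	return &drainConn{acceptGrace: 4 * time.Second, hardGrace: 90 * time.Second}
}

// AsyncClose marks the conn as superseded and returns the hard-close time,
// which is the value an OnBetter callback would receive.
func (c *drainConn) AsyncClose(now time.Time) time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.draining {
		c.draining, c.drainStart = true, now
	}
	return c.drainStart.Add(c.hardGrace)
}

// OpenStream accepts a new stream unless the first grace time has passed.
func (c *drainConn) OpenStream(now time.Time) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.expireLocked(now)
	if c.closed {
		return errClosed
	}
	if c.draining && now.Sub(c.drainStart) > c.acceptGrace {
		return errDraining
	}
	c.streams++
	return nil
}

// CloseStream really closes the conn when the last stream of a draining conn ends.
func (c *drainConn) CloseStream() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.streams > 0 {
		c.streams--
	}
	if c.draining && c.streams == 0 {
		c.closed = true
	}
}

// Expire applies the hard-close deadline and reports whether the conn is closed.
func (c *drainConn) Expire(now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.expireLocked(now)
	return c.closed
}

func (c *drainConn) expireLocked(now time.Time) {
	if c.draining && !now.Before(c.drainStart.Add(c.hardGrace)) {
		c.closed = true
	}
}
```

A real conn would drive Expire from a timer (for example time.AfterFunc) instead of being polled, and would apply the same checks to streams opened by the remote peer.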

Note:

Again, here it's important for the 2 peers to naturally agree on which transport to use, because otherwise we would have to send a message warning that we put a connection in asyncClose mode. Here that's not needed because the other peer will do the same (due to the transport opening).

Note:

This has already been discussed before, and this specifically is not by me alone; it was discussed in the weekly on 24/02/2020.

@Stebalien
Member

We don't need a new transport for this; we can just expose this information via the connection's Stat function.

@Jorropo
Author

Jorropo commented Mar 2, 2020

We still need a new transport for the QTransport.Score function (not yet in the draft; I found I needed it while implementing).
It takes the same call args as Transport.Dial (without the Context) and may return the future Quality of a connection opened with this raddr. This allows canceling a Dial if a better connection was found.
E.g. dialing a peer with Circuit, TCP, QUIC:
First Circuit returns, and NewStream starts with this connection.
Then QUIC returns: swarm changes its pointer to the *Conn so newer streams will use QUIC, then the TCP dial is canceled because QUIC is better, and finally an OnBetter event is triggered on the Circuit Conn.
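The Circuit, TCP, QUIC example can be replayed with a toy model of the swarm bookkeeping. Everything here is hypothetical (dialRace, pendingDial, the event strings), and ignoring a late conn that is not better than the current one is an assumption; the comment above does not cover that case.

```go
package main

// Score is trimmed to the two fields this sketch needs.
type Score struct {
	Quality   uint32
	IsQuality bool
}

// pendingDial is a dial still in flight; cancel would be a context.CancelFunc.
type pendingDial struct {
	name   string
	score  Score
	cancel func()
}

// dialRace keeps the conn used for new streams plus the dials still running.
type dialRace struct {
	current  string // "" until the first conn arrives
	currentQ uint32
	pending  []pendingDial
	Events   []string // what swarm did, in order
}

// Connected is called when the dial for transport name returns a conn.
func (r *dialRace) Connected(name string, quality uint32) {
	first := r.current == ""
	// On equal scores the first transport to return stays in use.
	better := first || quality < r.currentQ
	old := r.current
	if better {
		r.current, r.currentQ = name, quality
		r.Events = append(r.Events, "use "+name)
	} else {
		r.Events = append(r.Events, "ignore "+name)
	}
	kept := r.pending[:0]
	for _, p := range r.pending {
		switch {
		case p.name == name:
			// This dial just finished; drop it from the pending list.
		case better && p.score.IsQuality && p.score.Quality >= quality:
			// The announced score cannot beat the new conn: cancel the dial.
			p.cancel()
			r.Events = append(r.Events, "cancel "+p.name)
		default:
			kept = append(kept, p)
		}
	}
	r.pending = kept
	if better && !first {
		r.Events = append(r.Events, "OnBetter "+old)
	}
}
```

Assuming the Circuit conn goes through a router reached over TCP (score 2^31 + 8 + 2^16), calling Connected("circuit", ...) and then Connected("quic", 1<<30) records the events use circuit, use quic, cancel tcp, OnBetter circuit, matching the order described above.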

@Stebalien
Member

We can either add a function to the main transport or, better yet, use a type assertion to check for it.

@Jorropo
Author

Jorropo commented Mar 2, 2020

Yes, that's what I do for the transport. I'm going to add Quality to the connection stats.

@Jorropo
Author

Jorropo commented Mar 2, 2020

@Stebalien wrote: "We don't need a new transport for this; we can just expose this information via the connection's Stat function."

We did in fact need it: the goal is not for apps to get the Quality from their network.Stream but for swarm to get the quality from the transport. Except through OnBetter, apps will never interact with the quality.

@Jorropo Jorropo closed this as completed May 13, 2022