Decentralized NAT traversal using nodes in the network #307

robertkiel · 2021-03-16T17:20:50Z

As invited by @vasco-santos in libp2p/js-libp2p#870, I'm creating a more detailed overview of HoprConnect, an alternate transport module for js-libp2p handling churn and NAT traversal.

Disclaimer: parts of the documentation are taken from our own documentation and therefore slightly HOPR-flavoured.

Rationale

HoprConnect was created in the context of HOPR because js-libp2p-tcp and js-libp2p-webrtc-star either did not support automatic NAT traversal or required external resources such as STUN or TURN servers, and the final NAT traversal still required some tweaks in the client software. The idea was to encapsulate most of the logic required to tunnel through consumer routers in a transport module, so that we can work on higher-level mechanisms such as packet mixing.

Desired properties

tunnel through consumer routers, not just relay traffic over publicly available nodes
no dependence on external resources such as STUN servers or TURN servers
decentralized relay network: nodes can find neighbors in the network that help them tunnel through NATs
handling churn: nodes may join and leave
compliance with js-libp2p Transport API specifications

Addressing

HoprConnect uses two kinds of addresses:

Direct addresses:
/ip4/<IPv4 address>/tcp/<port>/p2p/<HOPR address>
/ip6/<IPv6 address>/tcp/<port>/p2p/<HOPR address>
A node is available at the given IP address and TCP port, and the dialer expects to reach the node that has the HOPR address given in the Multiaddr. Direct connections using UDP or QUIC are not yet supported.
Relay addresses:
/p2p/<HOPR relay address>/p2p-circuit/p2p/<HOPR address>
A node is available by first establishing a connection to the relay node as given by the first HOPR address. The relay is then asked to establish a connection to the second HOPR address.
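
To illustrate the two address shapes above, here is a minimal sketch that classifies an address string as direct or relayed. `classifyAddress` and `ParsedAddress` are hypothetical names made up for this example and are not part of HoprConnect or js-libp2p; the sketch works on plain strings and does not validate IP addresses or peer IDs.

```typescript
// Hypothetical helper for illustration only, not HoprConnect's actual parser.
type ParsedAddress =
  | { kind: "direct"; family: "ip4" | "ip6"; host: string; port: number; peer: string }
  | { kind: "relay"; relay: string; peer: string };

function classifyAddress(addr: string): ParsedAddress | undefined {
  const parts = addr.split("/");
  // A valid address starts with "/", so the first element is empty
  // and no other component may be empty.
  if (parts[0] !== "" || parts.slice(1).some((p) => p.length === 0)) return undefined;

  // Direct: /ip4|ip6/<address>/tcp/<port>/p2p/<peer>
  if (parts.length === 7) {
    const [, family, host, transport, portText, marker, peer] = parts;
    if ((family !== "ip4" && family !== "ip6") || transport !== "tcp" || marker !== "p2p") {
      return undefined;
    }
    if (!/^\d+$/.test(portText)) return undefined;
    const port = Number(portText);
    if (port > 65535) return undefined;
    return { kind: "direct", family, host, port, peer };
  }

  // Relay: /p2p/<relay>/p2p-circuit/p2p/<peer>
  if (parts.length === 6) {
    const [, first, relay, circuit, second, peer] = parts;
    if (first !== "p2p" || circuit !== "p2p-circuit" || second !== "p2p") return undefined;
    return { kind: "relay", relay, peer };
  }

  return undefined;
}
```

Anything that matches neither shape, for example an address using `udp`, yields `undefined`, which mirrors the statement that UDP and QUIC are not yet supported.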

Socket interfaces

HoprConnect binds to a TCPv4 and a UDPv4 socket. The ports can be, and are intended to be, the same.

UDPv4 is used exclusively for answering STUN requests, which means that every node using HoprConnect is also a potential STUN server.

TCPv4 is used for everything else.

IPv6 is foreseen but not yet implemented.

Connection setup

Assume that A intends to talk to B and that A knows a few direct addresses of B as well as some indirect addresses, also known as relay addresses.

A first tries to contact B using the direct addresses. If this works, the connection is kept. The attempt can fail, for example if B sits behind a NAT router.

Otherwise the node tries to connect to one of the given relays by using the indirect addresses. Once the connection to the relay is established, the node asks the relay to establish a connection to the final destination, B. The relay tries to contact the requested node and answers with OK if successful or FAIL_COULD_NOT_REACH_COUNTERPARTY if not accessible. If the destination could not be reached by the relay, the node tries a different relay and if there is none, the connection attempt is aborted.

Once the relayed connection is established, the node starts exchanging payload data with the destination. At the same time, A and B initiate a WebRTC connection and check whether they can connect directly.
If a direct connection is possible, the relayed connection is transparently replaced by a direct connection.
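
The fallback order described above can be sketched roughly as follows. This is an illustration under assumptions, not HoprConnect's code: the `Dialer` interface and the `connect` function are hypothetical, the I/O hooks are synchronous only to keep the example short (real dials are asynchronous), and the later WebRTC upgrade is left out. Only the two status codes are taken from the description.

```typescript
// Status codes named in the description above.
type RelayStatus = "OK" | "FAIL_COULD_NOT_REACH_COUNTERPARTY";

type ConnectResult =
  | { kind: "direct"; address: string }
  | { kind: "relayed"; relay: string }
  | { kind: "aborted" };

// Hypothetical I/O hooks, injected so the decision logic can be read in isolation.
interface Dialer {
  // Returns true if the direct TCP dial succeeded.
  dialDirect(address: string): boolean;
  // Asks the relay to reach the destination and returns the relay's answer.
  askRelay(relay: string, destination: string): RelayStatus;
}

function connect(
  destination: string,
  directAddresses: string[],
  relays: string[],
  dialer: Dialer
): ConnectResult {
  // 1. Try every known direct address first and keep the first that works.
  for (const address of directAddresses) {
    if (dialer.dialDirect(address)) return { kind: "direct", address };
  }
  // 2. Otherwise ask the known relays one after another.
  for (const relay of relays) {
    if (dialer.askRelay(relay, destination) === "OK") return { kind: "relayed", relay };
  }
  // 3. No relay could reach the destination: abort the attempt.
  return { kind: "aborted" };
}
```

A caller that receives `{ kind: "relayed" }` would then start the WebRTC negotiation over the relayed connection and swap in the direct connection if it succeeds.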

Reconnects

Reconnects of direct connections, such as TCP and WebRTC connections, are handled automatically and mostly transparently by the operating system and WebRTC.

For relayed connections, this needs to be handled explicitly because nodes do not get that kind of feedback from the other nodes automatically. More precisely, the node on one end of the relay stays unaware of what happens on the other end unless that information is actively forwarded.

HoprConnect implements this behavior by telling the sender of a message whether it has been successfully forwarded or not. If a message cannot be forwarded, the connection is paused until the node reconnects. Note that the relay does not cache messages; it just tells the sender to stop sending and rejects further messages.

The connection stays “half-open” until the node on the other side reconnects and thereby overwrites the existing connection. Once that happens, the relay injects a RECONNECT message into the message stream, notifying the other party about the necessity to restart the encryption layer.

Once the relayed connection is re-established, both nodes do exactly the same as when establishing a "normal" connection: they start a WebRTC instance at both ends of the connection, check whether they can connect directly, and transparently switch to a direct WebRTC connection if that is possible.
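
The relay-side behavior described in this section can be modelled with a small state machine. The sketch below is a simplification under assumptions: the class `RelayedChannel`, the feedback values `FORWARDED` and `NOT_FORWARDED`, and the in-memory `inbox` arrays are hypothetical stand-ins, while the `RECONNECT` message name comes from the description above.

```typescript
type Party = "A" | "B";
// Hypothetical feedback values returned to the sender.
type Feedback = "FORWARDED" | "NOT_FORWARDED";
const RECONNECT = "RECONNECT";

class RelayedChannel {
  private online: Record<Party, boolean> = { A: true, B: true };
  // Messages delivered to each party; stands in for the real outgoing stream.
  readonly inbox: Record<Party, string[]> = { A: [], B: [] };

  private other(party: Party): Party {
    return party === "A" ? "B" : "A";
  }

  // The relay notices that one side went away; the channel is now "half-open".
  disconnect(party: Party): void {
    this.online[party] = false;
  }

  // Forwards a message and tells the sender whether that worked.
  // Undeliverable messages are dropped, never cached.
  forward(from: Party, message: string): Feedback {
    const to = this.other(from);
    if (!this.online[to]) return "NOT_FORWARDED";
    this.inbox[to].push(message);
    return "FORWARDED";
  }

  // A party reconnects and overwrites its old connection. The relay injects
  // RECONNECT so the counterparty restarts its encryption layer.
  reconnect(party: Party): void {
    this.online[party] = true;
    const counterparty = this.other(party);
    if (this.online[counterparty]) this.inbox[counterparty].push(RECONNECT);
  }
}
```

A sender that receives `NOT_FORWARDED` pauses and keeps its own copy of the message if it needs to resend it, since the relay deliberately holds no buffer.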

Bootstrapping

Once a node is started, it first tries to detect its own public IPv4 address by using any node in the network to answer its STUN request.
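
As an illustration of what such a STUN exchange involves on the wire, the sketch below builds a Binding Request and decodes an IPv4 XOR-MAPPED-ADDRESS attribute value as defined by standard STUN (RFC 5389). It assumes the plain Binding exchange; whether HoprConnect's STUN library uses exactly these attributes is not stated here, and the function names are made up for this example.

```typescript
const MAGIC_COOKIE = 0x2112a442; // fixed value defined by RFC 5389

// Builds a 20-byte STUN Binding Request without attributes.
function buildBindingRequest(transactionId: Uint8Array): Uint8Array {
  if (transactionId.length !== 12) throw new Error("transaction ID must be 12 bytes");
  const msg = new Uint8Array(20);
  const view = new DataView(msg.buffer);
  view.setUint16(0, 0x0001); // message type: Binding Request
  view.setUint16(2, 0); // message length: no attributes follow
  view.setUint32(4, MAGIC_COOKIE); // magic cookie, big-endian
  msg.set(transactionId, 8); // 96-bit transaction ID
  return msg;
}

// Decodes the 8-byte value of an IPv4 XOR-MAPPED-ADDRESS attribute.
function decodeXorMappedAddressV4(value: Uint8Array): { address: string; port: number } {
  if (value.length !== 8 || value[1] !== 0x01) {
    throw new Error("not an IPv4 XOR-MAPPED-ADDRESS value");
  }
  const view = new DataView(value.buffer, value.byteOffset, value.byteLength);
  // The port is XORed with the upper 16 bits of the magic cookie.
  const port = view.getUint16(2) ^ (MAGIC_COOKIE >>> 16);
  // The address is XORed with the whole magic cookie.
  const raw = (view.getUint32(4) ^ MAGIC_COOKIE) >>> 0;
  const address = [raw >>> 24, (raw >>> 16) & 0xff, (raw >>> 8) & 0xff, raw & 0xff].join(".");
  return { address, port };
}
```

The node would send the request over UDP to any peer that answers STUN and read its own public address and port from the response.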

The following is WIP

Afterwards, it tries to connect to known relay nodes and announce to other nodes behind which nodes it is available.

WIP End

Comparison with other NAT traversal techniques

normal TCP: no NAT traversal
using UPnP or NAT-PMP: depends on the router and some routers don't understand it
relay everything: relay(s) become a single point of failure and subject to congestion, also bad in terms of privacy
webRTC-star: needs external signalling servers as well as external TURN servers, different processes and additional ports
using a mixture of the aforementioned solutions: needs a lot of extra work when building decentralized software on the client

Potential browser-to-browser extension

The relay code is kept pretty agnostic about where the connection comes from, which means that it can easily accept an HTTP(S) or even a WebSocket (Secure) stream and feed this stream into another stream on the other side of the relay. The missing part here is a browser implementation that establishes a relayed HTTP or WebSocket stream with one of the relay nodes and then transparently replaces it with a direct WebRTC connection if this is possible; otherwise it should keep the relayed connection.


mxinden · 2021-03-31T08:23:30Z

@robertkiel I am sorry for the delay here. I will follow up later today or tomorrow.

mxinden · 2021-04-01T09:51:32Z

Thanks for bearing with us and thanks for the detailed post above.

First off, providing a (decentralized) way for nodes behind NATs and firewalls to connect across platforms (browser, Node, Golang, Rust, ...) is something we are very much interested in and also working on today. Thus I am happy to see your proposal.

Questions

I have a couple of follow up questions:

Do I understand correctly that https://github.com/hoprnet/hopr-connect only supports Node, i.e. it cannot be run in a browser? If so, why use WebRTC instead of TCP-Noise-Yamux or QUIC?

> The relay code is kept pretty agnostic where the connection comes from

Do I understand correctly that you implemented your own relay protocol? If so, is there a specification of this protocol? Is there a particular reason why you didn't use the circuit relay v1 protocol? We are currently designing the circuit relay v2 protocol, which will be used as a relay in Project Flare.

Project Flare

Project Flare will allow non-browser to non-browser NAT hole punching on TCP and QUIC. It will use circuit relay v2 to relay the coordination protocol. The task of STUN is done via AutoNAT.

WebRTC

Project Flare as it is designed today won't work in browsers. One cannot control the TCP or UDP (for QUIC) sockets directly, and thus cannot require port reuse. One cannot directly connect to endpoints that are not protected by SSL. Requiring all non-browser nodes to offer SSL is hard, to say the least, and even then browser-to-browser won't work.

As far as I can tell the only way forward to fully support browsers is through WebRTC. WebRTC support for non-JS is in progress, e.g. see go-libp2p-webrtc-direct. In addition there is a spec proposal for WebRTC signaling.

Proposal for future steps

To deduplicate work and to avoid fragmenting the ecosystem, I think it is very much worth the effort to synchronize any future work. Off the top of my head I see three things:

Settle on a common WebRTC specification, see Add WebRTC Transport Spec #220.
Instead of a custom relay protocol, we should collaborate on circuit relay v2 (specification yet to be written). See also this discussion, which also proposes a shared signaling protocol across transports (WebRTC, TCP, ...).
Merge the AutoNAT and STUN effort, i.e. have AutoNAT support a subpart of the STUN specification to be used by nodes using WebRTC (browser).

robertkiel · 2021-04-01T13:38:59Z

Hi @mxinden ,

thanks for your reply!

Why use WebRTC?

First of all, it already exists, which is a big benefit because hole-punching is already solved and maintained by the Chrom(ium) team. The same goes for the encryption system DTLS and the RTP implementation. Also, the WebRTC signalling seems to follow some specification, and the detection of whether a direct connection is possible works quite well.

The only remaining issue was to feed the WebRTC instances with the right messages and transparently handle the TURN fallback in case we cannot connect directly ("WebRTC signalling fails"). This turned out to be quite tricky, especially when considering a decent degree of churn (nodes joining and leaving the network with the same or different IP addresses).

Another interesting point that became clear during the development is that WebRTC can be used from a browser to establish direct connections, hence there exists a potential way to have direct browser-to-browser connection after exchanging signalling messages over a different channel.

Custom relay implementation

HoprConnect indeed uses a custom relay connection. It turned out to be a bit inflexible to use js-libp2p's relay connection, as it is too much baked into js-libp2p and therefore a bit tricky to control in order to handle fallbacks and connection upgrades such as relayed connection -> direct WebRTC connection.

On the other hand, handling fallbacks and reconnects and WebRTC signalling messages made it necessary to inject certain status messages and prefixes to properly multiplex messages. But I'm sure that we can merge both efforts.

Project Flare

Sounds very interesting - the only downside is that neither Node.js nor (all modern) browsers support QUIC directly, so HoprConnect is using plain-old TCP connections to exchange messages. Nevertheless, Node.js seems to bring QUIC support soon; currently it is available behind a compile-time flag.

I also noticed that you are developing a custom NAT hole-punching solution, which I personally find quite challenging since NAT implementations seem to be quite inhomogeneous, which makes testing very hard.

Browser-to-Browser

Connections between two browsers indeed don't work without any signalling over a relayed connection to exchange hole-punching information. The way that I see is to use non-browser instances that listen to HTTP(S) streams and contact them from the browser using POST requests or WS(S) data connections to exchange data.

AutoNAT and STUN

The reason for embedding STUN is that WebRTC only supports standard STUN and thus requires a STUN server. HoprConnect realizes this by using a library that binds to a UDP socket and answers STUN requests, so STUN is not really part of the protocol.

mxinden · 2021-04-02T15:32:16Z

> HoprConnect indeed uses a custom relay connection. It turned out to be a bit inflexible to use js-libp2p's relay connection as it is too much baked into js-libp2p and therefore a bit tricky to control in order to handle fallbacks and connection upgrades such as relayed connection -> direct webrtc connection.

I cannot comment on the feasibility of using a shared relay implementation in JS, though I strongly believe that we should at least strive for on-the-wire compatibility both between JS implementations and all others (e.g. Golang, Rust, ...). We will likely have a first specification draft of circuit relay v2 in the upcoming weeks. I would very much appreciate your input on the draft to make sure it suits your implementation as well.

> I also noticed that you are developing a custom NAT hole-punching solution which I personally find quite challenging since NAT implementations seem to be quite inhomogeneous which makes testing very hard.

Correct. Though the test results we have today are very promising both via QUIC and TCP.

robertkiel · 2021-04-06T09:27:51Z

Let me summarize a bit:

| Component | HoprConnect | Project Flare |
| --- | --- | --- |
| Node-to-Node base communication | TCP | TCP or QUIC |
| NAT capability detection | STUN | AutoNAT + UPnP |
| Relay protocol | custom | to be specified, see libp2p/go-libp2p-circuit#125 |
| Hole-punching information exchange protocol | WebRTC + JSON | DCUtr |
| Hole-punching | WebRTC | ? |
| Connection fallback / upgrade handling | custom | ? |
| Encryption layer | DTLS over UDP | QUIC or TCP-Noise-Yamux |

During the implementation of HoprConnect, I've noticed that the relay / fallback / upgrade logic can be largely independent of how NAT traversal is done at the end of the day. The same holds for the node-to-node communication and capability detection.

I'd therefore suggest the following:

work together on the relay / fallback / upgrade logic and create a foundation that is compatible with other libp2p implementations
keep the relay / fallback / upgrade logic agnostic from concrete hole-punching solutions
solve browser-to-browser NAT traversal, most likely using WebRTC

vyzo · 2021-04-06T21:22:08Z

This matrix is wildly incorrect; autonat is only used to detect whether you are behind a NAT/firewall or not. It does not do NAT capability detection, hole punching, or holepunching coordination.
We have a separate protocol for the coordination, called DCUtr -- see #173

robertkiel · 2021-04-07T12:33:55Z

> This matrix is wildly incorrect; autonat is only used to detect whether you are behind a NAT/firewall or not. It does not do NAT capability detection, hole punching, or holepunching coordination.
> We have a separate protocol for the coordination, called DCUtr -- see #173

Good to know. I'm not that much into the libp2p ecosystem. Just updated the table accordingly. Could you name the other mistakes?

@vyzo maybe a DM could reduce some misunderstandings?

mxinden · 2021-04-08T16:07:05Z

@robertkiel I posted an extended version of your table above in #312. I would appreciate your input, especially in regards to HOPR connect.

I will draft a long-term vision sometime soon. That should help us deduplicate efforts and improve interoperability.

robertkiel mentioned this issue Mar 16, 2021

NAT Traversal with hopr-connect libp2p/js-libp2p#870

Closed

0xjjpa mentioned this issue Apr 29, 2021

Update hopr-connect to automatically relay whenever intermediate node has no direct connection to recipient hoprnet/hoprnet#1494

Closed

3 tasks
