basichost: don't wait for Identify #2551

Open · marten-seemann wants to merge 5 commits into master from dont-wait-for-identify

Conversation

@marten-seemann (Contributor) commented on Sep 3, 2023:

By not waiting for Identify to finish, we save 1 RTT during connection establishment. Establishing a QUIC connection now only takes a single round trip.

Multistream is (again) giving us a hard time here, mostly due to multiformats/go-multistream#20 and the lack of stream reset error codes. The situation is expected to improve once we introduce and roll out error codes (libp2p/specs#479).
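For readers of this thread, here is a rough sketch of the idea, assuming it lives in the basichost package; helper names such as newOptimisticStream and negotiateProtocol are invented for illustration and do not appear in the diff:

```go
// Sketch only, not the PR's code. The point: with exactly one requested
// protocol ID we no longer block on IdentifyWait, so the stream is usable
// right after the transport handshake and multistream runs lazily.
func (h *BasicHost) newStreamSketch(ctx context.Context, p peer.ID, pids ...protocol.ID) (network.Stream, error) {
	s, err := h.Network().NewStream(ctx, p)
	if err != nil {
		return nil, err
	}
	if len(pids) == 1 {
		// Optimistic path: assume the peer supports the protocol. If it
		// doesn't, the failure surfaces later as an error on Read/Write.
		if err := s.SetProtocol(pids[0]); err != nil {
			_ = s.Reset()
			return nil, err
		}
		return newOptimisticStream(s, pids[0]), nil // hypothetical wrapper
	}
	// With several candidate protocols we still wait for Identify so we can
	// pick the peer's preferred protocol, as before.
	select {
	case <-h.ids.IdentifyWait(s.Conn()):
	case <-ctx.Done():
		_ = s.Reset()
		return nil, ctx.Err()
	}
	return s, h.negotiateProtocol(ctx, s, pids...) // hypothetical helper
}
```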

marten-seemann force-pushed the dont-wait-for-identify branch 6 times, most recently from 43c80ce to 45211f8 on September 6, 2023.
@marten-seemann (Contributor, Author):

I tested this manually by setting the RTT to 100ms (sudo tc qdisc add dev lo root netem delay 50ms) and instrumenting TestPing (in p2p/test/transport) to output the duration. It came down from 3 RTTs to 2 RTTs. Unfortunately, we don't have rigorous tests for this (libp2p/test-plans#52).
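Roughly, the instrumentation could look like this (a sketch, not the actual change to TestPing; h1, h2, ctx, and t come from the test harness, and the usual go-libp2p peer and ping imports are assumed):

```go
// Time connection setup plus the first ping over the delayed loopback link.
start := time.Now()
if err := h1.Connect(ctx, peer.AddrInfo{ID: h2.ID(), Addrs: h2.Addrs()}); err != nil {
	t.Fatal(err)
}
res := <-ping.Ping(ctx, h1, h2.ID())
if res.Error != nil {
	t.Fatal(res.Error)
}
t.Logf("connect + first ping took %v", time.Since(start))
```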

marten-seemann marked this pull request as ready for review on September 6, 2023 at 08:42.
@Stebalien (Member):

Error codes aren't relevant here; the issue is that multistream supports multiple rounds of negotiation and doesn't have a clear "success" message.

I'm not going to block this, but:

  1. This is unsound and will always be unsound unless multistream is entirely replaced.
  2. We need to do very careful testing of upstream applications, as we've been assuming this behavior for quite a while. For example, in the past, successfully opening a stream to a peer on some protocol generally meant that the peer supported said protocol. Now, opening a stream with exactly one protocol will always appear to succeed instantly, regardless of what the peer supports (see the sketch below).
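To make the second point concrete, a hedged illustration of what application code sees; the caller function, protocol name, and req are hypothetical, and the standard go-libp2p / go-multistream imports are assumed:

```go
// Hypothetical caller code. Before this PR, an error from NewStream with a
// single protocol ID usually meant the peer doesn't speak /myapp/1.0.0.
// With optimistic negotiation the call succeeds immediately and the
// rejection only shows up on the first Read/Write.
func sendRequest(ctx context.Context, h host.Host, p peer.ID, req []byte) error {
	s, err := h.NewStream(ctx, p, "/myapp/1.0.0")
	if err != nil {
		return err // no longer a reliable "protocol not supported" signal
	}
	defer s.Close()
	if _, err := s.Write(req); err != nil {
		var notSupported msmux.ErrNotSupported[protocol.ID]
		if errors.As(err, &notSupported) {
			// This is where "peer doesn't support the protocol" now surfaces.
		}
		return err
	}
	return nil
}
```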

@marten-seemann (Contributor, Author):

> 1. This is unsound and will always be unsound unless multistream is entirely replaced.

Agreed.

> Error codes aren't relevant here; the issue is that multistream supports multiple rounds of negotiation and doesn't have a clear "success" message.

They would help us return the correct error here without having to guess what a stream reset means, as this PR currently does; that logic feels very brittle. With error codes, we'd have a dedicated code for "protocol not supported", so you could immediately tell apart a stream reset initiated by multistream from one initiated by the application protocol itself.

> 2. We need to do very careful testing of upstream applications, as we've been assuming this behavior for quite a while.

Any suggestions for how to do that, other than running a Kubo / Lotus node for a while and seeing if anything breaks?

@Stebalien (Member):

> They would help us return the correct error here without having to guess what a stream reset means, as this PR currently does; that logic feels very brittle. With error codes, we'd have a dedicated code for "protocol not supported", so you could immediately tell apart a stream reset initiated by multistream from one initiated by the application protocol itself.

Ah, I see. Yeah, you're right.

@Stebalien (Member):

> Any suggestions for how to do that, other than running a Kubo / Lotus node for a while and seeing if anything breaks?

Unfortunately, that's your best bet. Run Kubo, check the tests, make a PR to Lotus with the change, and I'm happy to run it on my node for a while.

@sukunrt (Member) left a comment:

I will review closely later but I think this case has not been handled:

> so the logic would be: if we receive a stream reset while multistream is running:
> if Identify claimed that the application protocol is supported, we return the stream reset error
> otherwise: we replace that error with the multistream ErrNotSupported

from: https://filecoinproject.slack.com/archives/C03K82MU486/p1693817331703339?thread_ts=1693737018.663209&cid=C03K82MU486

If Identify has completed, we need to return the reset error and not "protocol not supported".
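A minimal sketch of that rule; the helper and its identifyDone / peerSupportsProto parameters are hypothetical stand-ins for whatever state the host actually tracks:

```go
import (
	"errors"

	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/protocol"
	msmux "github.com/multiformats/go-multistream"
)

// translateResetErr is a hypothetical helper, not the PR's code: only map a
// stream reset to multistream's "not supported" when Identify hasn't already
// told us the peer speaks the protocol.
func translateResetErr(err error, identifyDone, peerSupportsProto bool, proto protocol.ID) error {
	if err == nil || !errors.Is(err, network.ErrReset) {
		return err
	}
	if identifyDone && peerSupportsProto {
		// The peer advertises the protocol, so the reset most likely came from
		// the application protocol itself: surface it unchanged.
		return err
	}
	// Otherwise assume multistream rejected the protocol.
	return msmux.ErrNotSupported[protocol.ID]{Protos: []protocol.ID{proto}}
}
```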

@marten-seemann (Contributor, Author):

Do we really need this complexity? I’d prefer just implementing the actual solution (error codes).

@sukunrt (Member) left a comment:

I think we will see the client side (the one using optimistic select) resetting the stream on protocols that exchange length-prefixed messages.
The server side of multistream-select will be waiting to receive the next protocol. The client will optimistically write <length><msg>; the server side reads this as a protocol proposal and replies with <na>, provided the message is less than 1 KB in size. Receiving this will cause the client to reset the stream.

// If pids contains only a single protocol, optimistically use that protocol (i.e. don't wait for
// multistream negotiation).
Member:

Should we add a summary of the algorithm to the documentation for the NewStream method?

select {
case <-h.ids.IdentifyWait(s.Conn()):
case <-ctx.Done():
_ = s.Reset()
Member:

NIT: Is just s.Reset() better?

var err error
pref, err = h.preferredProtocol(p, pids)
if err != nil {
_ = s.Reset()
Member:

NIT: Is just s.Reset() better?


calledRead atomic.Bool
Member:

Should we document why we need this? Maybe with a link to multiformats/go-multistream#20

Member:

Why do we need this streamWrapper?
All it does is wrap CloseWrite, which can be concurrent with Read and Write.
@MarcoPolo @Stebalien

if s.calledRead.Load() && errors.Is(err, network.ErrReset) {
return n, msmux.ErrNotSupported[protocol.ID]{Protos: []protocol.ID{s.Protocol()}}
Member:

I think this should be !s.calledRead.Load()

Member:

There is a race condition here, but I think it doesn't matter, because I'm not sure you can use streams in the way required to trigger it.

The race is:
goroutine 1 does a successful read and is about to do the CompareAndSwap.

Then goroutine 2 does a write and receives a stream reset for some reason (can it?).

Now goroutine 2 evaluates if s.calledRead.Load() && errors.Is(err, network.ErrReset) before goroutine 1 has done the CompareAndSwap.

@MarcoPolo (Contributor):

I don't see the race condition. I think this should be !s.calledRead.Load(). My assumption is that the goal here is for either Read or Write to return a msmux.ErrNotSupported on this specific type of error, and that it's okay if both return it. I don't think you can hit a race where neither returns msmux.ErrNotSupported, but you can hit one where both return it (which is okay).
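Putting the pieces of this thread together, a sketch of the wrapper with the "!calledRead" reading of the condition (sketch only, not the PR's actual streamWrapper):

```go
import (
	"errors"
	"sync/atomic"

	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/protocol"
	msmux "github.com/multiformats/go-multistream"
)

// optimisticStream wraps a stream opened without waiting for Identify. If the
// stream is reset before we ever read anything from the peer, the reset almost
// certainly came from multistream rejecting the protocol, so we translate it;
// once a read has succeeded, resets belong to the application protocol.
type optimisticStream struct {
	network.Stream
	calledRead atomic.Bool
}

func (s *optimisticStream) Read(b []byte) (int, error) {
	n, err := s.Stream.Read(b)
	if err == nil {
		s.calledRead.Store(true)
		return n, nil
	}
	return n, s.maybeNotSupported(err)
}

func (s *optimisticStream) Write(b []byte) (int, error) {
	n, err := s.Stream.Write(b)
	return n, s.maybeNotSupported(err)
}

func (s *optimisticStream) maybeNotSupported(err error) error {
	if !s.calledRead.Load() && errors.Is(err, network.ErrReset) {
		return msmux.ErrNotSupported[protocol.ID]{Protos: []protocol.ID{s.Protocol()}}
	}
	return err
}
```

With this reading of the flag, the worst the race above can do is make both Read and Write translate the same reset into ErrNotSupported, which, as noted, is okay.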

@sukunrt (Member), Mar 21, 2024:

I now think this whole logic should be part of multistream.lazyClientConn.

Contributor:

yeah, I agree.

@MarcoPolo (Contributor) left a comment:

I think this PR is missing a couple of things:

  1. A test to assert the expected number of round trips.
  2. I think it would be great if this new behavior were an option. Then we can default it to off and allow users to opt in. In time we can discuss flipping the default, ideally after major users have opted in (a sketch follows below).

I do think this is a good addition and will improve certain use cases (DHT off the top of my head). So it would be good to keep advancing this.
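For point 2, a rough sketch of what the opt-in could look like; the field name is invented and this is not an existing option:

```go
// In basichost's options (hypothetical field name, existing fields elided):
type HostOpts struct {
	// DisableIdentifyWaitForNewStream opts in to the behavior from this PR:
	// NewStream returns immediately when exactly one protocol ID is requested,
	// instead of blocking until Identify has completed. Defaults to false.
	DisableIdentifyWaitForNewStream bool
}

// Inside NewStream, the optimistic path would then be gated on the flag:
//
//	if h.disableIdentifyWaitForNewStream && len(pids) == 1 { /* optimistic path */ }
```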


@Stebalien (Member):

> I do think this is a good addition and will improve certain use cases (DHT off the top of my head). So it would be good to keep advancing this.

If, and only if, we stop using the "old" DHT protocol (which, honestly, we should). Unfortunately, it'll mean that upgrading the DHT protocol (leading to a migration period) will always reduce performance.

The correct solution (IMO) is to somehow communicate protocol information as early as possible but... eh.

@Stebalien (Member):

Oh, I see we have removed the "old" DHT protocol. Yeah, this could make a significant difference.

MarcoPolo mentioned this pull request on Apr 22, 2024.