basichost: don't wait for Identify #2551

Open · marten-seemann wants to merge 5 commits into master from dont-wait-for-identify

Conversation

@marten-seemann (Contributor) commented on Sep 3, 2023:

By not waiting for Identify to finish, we save 1 RTT during connection establishment. Establishing a QUIC connection now only takes a single round trip.

Multistream is (again) giving us a hard time here, mostly due to multiformats/go-multistream#20 and the lack of stream reset error codes. The situation is expected to improve once we introduce and roll out error codes (libp2p/specs#479).
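For readers of this thread, here is a rough sketch of the idea, assuming it lives in the basichost package; helper names such as newOptimisticStream and negotiateProtocol are invented for illustration and do not appear in the diff:

```go
// Sketch only, not the PR's code. The point: with exactly one requested
// protocol ID we no longer block on IdentifyWait, so the stream is usable
// right after the transport handshake and multistream runs lazily.
func (h *BasicHost) newStreamSketch(ctx context.Context, p peer.ID, pids ...protocol.ID) (network.Stream, error) {
	s, err := h.Network().NewStream(ctx, p)
	if err != nil {
		return nil, err
	}
	if len(pids) == 1 {
		// Optimistic path: assume the peer supports the protocol. If it
		// doesn't, the failure surfaces later as an error on Read/Write.
		if err := s.SetProtocol(pids[0]); err != nil {
			_ = s.Reset()
			return nil, err
		}
		return newOptimisticStream(s, pids[0]), nil // hypothetical wrapper
	}
	// With several candidate protocols we still wait for Identify so we can
	// pick the peer's preferred protocol, as before.
	select {
	case <-h.ids.IdentifyWait(s.Conn()):
	case <-ctx.Done():
		_ = s.Reset()
		return nil, ctx.Err()
	}
	return s, h.negotiateProtocol(ctx, s, pids...) // hypothetical helper
}
```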

marten-seemann force-pushed the dont-wait-for-identify branch 6 times, most recently from 43c80ce to 45211f8 on September 6, 2023.
@marten-seemann (Contributor, Author):

I tested this manually by setting the RTT to 100ms (sudo tc qdisc add dev lo root netem delay 50ms) and instrumenting TestPing (in p2p/test/transport) to output the duration. It came down from 3 RTTs to 2 RTTs. Unfortunately, we don't have rigorous tests for this (libp2p/test-plans#52).
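Roughly, the instrumentation could look like this (a sketch, not the actual change to TestPing; h1, h2, ctx, and t come from the test harness, and the usual go-libp2p peer and ping imports are assumed):

```go
// Time connection setup plus the first ping over the delayed loopback link.
start := time.Now()
if err := h1.Connect(ctx, peer.AddrInfo{ID: h2.ID(), Addrs: h2.Addrs()}); err != nil {
	t.Fatal(err)
}
res := <-ping.Ping(ctx, h1, h2.ID())
if res.Error != nil {
	t.Fatal(res.Error)
}
t.Logf("connect + first ping took %v", time.Since(start))
```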

marten-seemann marked this pull request as ready for review on September 6, 2023 at 08:42.
@Stebalien (Member):

Error codes aren't relevant here; the issue is that multistream supports multiple rounds of negotiation and doesn't have a clear "success" message.

I'm not going to block this, but:

  1. This is unsound and will always be unsound unless multistream is entirely replaced.
  2. We need to do very careful testing of upstream applications, as we've been assuming this behavior for quite a while. For example, in the past, successfully opening a stream to a peer on some protocol generally meant that the peer supported said protocol. Now, opening a stream with exactly one protocol will always appear to succeed instantly, regardless of what the peer supports (see the sketch below).
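To make the second point concrete, a hedged illustration of what application code sees; the caller function, protocol name, and req are hypothetical, and the standard go-libp2p / go-multistream imports are assumed:

```go
// Hypothetical caller code. Before this PR, an error from NewStream with a
// single protocol ID usually meant the peer doesn't speak /myapp/1.0.0.
// With optimistic negotiation the call succeeds immediately and the
// rejection only shows up on the first Read/Write.
func sendRequest(ctx context.Context, h host.Host, p peer.ID, req []byte) error {
	s, err := h.NewStream(ctx, p, "/myapp/1.0.0")
	if err != nil {
		return err // no longer a reliable "protocol not supported" signal
	}
	defer s.Close()
	if _, err := s.Write(req); err != nil {
		var notSupported msmux.ErrNotSupported[protocol.ID]
		if errors.As(err, &notSupported) {
			// This is where "peer doesn't support the protocol" now surfaces.
		}
		return err
	}
	return nil
}
```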

@marten-seemann (Contributor, Author):

> 1. This is unsound and will always be unsound unless multistream is entirely replaced.

Agreed.

> Error codes aren't relevant here; the issue is that multistream supports multiple rounds of negotiation and doesn't have a clear "success" message.

They would help us return the correct error here without having to guess what a stream reset means, as this PR currently does; that logic feels very brittle. With error codes, we'd have a dedicated code for "protocol not supported", so you could immediately tell apart a stream reset initiated by multistream from one initiated by the application protocol itself.

> 2. We need to do very careful testing of upstream applications, as we've been assuming this behavior for quite a while.

Any suggestions for how to do that, other than running a Kubo / Lotus node for a while and seeing if anything breaks?

@Stebalien (Member):

> They would help us return the correct error here without having to guess what a stream reset means, as this PR currently does; that logic feels very brittle. With error codes, we'd have a dedicated code for "protocol not supported", so you could immediately tell apart a stream reset initiated by multistream from one initiated by the application protocol itself.

Ah, I see. Yeah, you're right.

@Stebalien (Member):

> Any suggestions for how to do that, other than running a Kubo / Lotus node for a while and seeing if anything breaks?

Unfortunately, that's your best bet. Run Kubo, check the tests, make a PR to Lotus with the change, and I'm happy to run it on my node for a while.

@sukunrt (Member) left a comment:

I will review closely later but I think this case has not been handled:

> so the logic would be: if we receive a stream reset while multistream is running:
> if Identify claimed that the application protocol is supported, we return the stream reset error
> otherwise: we replace that error with the multistream ErrNotSupported

from: https://filecoinproject.slack.com/archives/C03K82MU486/p1693817331703339?thread_ts=1693737018.663209&cid=C03K82MU486

If Identify has completed, we need to return the reset error and not "protocol not supported".
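A minimal sketch of that rule; the helper and its identifyDone / peerSupportsProto parameters are hypothetical stand-ins for whatever state the host actually tracks:

```go
import (
	"errors"

	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/protocol"
	msmux "github.com/multiformats/go-multistream"
)

// translateResetErr is a hypothetical helper, not the PR's code: only map a
// stream reset to multistream's "not supported" when Identify hasn't already
// told us the peer speaks the protocol.
func translateResetErr(err error, identifyDone, peerSupportsProto bool, proto protocol.ID) error {
	if err == nil || !errors.Is(err, network.ErrReset) {
		return err
	}
	if identifyDone && peerSupportsProto {
		// The peer advertises the protocol, so the reset most likely came from
		// the application protocol itself: surface it unchanged.
		return err
	}
	// Otherwise assume multistream rejected the protocol.
	return msmux.ErrNotSupported[protocol.ID]{Protos: []protocol.ID{proto}}
}
```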

@marten-seemann (Contributor, Author):

Do we really need this complexity? I’d prefer just implementing the actual solution (error codes).

@sukunrt (Member) left a comment:

I think we will see the client side (the one using optimistic select) resetting the stream on protocols that exchange length-prefixed messages.
The server side of multistream-select will be waiting to receive the next protocol. The client will optimistically write <length><msg>; the server side reads this as a protocol proposal and replies with <na>, provided the message is less than 1 KB in size. Receiving this will cause the client to reset the stream.

// If pids contains only a single protocol, optimistically use that protocol (i.e. don't wait for
// multistream negotiation).
Member:

Should we add a summary of the algorithm to the documentation for the NewStream method?

select {
case <-h.ids.IdentifyWait(s.Conn()):
case <-ctx.Done():
_ = s.Reset()
Member:

NIT: Is just s.Reset() better?

var err error
pref, err = h.preferredProtocol(p, pids)
if err != nil {
_ = s.Reset()
Member:

NIT: Is just s.Reset() better?


calledRead atomic.Bool
Member:

Should we document why we need this? Maybe with a link to multiformats/go-multistream#20

Member:

Why do we need this streamWrapper?
All it does is wrap CloseWrite, which can be concurrent with Read and Write.
@MarcoPolo @Stebalien

if s.calledRead.Load() && errors.Is(err, network.ErrReset) {
return n, msmux.ErrNotSupported[protocol.ID]{Protos: []protocol.ID{s.Protocol()}}
Member:

I think this should be !s.calledRead.Load()

Member:

There is a race condition here, but I think it doesn't matter, because I'm not sure you can use streams in the way required to trigger it.

The race is:
goroutine 1 does a successful read and is about to do the CompareAndSwap.

Then goroutine 2 does a write and receives a stream reset for some reason (can it?).

Now goroutine 2 evaluates if s.calledRead.Load() && errors.Is(err, network.ErrReset) before goroutine 1 has done the CompareAndSwap.

@MarcoPolo (Contributor):

I don't see the race condition. I think this should be !s.calledRead.Load(). My assumption is that the goal here is for either Read or Write to return a msmux.ErrNotSupported on this specific type of error, and that it's okay if both return it. I don't think you can hit a race where neither returns msmux.ErrNotSupported, but you can hit one where both return it (which is okay).
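Putting the pieces of this thread together, a sketch of the wrapper with the "!calledRead" reading of the condition (sketch only, not the PR's actual streamWrapper):

```go
import (
	"errors"
	"sync/atomic"

	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/protocol"
	msmux "github.com/multiformats/go-multistream"
)

// optimisticStream wraps a stream opened without waiting for Identify. If the
// stream is reset before we ever read anything from the peer, the reset almost
// certainly came from multistream rejecting the protocol, so we translate it;
// once a read has succeeded, resets belong to the application protocol.
type optimisticStream struct {
	network.Stream
	calledRead atomic.Bool
}

func (s *optimisticStream) Read(b []byte) (int, error) {
	n, err := s.Stream.Read(b)
	if err == nil {
		s.calledRead.Store(true)
		return n, nil
	}
	return n, s.maybeNotSupported(err)
}

func (s *optimisticStream) Write(b []byte) (int, error) {
	n, err := s.Stream.Write(b)
	return n, s.maybeNotSupported(err)
}

func (s *optimisticStream) maybeNotSupported(err error) error {
	if !s.calledRead.Load() && errors.Is(err, network.ErrReset) {
		return msmux.ErrNotSupported[protocol.ID]{Protos: []protocol.ID{s.Protocol()}}
	}
	return err
}
```

With this reading of the flag, the worst the race above can do is make both Read and Write translate the same reset into ErrNotSupported, which, as noted, is okay.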

@sukunrt (Member), Mar 21, 2024:

I now think this whole logic should be part of multistream.lazyClientConn.

Contributor:

yeah, I agree.

@MarcoPolo (Contributor) left a comment:

I think this PR is missing a couple of things:

  1. A test to assert the expected number of round trips.
  2. I think it would be great if this new behavior were an option. Then we can default it to off and allow users to opt in. In time we can discuss flipping the default, ideally after major users have opted in (a sketch follows below).

I do think this is a good addition and will improve certain use cases (DHT off the top of my head). So it would be good to keep advancing this.
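For point 2, a rough sketch of what the opt-in could look like; the field name is invented and this is not an existing option:

```go
// In basichost's options (hypothetical field name, existing fields elided):
type HostOpts struct {
	// DisableIdentifyWaitForNewStream opts in to the behavior from this PR:
	// NewStream returns immediately when exactly one protocol ID is requested,
	// instead of blocking until Identify has completed. Defaults to false.
	DisableIdentifyWaitForNewStream bool
}

// Inside NewStream, the optimistic path would then be gated on the flag:
//
//	if h.disableIdentifyWaitForNewStream && len(pids) == 1 { /* optimistic path */ }
```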


@Stebalien (Member):

> I do think this is a good addition and will improve certain use cases (DHT off the top of my head). So it would be good to keep advancing this.

If, and only if, we stop using the "old" DHT protocol (which, honestly, we should). Unfortunately, it'll mean that upgrading the DHT protocol (leading to a migration period) will always reduce performance.

The correct solution (IMO) is to somehow communicate protocol information as early as possible but... eh.

@Stebalien (Member):

Oh, I see we have removed the "old" DHT protocol. Yeah, this could make a significant difference.

MarcoPolo mentioned this pull request on Apr 22, 2024.