
webrtc: test pion fixes for state change callbacks ordering #2732

Draft
wants to merge 8 commits into
base: master

sukunrt (Member) commented Mar 12, 2024

Fixes: #2614
This is a problem with pion/webrtc's OnConnectionStateChange handler: pion should wait for the handler to finish before delivering the next state change. Right now connection state notifications can get reordered, so the listener may think the connection has been established while the dialer thinks it hasn't been established yet.

The fix in pion/webrtc is here:
https://github.com/pion/webrtc/pull/2702/files

The same error exists in pion/ice:
https://github.com/pion/ice/pull/656/files
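
To illustrate the reordering described above, here is a minimal sketch. It is not pion's code: `State`, `dispatchAsync`, `dispatchSerial`, and `record` are hypothetical names made up for this example. It assumes the problem is that each notification runs without waiting for the previous handler to return, and contrasts that with a dispatcher that waits.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// State is a stand-in for a connection state (hypothetical, for illustration).
type State string

// dispatchAsync runs the handler for every state in its own goroutine and
// returns once all handlers have finished. Nothing orders the handler calls
// relative to each other, so a slow handler can be overtaken by a later one.
func dispatchAsync(states []State, handler func(State)) {
	var wg sync.WaitGroup
	for _, s := range states {
		wg.Add(1)
		go func(s State) {
			defer wg.Done()
			handler(s)
		}(s)
	}
	wg.Wait()
}

// dispatchSerial waits for each handler call to return before delivering
// the next state, so the handler always observes states in order.
func dispatchSerial(states []State, handler func(State)) {
	for _, s := range states {
		handler(s)
	}
}

// record returns a handler that logs the states it sees, plus a getter for
// the log. The handler is artificially slow for the state named by slow.
func record(slow State) (func(State), func() []State) {
	var mu sync.Mutex
	var seen []State
	handler := func(s State) {
		if s == slow {
			time.Sleep(50 * time.Millisecond) // simulate a slow handler
		}
		mu.Lock()
		seen = append(seen, s)
		mu.Unlock()
	}
	get := func() []State {
		mu.Lock()
		defer mu.Unlock()
		return append([]State(nil), seen...)
	}
	return handler, get
}

func main() {
	states := []State{"connecting", "connected"}

	h, get := record("connecting")
	dispatchAsync(states, h)
	fmt.Println("async: ", get()) // order is not guaranteed

	h, get = record("connecting")
	dispatchSerial(states, h)
	fmt.Println("serial:", get()) // always connecting, then connected
}
```

With the asynchronous dispatcher the slow "connecting" handler will usually finish after "connected", which is the kind of reordering that could leave the two peers disagreeing about the connection state.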

Update: the text below explains how I went about debugging this.
I think this fixes the problem. This might help debug the problem of tests taking 10m. My theory is that the ICE connection succeeds on the dialer and times out on the listener (20s), so the listener removes the ufrag from its UDP mux. After the keepalive time, the dialer sends a STUN request, which initiates a new connection on the listener. This will of course fail, because the dialer has exited ICE connection establishment mode and is simply sending STUN packets to keep the connection alive.

If we look at the failure here: https://github.com/libp2p/go-libp2p/actions/runs/8244362080/job/22546464639

The listener is stuck at:
2024-03-12T06:24:47.2310488Z /home/runner/work/go-libp2p/go-libp2p/p2p/transport/webrtc/listener.go:234 +0x886

	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case err := <-errC:
		if err != nil {
			return nil, fmt.Errorf("peer connection failed for ufrag: %s", candidate.Ufrag)
		}

This context expires after 20 seconds.
All the goroutines on the dialer side have been stuck for 9 minutes, and all the goroutines on the listener side have been stuck for less than a minute (which is why no duration is shown for them in the logs).
The logs also show that the dialer believes the ICE connection on its side has completed.

Unfortunately this is very difficult to test. This has not flaked in the 16 tries I have done after that.

We still have to explain why the dialer doesn't exit on DTLS failure. A DTLS failure should be reflected in the OnConnectionStateChange callback.

Comment on lines 78 to 82
 const (
 	DefaultDisconnectedTimeout = 20 * time.Second
 	DefaultFailedTimeout       = 30 * time.Second
-	DefaultKeepaliveTimeout    = 15 * time.Second
+	DefaultKeepaliveTimeout    = 5 * time.Second
 )
sukunrt (Member, Author) commented:

I think it's better to have 4 keepalives before the 20-second disconnected timeout than just 1 as before.

@sukunrt sukunrt force-pushed the webrtc-flaky-fix branch 8 times, most recently from 8bb55dc to 8ce861f, March 12, 2024 17:37
@sukunrt sukunrt marked this pull request as draft March 12, 2024 17:42
@sukunrt sukunrt force-pushed the webrtc-flaky-fix branch 7 times, most recently from 123ec1f to 4bd7d63, March 13, 2024 06:31
@sukunrt sukunrt force-pushed the webrtc-flaky-fix branch 10 times, most recently from 4a46d71 to 96620ae, March 13, 2024 11:15
@sukunrt sukunrt force-pushed the webrtc-flaky-fix branch 6 times, most recently from fa9396a to 4d76e74, March 13, 2024 13:19
@sukunrt sukunrt changed the title from "webrtc: make context timeout consistent with pion timeout" to "webrtc: test pion fixes for state change callbacks ordering" Mar 13, 2024

Successfully merging this pull request may close these issues.

webrtc: potential problem with the Pion state machine