GSO severely degrades connection performance #4394

Closed
arashpayan opened this issue Mar 28, 2024 · 11 comments

@arashpayan
Contributor

I found a bug in an environment where all communication between machines was taking hundreds of thousands of times longer than expected. After some investigation, I created a small program to reproduce the issue (https://github.com/cloudsoda/quic-go-gso-bug) and determined that the regression was introduced in v0.38.0 (the GSO release). The same program compiled with quic-go v0.37.6 works just fine. I was able to confirm it's GSO-related because with any release from v0.38.0 onward, the program works fine if you set QUIC_GO_DISABLE_GSO=true.

To be clear, this doesn't happen everywhere. Here's the config of the machines in the environment where it does occur.

Configuration:
Distribution: Red Hat Enterprise Linux
Version: 8.1
Kernel: 4.18.0-147.el8.x86_64
CPU: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz

Reproduction steps

  1. Clone the project at https://github.com/cloudsoda/quic-go-gso-bug
  2. Run the build.sh script to generate a statically linked binary named flubber.
  3. Run the server on a machine configured as described above: ./flubber server
  4. Run the client on another machine to connect to the server: ./flubber client ip.address.of.server:4080
  5. From either the server or the client, hit enter to send a 'ping' to the other side and receive a 'pong' back. The roundtrip time will be printed.

Expected
On machines plugged into a switch right next to each other, you should see a roundtrip time on the order of hundreds of microseconds.

Actual
The ping round trip takes anywhere from many seconds to many minutes. Below is sample output from two machines, each running the program compiled with quic-go at the HEAD of master, d540f545b0b70217220eb0fbd5278ece436a7a20.

The server

# ./flubber server
Listening on port 4080…

Sending ping…
18.596644184s

Sending ping…
2m43.308296539s

The client

# ./flubber client 10.225.7.118:4080
Connected to 10.225.7.118:4080

Sending ping…
7.556763878s

Sending ping…
5m42.335324955s

If the program is started with QUIC_GO_DISABLE_GSO=true on just one side of the connection, the ping times are drastically reduced to ~30ms. That's still slow, but better. If both sides are started with QUIC_GO_DISABLE_GSO=true, the ping times are in the expected range of hundreds of microseconds.
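
For what it's worth, the kill switch is just an environment variable read at startup. A plausible sketch of such a check is shown below (illustrative only; quic-go's actual parsing may differ, and the function name gsoDisabledByEnv is made up):

package gsosketch

import (
	"os"
	"strconv"
)

// gsoDisabledByEnv reports whether the QUIC_GO_DISABLE_GSO switch is set to a
// true-ish value ("true", "1", ...). Only the general idea is shown here.
func gsoDisabledByEnv() bool {
	disabled, err := strconv.ParseBool(os.Getenv("QUIC_GO_DISABLE_GSO"))
	return err == nil && disabled
}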

NOTE
I wasn't sure if qlogs would be helpful in this case, but if they would be, I am happy to generate them.

Thank you for your time.

@marten-seemann
Member

Minutes? That's incredibly long! How many pings per second are you sending?
It would be interesting to figure out where the delay comes from. Does the syscall block that long?

I can't really think of why, but there might also be a problem in quic-go's GSO code path:

func (s *connection) sendPacketsWithGSO(now time.Time) error {

We use a different code path to make better use of batching. Maybe there's a subtle bug somewhere in there?
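
Roughly speaking, the GSO path coalesces several equally sized packets into one buffer and passes the kernel a segment size, so a single send covers the whole batch. A minimal, self-contained sketch of that layout rule (illustrative only, not quic-go's actual implementation; coalesceForGSO is a made-up name):

package gsosketch

// coalesceForGSO packs packets back to back for a GSO send: every segment
// except the last must be exactly segSize bytes, because the kernel splits
// the buffer at segSize boundaries into individual UDP datagrams.
func coalesceForGSO(packets [][]byte, segSize int) ([]byte, bool) {
	buf := make([]byte, 0, len(packets)*segSize)
	for i, p := range packets {
		last := i == len(packets)-1
		if len(p) > segSize || (!last && len(p) != segSize) {
			return nil, false // would violate the GSO segment layout
		}
		buf = append(buf, p...)
	}
	return buf, true
}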

A qlog might be helpful, depending on where the problem lies. It would definitely be a good starting point. Would you mind recording one and posting it here?

@arashpayan
Contributor Author

When I hit 'enter', it opens a single stream and writes the bytes []byte("ping") onto the stream. The other side accepts the stream, waits for the 'ping' bytes and then sends []byte("pong") back. Those minutes reflect how long it took to send that single ping and receive that single pong back in response. It's crazy to me too. We're talking about 4 bytes + overhead each way. It should be nothing.
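
For reference, the ping side looks roughly like the following sketch (not the flubber source; it assumes the quic.Connection API of recent quic-go versions, and the ping helper is just illustrative):

package pingsketch

import (
	"context"
	"io"
	"time"

	"github.com/quic-go/quic-go"
)

// ping opens a fresh stream, sends the 4-byte "ping", waits for the 4-byte
// "pong", and returns the measured round-trip time.
func ping(ctx context.Context, conn quic.Connection) (time.Duration, error) {
	start := time.Now()
	str, err := conn.OpenStreamSync(ctx)
	if err != nil {
		return 0, err
	}
	defer str.Close()
	if _, err := str.Write([]byte("ping")); err != nil {
		return 0, err
	}
	pong := make([]byte, 4)
	if _, err := io.ReadFull(str, pong); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}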

I'll gather logs and share them here shortly.

@arashpayan
Contributor Author

I ran the program to capture qlogs. Although one side is technically the server that listened for the incoming connection, they're really just two peers.

I initiated two pings from the "server"

# ./flubber server
Listening on port 4080…

Sending ping…
22.683098082s

Sending ping…
3m56.770174461s

I also initiated two pings from the "client"

# ./flubber client 10.225.7.119:4080
Connected to 10.225.7.119:4080

Sending ping…
3.48358386s

Sending ping…
55.640109023s

Here are the logs:
https://soda.link/a25b43caaf7ffda56b9884b58bac7d90ace892_client.qlog
https://soda.link/a25b43caaf7ffda56b9884b58bac7d90ace892_server.qlog

Unfortunately, my SSH connection to the two machines timed out while I walked away from the computer. Hopefully that didn't introduce garbage into the end of the logs.

I'll try to instrument the code tomorrow to see which functions the program is spending its time in. If you have any other things you want me to try, please don't hesitate to ask.

@marten-seemann
Member

If you load the qlogs into qvis, you can see that there's an absolutely massive amount of packet loss happening. No wonder things are slow!
[qvis screenshot showing the heavy packet loss]

Now the question is what causes this packet loss. It would probably be helpful if you could capture a tcpdump on both machines, so we can see which packets are actually going over the wire. It might also be interesting to add some logging in writePacket, so we can see if the function parameters actually make sense:

quic-go/send_conn.go

Lines 85 to 92 in d540f54

func (c *sconn) writePacket(p []byte, addr net.Addr, oob []byte, gsoSize uint16, ecn protocol.ECN) error {
	_, err := c.WritePacket(p, addr, oob, gsoSize, ecn)
	if err != nil && !c.wroteFirstPacket && isPermissionError(err) {
		_, err = c.WritePacket(p, addr, oob, gsoSize, ecn)
	}
	c.wroteFirstPacket = true
	return err
}

@arashpayan
Contributor Author

Hopefully I captured enough of what you needed.

I modified writePacket to look like:

func (c *sconn) writePacket(p []byte, addr net.Addr, oob []byte, gsoSize uint16, ecn protocol.ECN) error {
	slog.Info("writePacket", "hexP", hex.EncodeToString(p), "addr", addr.String(), "hexOOB", hex.EncodeToString(oob), "gsoSize", gsoSize, "ecn", ecn)
	_, err := c.WritePacket(p, addr, oob, gsoSize, ecn)
	if err != nil && !c.wroteFirstPacket && isPermissionError(err) {
		slog.Info("writePacket error path", "err", err)
		_, err = c.WritePacket(p, addr, oob, gsoSize, ecn)
	}
	c.wroteFirstPacket = true
	return err
}

Then I re-ran the test with tcpdump running on each machine, in addition to setting $QLOGDIR.

tcpdump on the client:

# tcpdump -s 0 -w tcpdump-client.pcap dst port 4080 and udp
dropped privs to tcpdump
tcpdump: listening on ens3f0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C33 packets captured
33 packets received by filter
0 packets dropped by kernel

And its pcap file: https://soda.link/tcpdump-client.pcap
writePacket logs: https://soda.link/flubber_client.log
qlog: https://soda.link/6df2767d09d4b5d41ddb895a_client.qlog

tcpdump on the server

# tcpdump -s 0 -w tcpdump-server.pcap src port 4080 and udp
dropped privs to tcpdump
tcpdump: listening on ens3f0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C29 packets captured
29 packets received by filter
0 packets dropped by kernel

And its pcap file: https://soda.link/tcpdump-server.pcap
writePacket logs: https://soda.link/flubber_client.log
qlog: https://soda.link/6df2767d09d4b5d41ddb895a_server.qlog

@marten-seemann
Member

@arashpayan Sorry for the late response. I'm busy preparing the v0.43 release. I'll take a look at this once the release is out, hopefully next week.

@bt90
Contributor

bt90 commented Apr 27, 2024

Kernel: 4.18.0-147.el8.x86_64

Fixed by #4456, I guess?

@marten-seemann
Member

Possibly. And 4.18 was one of the kernel versions that was buggy, if I remember correctly. @arashpayan Could you rerun your experiment on a more recent (>= 5) kernel version?
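
(For reference, a kernel-version gate along the lines of #4456 might look roughly like the sketch below; the exact threshold and parsing in the actual PR may differ, and kernelMajorVersion / gsoLooksSafe are made-up names.)

package kernelcheck

import (
	"bytes"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// kernelMajorVersion extracts the major version from uname's release string,
// e.g. "4.18.0-147.el8.x86_64" -> 4. Field types assume Linux builds of x/sys.
func kernelMajorVersion() (int, error) {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return 0, err
	}
	release := string(bytes.TrimRight(uts.Release[:], "\x00"))
	return strconv.Atoi(strings.SplitN(release, ".", 2)[0])
}

// gsoLooksSafe only allows GSO on kernel 5.x and newer (hypothetical cutoff).
func gsoLooksSafe() bool {
	major, err := kernelMajorVersion()
	return err == nil && major >= 5
}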

@bt90
Contributor

bt90 commented Apr 27, 2024

Kernel 5.x is only available on RHEL 9 as far as I'm aware.

@arashpayan
Contributor Author

Unfortunately, I don't have any administrative control over the environment where I gathered those logs. I was allowed to gather them to debug it, but the system will be stuck with RHEL 8.1, kernel 4.18 until someone else decides to upgrade. I think it's probably safe to close this issue though. If I end up being able to reproduce it in the same environment with a kernel version >= 5, I'll comment here.

@marten-seemann
Member

Thank you @arashpayan!
