
shared client: Do not busy loop on errors #9

Merged: 1 commit into master on Nov 20, 2023

Conversation

@jrajahalme (Member) commented Nov 16, 2023:

Change the receive loop to wait for a trigger from the send loop to restart receiving after errors have been received. On any error all the current requests are canceled, so there is no point receiving before a new request is sent. This prevents receiver busy-looping on sticky error conditions, such as i/o timeout.

Also consider io.EOF as a connection close, as the TLS/TCP connections are read through the io package, which returns io.EOF on graceful connection close.

Explicitly wait for the write timeout when sending a new request to the handler, as the handler may be quitting at the same time and never actually handle the new request.

Fixes: cilium/cilium#29175
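
For reference, a minimal, self-contained Go sketch of the trigger pattern described above; the channel names, the non-blocking trigger helper, and readOnce are illustrative assumptions, not the actual sharedClient code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Sketch only: readOnce stands in for conn.ReadMsg() and keeps failing with a
// sticky error (an i/o timeout) until enough time has passed.
var start = time.Now()

func readOnce() (string, error) {
	if time.Since(start) < 50*time.Millisecond {
		return "", errors.New("i/o timeout")
	}
	return "response", nil
}

func main() {
	// Buffered with capacity 1 so a trigger sent while the receiver is
	// mid-read is not lost (see the review discussion below).
	receiverTrigger := make(chan struct{}, 1)
	responses := make(chan string)

	// Receive loop: after an error, wait for a trigger from the send loop
	// instead of re-reading immediately, which would busy-loop on the
	// sticky error above.
	go func() {
		for {
			msg, err := readOnce()
			if err != nil {
				// All in-flight requests are canceled on error, so there is
				// nothing to receive until a new request has been sent.
				if _, ok := <-receiverTrigger; !ok {
					return // trigger channel closed: handler is quitting
				}
				continue
			}
			responses <- msg
		}
	}()

	// Send-loop side: after each (re)send, wake the receiver without blocking.
	trigger := func() {
		select {
		case receiverTrigger <- struct{}{}:
		default: // a trigger is already pending
		}
	}

	trigger()                         // first request sent
	time.Sleep(60 * time.Millisecond) // request is retried after the timeout
	trigger()
	fmt.Println(<-responses)
}
```

With an immediate retry in place of the wait on receiverTrigger, the receive loop would spin on the sticky timeout instead of blocking until a new request is sent.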

@jrajahalme added the bug label on Nov 16, 2023

// wait for a trigger from the handler after any errors. Re-reading in
// this condition could busy loop, e.g., if a read timeout occurred.
_, ok := <-receiverTrigger
@bimmlerd (Member) commented Nov 16, 2023:

Isn't this racy now?

If there is a trigger while we are failing in ReadMsg, the trigger can be missed even though a subsequent read would succeed.
Let's say there are two requests. We read on the conn, and the first read fails with buffer too short. Isn't it possible that the second trigger occurs during the (failing) read of the first response, and hence we wait even though there may be a valid response available?

In terms of what happens when this race is hit, I think we would then leak goroutines? Since the receiver wouldn't send any kind of response (after the initial error), the handler would simply wait for more messages coming down the requests/responses channels, and the receiver would wait to be triggered.

@jrajahalme (Member, Author):

You may be right, but if the first error happens after the 2nd request has been sent, then the 2nd request would also be canceled due to the 1st error, i.e., there are no waiting requests after the error, so it is proper to wait for a (new) trigger.

@bimmlerd (Member):

I agree that the requests would be cancelled, but if there are no more requests on that five tuple, we'd leak the goroutines. The read deadline won't save us as we never attempt to read again.

@jrajahalme (Member, Author):

receiverTrigger may need to be buffered with a capacity of one, so that the trigger is not lost if it is sent (just) before the receive loop is ready to receive it?

@jrajahalme (Member, Author):

I'm pretty convinced now that there is a race as you noted, but making receiverTrigger buffered with capacity 1 would address it.
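
A tiny sketch of why a capacity of one helps here, assuming the handler fires the trigger with a non-blocking send (the fire helper and channel names are illustrative):

```go
package main

import "fmt"

// Illustrative only: a capacity-1 buffer keeps a trigger from being lost when
// it is sent before the receiver is ready to receive it (e.g. while the
// receive loop is still inside a failing ReadMsg call).
func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	fire := func(ch chan struct{}) bool {
		select {
		case ch <- struct{}{}: // non-blocking trigger, as the handler would send it
			return true
		default:
			return false // nobody ready and no buffer space: trigger is lost
		}
	}

	// No receiver is waiting yet.
	fmt.Println("unbuffered trigger delivered:", fire(unbuffered)) // false
	fmt.Println("buffered trigger delivered:  ", fire(buffered))   // true

	// When the receive loop finally waits on the channel, the buffered
	// trigger is still there to pick up.
	select {
	case <-buffered:
		fmt.Println("receiver picked up the buffered trigger")
	default:
		fmt.Println("buffered trigger was lost")
	}
}
```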

@jrajahalme (Member, Author):

No, as when both (all) of the requests get a response (or an error), they will release their reference to the shared client (by calling the returned closer function), at which point:

  • client.close() is called
  • c.requests is closed, which wakes up the handler, which then exits and
    • closes the connection
    • closes receiverTrigger channel
    • drains the responses, which allows the receive loop to proceed in case it was blocked on sending a response
  • The receive loop will then discover the closed trigger channel, which was closed before draining
  • If the receive loop was blocked on RecvMsg, then the connection closing will wake it up instead.

@bimmlerd (Member):

Yep, you're absolutely right - I keep missing the larger context because it doesn't all fit in my head.

@jrajahalme (Member, Author):

The only way for the receive loop to hang would be for it to get blocked sending a response despite the connection being closed, the trigger channel being closed, and the responses having been drained once, for example if it tries to send to the responses channel right after the handler has just drained the responses, i.e.:

  • receive loop is receiving
  • receive loop gets a valid response
  • handler closes the connection & trigger
  • handler drains the responses channel and exits
  • receive loop writes on the responses channel and gets blocked?

@jrajahalme (Member, Author):

If this is a possible scenario, then making the responses channel buffered (with a capacity of 1) would solve this, IMO, as the receive loop could then always send one response after the drain has happened, and would be bound to find the closed connection before sending another response?

@bimmlerd (Member):

I don't think that's possible. The for range draining of the responses channel will block until it is closed, which only happens after the last send has occurred, hence safe without buffering.
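
A self-contained sketch of that property: a `for range` drain only returns once the channel has been closed, so in this demo every send finds a receiver and no buffering is needed (the channel name and timings are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: `for range ch` keeps receiving until ch is closed, so a
// drain loop written this way cannot finish while a sender may still send.
func main() {
	responses := make(chan int) // unbuffered, as in the discussion above

	go func() {
		for i := 1; i <= 3; i++ {
			time.Sleep(10 * time.Millisecond)
			responses <- i // each send pairs with a receive in the drain loop
		}
		close(responses) // only after the last send does the drain loop end
	}()

	// Drain: blocks on each receive and only returns once responses is closed.
	for v := range responses {
		fmt.Println("drained", v)
	}
	fmt.Println("responses closed; drain complete")
}
```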

@jrajahalme force-pushed the shared-reader-error-fix branch 2 times, most recently from 1c316c7 to 870e215, on November 16, 2023 at 13:47
// No point trying to receive until the first request has been successfully sent, so
// wait for a trigger first. receiverTrigger is buffered, so this is safe
// to do, even if the sender sends the trigger before we are ready to receive here.
<-receiverTrigger
@bimmlerd (Member):

Don't the blocking semantics of ReadMsg take care of this anyway?

@jrajahalme (Member, Author):

Yes, but I don't know the details of what happens to a currently blocking ReadMsg when the sending side sets the read/write deadlines before each send operation. Likely it adjusts the timeouts, but it could also wait forever based on the read timeout that was in effect when the blocking started?

So sending first/receiving after is safer as the underlying code has been written for that pattern.

@bimmlerd (Member):

Oh no, the deadline will apply:

from net.Conn interface docs:

	// A deadline is an absolute time after which I/O operations
	// fail instead of blocking. The deadline applies to all future
	// and pending I/O, not just the immediately following call to
	// Read or Write. After a deadline has been exceeded, the
	// connection can be refreshed by setting a deadline in the future.
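
A runnable illustration of that quoted behaviour, using net.Pipe only because it provides a deadline-capable in-memory net.Conn (the setup is illustrative, not the DNS client's code):

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// Sketch only: demonstrates the quoted net.Conn semantics, i.e. that a
// deadline set *after* a Read has already blocked still interrupts that Read.
func main() {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	done := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, err := client.Read(buf) // blocks: the other side never writes
		done <- err
	}()

	time.Sleep(50 * time.Millisecond) // the Read above is blocked by now

	// The deadline applies to the already-pending Read, not only future ones.
	if err := client.SetReadDeadline(time.Now().Add(10 * time.Millisecond)); err != nil {
		panic(err)
	}

	err := <-done
	fmt.Println("blocked Read returned:", err)
	fmt.Println("deadline exceeded:", errors.Is(err, os.ErrDeadlineExceeded))
}
```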

@jrajahalme (Member, Author):

Thanks for digging this up.

// handler is not reading on the channel after closing.
// UDP connections return net.ErrClosed, while TCP/TLS connections are read
// via the io package, which return io.EOF.
if errors.Is(err, net.ErrClosed) || errors.Is(err, io.EOF) {
@bimmlerd (Member) commented Nov 16, 2023:

In Go, afaik, the semantics of EOF can be such that you receive a response and an io.EOF. If that's the case, we'd be dropping that response here, no?

From https://pkg.go.dev/io#Reader:

When Read encounters an error or end-of-file condition after successfully reading 
n > 0 bytes, it returns the number of bytes read. It may return the (non-nil) error
from the same call or return the error (and n == 0) from a subsequent call. An
instance of this general case is that a Reader returning a non-zero number of bytes 
at the end of the input stream may return either err == EOF or err == nil. The next 
Read should return 0, EOF. 
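
For reference, a self-contained sketch of the read-loop pattern that contract implies, using testing/iotest.DataErrReader to force the data-plus-io.EOF case (illustrative only; as noted in the reply below, ReadMsg is not a raw Read):

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// Sketch only: per the quoted io.Reader contract, a correct read loop must
// consume the n > 0 bytes before inspecting err, since the final data and
// io.EOF may arrive in the same call. iotest.DataErrReader forces that case.
func main() {
	r := iotest.DataErrReader(strings.NewReader("dns response bytes"))

	buf := make([]byte, 8)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			fmt.Printf("got %q\n", buf[:n]) // use the data even if err != nil
		}
		if err == io.EOF {
			fmt.Println("stream ended")
			return
		}
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
	}
}
```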

@jrajahalme (Member, Author):

ReadMsg() does this:

func (co *Conn) ReadMsg() (*Msg, error) {
	p, err := co.ReadMsgHeader(nil)
	if err != nil {
		return nil, err
	}

and deeper down:

	if err != nil {
		return nil, err

So I don't think that happens, or it would have been an issue for a long time now. Maybe the fact that the client is in control of normal closing of the connection (for TCP) and that EOF does not really happen on UDP are also at play here?

@bimmlerd (Member):

Hm yeah, good point - ReadMsg != Read. I suppose I'm just surprised to see an io.EOF bubbled up like that, but you're right, the code here is fine as is 👍

Change the receive loop to wait for a trigger from the send loop to
restart receiving after errors have been received. On any error all the
current requests are canceled, so there is no point receiving before a
new request is sent. This prevents receiver busy-looping on sticky error
conditions, such as i/o timeout.

Also consider io.EOF as a connection close, as the TLS/TCP connections
are read through the io package, which returns io.EOF on graceful
connection close.

Explicitly wait for the write timeout when sending a new request to the
handler, as the handler may be quitting at the same time and never
actually handle the new request.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme (Member, Author):

Pushed a minor cleanup on comments.

@bimmlerd (Member) left a comment:

I still think this is racy.


// wait for a trigger from the handler after any errors. Re-reading in
// this condition could busy loop, e.g., if a read timeout occurred.
_, ok := <-receiverTrigger
@bimmlerd (Member) commented Nov 20, 2023:

Isn't there a problem still, even with the buffered channel? Let's say we have the following ordering in a two request scenario:

  1. Receiver is started, waits on trigger.
  2. Handler receives two requests, and triggers twice. The second trigger is potentially ignored, as the channel contains a trigger already. (Crucially, I don't think there is a guarantee that the runtime will complete the receiver's read on the channel before the second send is attempted)
  3. The receiver starts reading on being triggered, and errors (say, ErrBufTooShort). It returns the error via clientResponse. It now waits on another trigger before reading.
  4. The handler receives the client response, and cancels the waiters. It waits for requests or responses.

Aren't we now waiting indefinitely, again?
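
A small sketch of the coalescing that step 2 describes, assuming the handler uses a non-blocking send into the capacity-1 trigger channel (the fire helper is an illustrative assumption):

```go
package main

import "fmt"

// Illustrative only: with a capacity-1 trigger channel and non-blocking sends,
// two back-to-back triggers can coalesce into one, as in the ordering above.
func main() {
	receiverTrigger := make(chan struct{}, 1)

	fire := func() bool {
		select {
		case receiverTrigger <- struct{}{}:
			return true
		default:
			return false // channel already holds a trigger: this one is dropped
		}
	}

	// The handler triggers once per request before the receiver has woken up.
	fmt.Println("first trigger delivered: ", fire()) // true: fills the buffer
	fmt.Println("second trigger delivered:", fire()) // false: buffer is full

	// The receiver wakes up and consumes a single trigger. If its next read
	// then fails, no further trigger is pending and it waits again.
	<-receiverTrigger
	fmt.Println("pending triggers left:", len(receiverTrigger)) // prints 0
}
```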

@jrajahalme merged commit 7293451 into master on Nov 20, 2023
8 checks passed