
shared client: Do not busy loop on errors #9

Merged: 1 commit into master on Nov 20, 2023

Conversation

@jrajahalme (Member) commented Nov 16, 2023:

Change the receive loop to wait for a trigger from the send loop to restart receiving after errors have been received. On any error all the current requests are canceled, so there is no point receiving before a new request is sent. This prevents receiver busy-looping on sticky error conditions, such as i/o timeout.

Also consider io.EOF as a connection close, as the TLS/TCP connections are read through the io package, which returns io.EOF on graceful connection close.

Explicitly wait for the write timeout when sending a new request to the handler, as the handler may be quitting at the same time and never actually handle the new request.

Fixes: cilium/cilium#29175
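
For reference, a minimal, self-contained Go sketch of the trigger pattern described above; the channel names, the non-blocking trigger helper, and readOnce are illustrative assumptions, not the actual sharedClient code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Sketch only: readOnce stands in for conn.ReadMsg() and keeps failing with a
// sticky error (an i/o timeout) until enough time has passed.
var start = time.Now()

func readOnce() (string, error) {
	if time.Since(start) < 50*time.Millisecond {
		return "", errors.New("i/o timeout")
	}
	return "response", nil
}

func main() {
	// Buffered with capacity 1 so a trigger sent while the receiver is
	// mid-read is not lost (see the review discussion below).
	receiverTrigger := make(chan struct{}, 1)
	responses := make(chan string)

	// Receive loop: after an error, wait for a trigger from the send loop
	// instead of re-reading immediately, which would busy-loop on the
	// sticky error above.
	go func() {
		for {
			msg, err := readOnce()
			if err != nil {
				// All in-flight requests are canceled on error, so there is
				// nothing to receive until a new request has been sent.
				if _, ok := <-receiverTrigger; !ok {
					return // trigger channel closed: handler is quitting
				}
				continue
			}
			responses <- msg
		}
	}()

	// Send-loop side: after each (re)send, wake the receiver without blocking.
	trigger := func() {
		select {
		case receiverTrigger <- struct{}{}:
		default: // a trigger is already pending
		}
	}

	trigger()                         // first request sent
	time.Sleep(60 * time.Millisecond) // request is retried after the timeout
	trigger()
	fmt.Println(<-responses)
}
```

With an immediate retry in place of the wait on receiverTrigger, the receive loop would spin on the sticky timeout instead of blocking until a new request is sent.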

@jrajahalme added the bug label on Nov 16, 2023

// wait for a trigger from the handler after any errors. Re-reading in
// this condition could busy loop, e.g., if a read timeout occurred.
_, ok := <-receiverTrigger
@bimmlerd (Member) commented Nov 16, 2023:

Isn't this racy now?

If there is a trigger while we are failing in ReadMsg, the trigger can be missed even though a subsequent read would succeed.
Let's say there are two requests. We read on the conn, and the first read fails with buffer too short. Isn't it possible that the second trigger occurs during the (failing) read of the first response, and hence we wait even though there may be a valid response available?

In terms of what happens when this race is hit, I think we would then leak goroutines? Since the receiver wouldn't send any kind of response (after the initial error), the handler would simply wait for more messages coming down the requests/responses channels, and the receiver would wait to be triggered.

@jrajahalme (Member, Author):

You may be right, but if the first error happens after the 2nd request has been sent, then the 2nd request would also be canceled due to the 1st error, i.e., there are no waiting requests after the error, so it is proper to wait for a (new) trigger.

@bimmlerd (Member):

I agree that the requests would be cancelled, but if there are no more requests on that five tuple, we'd leak the goroutines. The read deadline won't save us as we never attempt to read again.

@jrajahalme (Member, Author):

receiverTrigger may need to be buffered with a capacity of one, so that the trigger is not lost if it is sent (just) before the receive loop is ready to receive it?

@jrajahalme (Member, Author):

I'm pretty convinced now that there is a race as you noted, but making receiverTrigger buffered with capacity 1 would address it.
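
A tiny sketch of why a capacity of one helps here, assuming the handler fires the trigger with a non-blocking send (the fire helper and channel names are illustrative):

```go
package main

import "fmt"

// Illustrative only: a capacity-1 buffer keeps a trigger from being lost when
// it is sent before the receiver is ready to receive it (e.g. while the
// receive loop is still inside a failing ReadMsg call).
func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	fire := func(ch chan struct{}) bool {
		select {
		case ch <- struct{}{}: // non-blocking trigger, as the handler would send it
			return true
		default:
			return false // nobody ready and no buffer space: trigger is lost
		}
	}

	// No receiver is waiting yet.
	fmt.Println("unbuffered trigger delivered:", fire(unbuffered)) // false
	fmt.Println("buffered trigger delivered:  ", fire(buffered))   // true

	// When the receive loop finally waits on the channel, the buffered
	// trigger is still there to pick up.
	select {
	case <-buffered:
		fmt.Println("receiver picked up the buffered trigger")
	default:
		fmt.Println("buffered trigger was lost")
	}
}
```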

@jrajahalme (Member, Author):

No, as when both (all) of the requests get a response (or an error), they will release their reference to the shared client (by calling the returned closer function), at which point:

  • client.close() is called
  • c.requests is closed, which wakes up the handler, which then exits and
    • closes the connection
    • closes receiverTrigger channel
    • drains the responses, which allows the receive loop to proceed in case it was blocked on sending a response
  • The receive loop will then discover the closed trigger channel, which was closed before draining
  • If the receive loop was blocked on RecvMsg, then the connection closing will wake it up instead.

@bimmlerd (Member):

Yep, you're absolutely right - I keep missing the larger context because it doesn't all fit in my head.

@jrajahalme (Member, Author):

The only way for the receive loop to hang would be for it to get blocked sending a response despite the connection being closed, the trigger channel being closed, and the responses having been drained once, for example if it tries to send to the responses channel right after the handler has just drained the responses, i.e.:

  • receive loop is receiving
  • receive loop gets a valid response
  • handler closes the connection & trigger
  • handler drains the responses channel and exits
  • receive loop writes on the responses channel and gets blocked?

@jrajahalme (Member, Author):

If this is a possible scenario, then making the responses channel buffered (with a capacity of 1) would solve this, IMO, as the receive loop could then always send one response after the drain has happened, and would be bound to find the closed connection before sending another response?

@bimmlerd (Member):

I don't think that's possible. The for range draining of the responses channel will block until it is closed, which only happens after the last send has occurred, hence safe without buffering.
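
A self-contained sketch of that property: a `for range` drain only returns once the channel has been closed, so in this demo every send finds a receiver and no buffering is needed (the channel name and timings are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: `for range ch` keeps receiving until ch is closed, so a
// drain loop written this way cannot finish while a sender may still send.
func main() {
	responses := make(chan int) // unbuffered, as in the discussion above

	go func() {
		for i := 1; i <= 3; i++ {
			time.Sleep(10 * time.Millisecond)
			responses <- i // each send pairs with a receive in the drain loop
		}
		close(responses) // only after the last send does the drain loop end
	}()

	// Drain: blocks on each receive and only returns once responses is closed.
	for v := range responses {
		fmt.Println("drained", v)
	}
	fmt.Println("responses closed; drain complete")
}
```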

@jrajahalme force-pushed the shared-reader-error-fix branch 2 times, most recently from 1c316c7 to 870e215, on November 16, 2023 at 13:47
// No point trying to receive until the first request has been successfully sent, so
// wait for a trigger first. receiverTrigger is buffered, so this is safe
// to do, even if the sender sends the trigger before we are ready to receive here.
<-receiverTrigger
@bimmlerd (Member):

Don't the blocking semantics of ReadMsg take care of this anyway?

@jrajahalme (Member, Author):

Yes, but I don't know the details of what happens to a currently blocking ReadMsg when the sending side sets the read/write deadlines before each send operation. Likely it adjusts the timeouts, but it could also wait forever based on the read timeout that was in effect when the blocking started?

So sending first/receiving after is safer as the underlying code has been written for that pattern.

@bimmlerd (Member):

Oh no, the deadline will apply:

from net.Conn interface docs:

	// A deadline is an absolute time after which I/O operations
	// fail instead of blocking. The deadline applies to all future
	// and pending I/O, not just the immediately following call to
	// Read or Write. After a deadline has been exceeded, the
	// connection can be refreshed by setting a deadline in the future.
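
A runnable illustration of that quoted behaviour, using net.Pipe only because it provides a deadline-capable in-memory net.Conn (the setup is illustrative, not the DNS client's code):

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// Sketch only: demonstrates the quoted net.Conn semantics, i.e. that a
// deadline set *after* a Read has already blocked still interrupts that Read.
func main() {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	done := make(chan error, 1)
	go func() {
		buf := make([]byte, 1)
		_, err := client.Read(buf) // blocks: the other side never writes
		done <- err
	}()

	time.Sleep(50 * time.Millisecond) // the Read above is blocked by now

	// The deadline applies to the already-pending Read, not only future ones.
	if err := client.SetReadDeadline(time.Now().Add(10 * time.Millisecond)); err != nil {
		panic(err)
	}

	err := <-done
	fmt.Println("blocked Read returned:", err)
	fmt.Println("deadline exceeded:", errors.Is(err, os.ErrDeadlineExceeded))
}
```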

@jrajahalme (Member, Author):

Thanks for digging this up.

// handler is not reading on the channel after closing.
// UDP connections return net.ErrClosed, while TCP/TLS connections are read
// via the io package, which return io.EOF.
if errors.Is(err, net.ErrClosed) || errors.Is(err, io.EOF) {
@bimmlerd (Member) commented Nov 16, 2023:

In Go, afaik, the semantics of EOF can be such that you receive a response and an io.EOF. If that's the case, we'd be dropping that response here, no?

From https://pkg.go.dev/io#Reader:

When Read encounters an error or end-of-file condition after successfully reading 
n > 0 bytes, it returns the number of bytes read. It may return the (non-nil) error
from the same call or return the error (and n == 0) from a subsequent call. An
instance of this general case is that a Reader returning a non-zero number of bytes 
at the end of the input stream may return either err == EOF or err == nil. The next 
Read should return 0, EOF. 
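
For reference, a self-contained sketch of the read-loop pattern that contract implies, using testing/iotest.DataErrReader to force the data-plus-io.EOF case (illustrative only; as noted in the reply below, ReadMsg is not a raw Read):

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// Sketch only: per the quoted io.Reader contract, a correct read loop must
// consume the n > 0 bytes before inspecting err, since the final data and
// io.EOF may arrive in the same call. iotest.DataErrReader forces that case.
func main() {
	r := iotest.DataErrReader(strings.NewReader("dns response bytes"))

	buf := make([]byte, 8)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			fmt.Printf("got %q\n", buf[:n]) // use the data even if err != nil
		}
		if err == io.EOF {
			fmt.Println("stream ended")
			return
		}
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
	}
}
```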

@jrajahalme (Member, Author):

ReadMsg() does this:

func (co *Conn) ReadMsg() (*Msg, error) {
	p, err := co.ReadMsgHeader(nil)
	if err != nil {
		return nil, err
	}

and deeper down:

	if err != nil {
		return nil, err

So I don't think that happens, or it would have been an issue for a long time now. Maybe the fact that the client is in control of normal closing of the connection (for TCP) and that EOF does not really happen on UDP are also at play here?

@bimmlerd (Member):

Hm yeah, good point - ReadMsg != Read. I suppose I'm just surprised to see an io.EOF bubbled up like that, but you're right, the code here is fine as is 👍

Change the receive loop to wait for a trigger from the send loop to
restart receiving after errors have been received. On any error all the
current requests are canceled, so there is no point receiving before a
new request is sent. This prevents receiver busy-looping on sticky error
conditions, such as i/o timeout.

Also consider io.EOF as a connection close, as the TLS/TCP connections
are read through the io package, which returns io.EOF on graceful
connection close.

Explicitly wait for the write timeout when sending a new request to the
handler, as the handler may be quitting at the same time and never
actually handle the new request.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme (Member, Author):

Pushed a minor cleanup on comments.

@bimmlerd (Member) left a comment:

I still think this is racy.


// wait for a trigger from the handler after any errors. Re-reading in
// this condition could busy loop, e.g., if a read timeout occurred.
_, ok := <-receiverTrigger
@bimmlerd (Member) commented Nov 20, 2023:

Isn't there a problem still, even with the buffered channel? Let's say we have the following ordering in a two request scenario:

  1. Receiver is started, waits on trigger.
  2. Handler receives two requests, and triggers twice. The second trigger is potentially ignored, as the channel contains a trigger already. (Crucially, I don't think there is a guarantee that the runtime will complete the receiver's read on the channel before the second send is attempted)
  3. The receiver starts reading on being triggered, and errors (say, ErrBufTooShort). It returns the error via clientResponse. It now waits on another trigger before reading.
  4. The handler receives the client response, and cancels the waiters. It waits for requests or responses.

Aren't we now waiting indefinitely, again?
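
A small sketch of the coalescing that step 2 describes, assuming the handler uses a non-blocking send into the capacity-1 trigger channel (the fire helper is an illustrative assumption):

```go
package main

import "fmt"

// Illustrative only: with a capacity-1 trigger channel and non-blocking sends,
// two back-to-back triggers can coalesce into one, as in the ordering above.
func main() {
	receiverTrigger := make(chan struct{}, 1)

	fire := func() bool {
		select {
		case receiverTrigger <- struct{}{}:
			return true
		default:
			return false // channel already holds a trigger: this one is dropped
		}
	}

	// The handler triggers once per request before the receiver has woken up.
	fmt.Println("first trigger delivered: ", fire()) // true: fills the buffer
	fmt.Println("second trigger delivered:", fire()) // false: buffer is full

	// The receiver wakes up and consumes a single trigger. If its next read
	// then fails, no further trigger is pending and it waits again.
	<-receiverTrigger
	fmt.Println("pending triggers left:", len(receiverTrigger)) // prints 0
}
```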

@jrajahalme merged commit 7293451 into master on Nov 20, 2023
8 checks passed