shared_client: Bump request id #12

Open · wants to merge 6 commits into master

Conversation

jrajahalme
Member

Only fail out if a non-conflicting request id cannot be found.

This works on the premise that the callers are fine with the request id
being modified at this point. The current use sets a random id just prior to the Exchange call, so this premise is satisfied.

@jrajahalme jrajahalme requested a review from gandro April 25, 2024 13:50
Member

@gandro gandro left a comment

This seems harmless to me, but I'm still not confident in my understanding of SharedClient.

cc @marseel if you also want to take a look, since you know that particular piece better than I do

@marseel

marseel commented Apr 25, 2024

I was thinking about it yesterday :)

I think it looks alright, but the main trigger for this issue is probably something like this:

  • We are sending a constant stream of DNS requests
  • For some of them we get responses, for some we don't
  • We never time out waiting for requests in waitingResponses and they get stuck there - this happens because we have only a single timeout here that is never triggered as we still receive some of the responses.

So I think this change definitely makes sense, but more important IMHO would be to have separate timeouts per request in waitingResponses

Does it make sense @jrajahalme ?

@jrajahalme
Member Author

I was thinking about it yesterday :)

I think it looks alright, but the main trigger for this issue is probably something like this:

  • We are sending a constant stream of DNS requests
  • For some of them we get responses, for some we don't
  • We never time out waiting for requests in waitingResponses and they get stuck there - this happens because we have only a single timeout here that is never triggered as we still receive some of the responses.

So I think this change definitely makes sense, but more important IMHO would be to have separate timeouts per request in waitingResponses

Does it make sense @jrajahalme ?

Makes total sense. I'll look into adding request-specific timeouts.

@jrajahalme
Member Author

@marseel Added per-request timeout handling via a deadline queue using container/heap, please have a look!
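
For context, a rough sketch of what such a container/heap-based deadline queue can look like is below. The waiter fields and the GetTimeout helper mirror the snippets quoted later in this review; the rest is illustrative and not the exact PR code.

// waiter tracks one in-flight request: its DNS message id, its per-request
// deadline, and the channel its response (or timeout error) is delivered on.
// sharedClientResponse is the PR's existing response type.
type waiter struct {
	id       uint16
	deadline time.Time
	ch       chan sharedClientResponse
}

// waitQueue is a min-heap of waiters ordered by deadline (earliest first).
// It implements heap.Interface, so container/heap keeps it ordered on
// heap.Push and heap.Pop. Note that heap.Interface uses `any` (Go 1.18+).
type waitQueue struct {
	waiters []*waiter
}

func (wq *waitQueue) Len() int           { return len(wq.waiters) }
func (wq *waitQueue) Less(i, j int) bool { return wq.waiters[i].deadline.Before(wq.waiters[j].deadline) }
func (wq *waitQueue) Swap(i, j int)      { wq.waiters[i], wq.waiters[j] = wq.waiters[j], wq.waiters[i] }
func (wq *waitQueue) Push(x any)         { wq.waiters = append(wq.waiters, x.(*waiter)) }

func (wq *waitQueue) Pop() any {
	n := len(wq.waiters)
	w := wq.waiters[n-1]
	wq.waiters = wq.waiters[:n-1]
	return w
}

// GetTimeout returns the time left until the earliest deadline, or a long
// idle timeout when nothing is waiting.
func (wq *waitQueue) GetTimeout() time.Duration {
	if wq.Len() == 0 {
		return 10 * time.Minute
	}
	return time.Until(wq.waiters[0].deadline)
}

The handler can then arm a single timer with GetTimeout() and, when it fires, pop expired waiters off the heap and send them a timeout response, as in the snippets discussed below.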

@marseel marseel left a comment

@gandro probably you want to take another look as well :)

shared_client.go Outdated
if wq.Len() == 0 {
	return 10 * time.Minute
}
return wq.waiters[0].deadline.Sub(time.Now())

nit: return time.Until(wq.waiters[0].deadline)

Member Author

done!

shared_client.go Outdated
	break
}
wtr := heap.Pop(wq).(*waiter)
wtr.ch <- sharedClientResponse{nil, 0, context.DeadlineExceeded}
Member Author

Good point, not sure how to return a net.Error Timeout though?

Member Author

Have to implement the net.Error interface:

// errTimeout is an error representing a request timeout.
// Implements net.Error.
type errTimeout struct{}

func (e errTimeout) Timeout() bool { return true }

// Temporary is deprecated. Return false.
func (e errTimeout) Temporary() bool { return false }

func (e errTimeout) Error() string {
	return "request timed out"
}

var netErrorTimeout errTimeout
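
If it helps, a hypothetical caller-side check (not from the PR) could then look like this; any error delivered as netErrorTimeout satisfies net.Error with Timeout() == true:

// err is the error returned from the Exchange call; errors and net are
// standard library packages.
var netErr net.Error
if errors.As(err, &netErr) && netErr.Timeout() {
	// the request timed out in the shared client; retry or report upstream
}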

Member Author

done!

shared_client.go Outdated
if waiters.Exists(req.msg.Id) {
	// find next available ID
	duplicate := true
	for id := req.msg.Id + 1; id != req.msg.Id; id++ {

I'm not sure how safe it is to use these kinds of "predictable" IDs - https://nvd.nist.gov/vuln/detail/CVE-2008-0087. I would probably try Id() a few times (for example 3) instead. That would also give a more predictable runtime.

Member Author

Each request starts from a random number (via Id(), set by the caller), so it is not likely that the IDs of consecutive requests would be sequential. This relies on the caller, though.

IMO 3 random tries might not work if there are a lot of requests in flight?


IMO 3 random tries might not work if there are a lot of requests in flight?

True, but that would require tens of thousands of in-flight requests - meaning something else is probably wrong already. I would consider using a fixed number of tries as a circuit breaker.

For example, with 32k in-flight requests, we would still have a chance of 1 - (32000/65000)^4 ~= 0.94 of getting a new, non-conflicting ID.
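
As a side note, the general form of that arithmetic is (illustrative only, not part of the PR): with n of the 65536 possible 16-bit IDs already in flight and k independent random tries, the chance that every try collides is (n/65536)^k.

// allTriesCollide returns the probability that k independent random 16-bit
// IDs all collide with one of n in-flight IDs (illustrative helper only;
// uses the standard math package).
func allTriesCollide(n, k int) float64 {
	return math.Pow(float64(n)/65536.0, float64(k))
}

// e.g. allTriesCollide(32768, 5) ≈ 0.03, so roughly a 97% chance of finding
// a free ID even with half of the ID space occupied.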

Member Author

Done, with 5 tries.
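
For the record, the bounded retry can be sketched roughly as follows; waiters.Exists and req.msg come from the quoted snippet, Id() is the package's random-ID helper mentioned above, and the exact try count and error handling are illustrative:

const maxIDTries = 5

// Pick a non-conflicting request ID, giving up after a fixed number of
// random tries so the loop acts as a circuit breaker.
ok := !waiters.Exists(req.msg.Id)
for try := 0; !ok && try < maxIDTries; try++ {
	req.msg.Id = Id() // fresh random 16-bit ID
	ok = !waiters.Exists(req.msg.Id)
}
if !ok {
	// report a duplicate request id error back to the caller
}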

Member Author

I would consider using a fixed number of tries as a circuit breaker.

I did not get this part, though?


Ah, never mind, that's essentially what I meant: a fixed loop.

Only fail out if a non-conflicting request id cannot be found in a couple of tries.

This works on the premise that the callers are fine with the request id
being modified at this point. The current use sets a random id just prior to the Exchange call, so this premise is satisfied.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Buffer the responses channel so that the handler does not get blocked if the channel is not received from (e.g., after a timeout).

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Use the configured read timeout to bound the time spent on receiving the
response, instead of waiting for a full minute.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Drain requests on handler close, so that pending requests are terminated
immediately when the handler needs to close for an error condition, rather
than having the requests time out. This allows the handler to be recycled
faster for new requests.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
container/heap uses `any`, which was added in Go 1.18. Bump tested Go
versions to accommodate this.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Tell the handler to delete waiters after a request times out.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Member

@gandro gandro left a comment

I need to do a more in-depth review of the last commit, but here is a bit of feedback from a first pass.

// Drain requests in case they come in while we are closing
// down. This loop is done only after 'requests' channel is closed in
// SharedClient.close() and it is not possible for new requests or timeouts
// to be sent on those closed channels.
Member

I don't follow this comment. Sending on a closed channel panics - surely that's not how we prevent the senders from sending requests? I assume the real check preventing senders from sending is that conn is nil, right?

Member Author

SharedClient.close() (note: not a close(<channel>)) is called when the shared client can no longer be used for new requests. So there is nothing sent to the closed channel. The point of the comment is that the range loop on the channel completes only after the channel is closed (by the side sending to the channel), so we are guaranteed to send replies to all requests received on this channel.
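
In case it helps, the pattern being described is roughly the following (the req.ch field name and the error value are illustrative, not the PR's exact code):

// close(requests) is only called from SharedClient.close(), after which no
// new requests are sent. This range loop therefore receives every request
// that was already queued and then exits, so each queued request is
// guaranteed to get a terminal reply.
for req := range requests {
	req.ch <- sharedClientResponse{nil, 0, net.ErrClosed}
}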

@@ -7,7 +7,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        go: [ 1.17.x, 1.18.x ]
+        go: [ 1.18.x, 1.19.x ]
Member

unrelated nit: Given the sheer amount of highly concurrent code in the shared client, we should also run with go test -race in this workflow.

for {
	// update timer
	deadline.Reset(waiters.GetTimeout())
Member

@gandro gandro Apr 30, 2024

Reset is not safe to use when the timer is not drained: https://pkg.go.dev/time#Timer.Reset

How do we know that the timer is drained here?

Member Author

Thanks for noting this, I will have to work around it...
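
For reference, the commonly used workaround is to stop the timer and drain its channel before each Reset; a sketch, not necessarily the fix this PR will end up with:

// Stop the timer first; if it had already fired, drain the pending tick
// (non-blocking, in case it was already consumed elsewhere) so that the
// following Reset starts from a clean timer.
if !deadline.Stop() {
	select {
	case <-deadline.C:
	default:
	}
}
deadline.Reset(waiters.GetTimeout())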

@jrajahalme
Member Author

@marseel FYI we are seeing CI flakes due to ID collisions in cilium main, so it is prudent to at least retry to get a good ID (like in this PR):

  ❌ Found 1 logs in kind-kind/kube-system/cilium-jgxqz (cilium-agent) matching list of errors that must be investigated:
time="2024-05-03T07:20:54Z" level=error msg="Cannot forward proxied DNS lookup" DNSRequestID=45017 dnsName=one.one.one.one. endpointID=1252 error="duplicate request id 224" identity=10831 ipAddr="10.244.2.92:47175" subsys=fqdn/dnsproxy (1 occurrences)

@marseel

marseel commented May 6, 2024

FYI we are seeing CI flakes due to ID collisions in cilium main, so it is prudent to at least retry to get a good ID (like in this PR):

Interesting, that would mean we were hitting this issue even with a low number of concurrent requests 🤔
In that case, I guess we could merge the retry-only change first to mitigate these failures and then follow up with the heap and timeouts.

@jrajahalme
Member Author

FYI we are seeing CI flakes due to ID collisions in cilium main, so it is prudent to at least retry to get a good ID (like in this PR):

Interesting, that would mean we were hitting this issue even with a low number of concurrent requests 🤔 In that case, I guess we could merge the retry-only change first to mitigate these failures and then follow up with the heap and timeouts.

Here's a PR for the 1st commit only: #13
