
transport: client keepalives should be sent every [Time] period #3102

Merged (5 commits, Oct 24, 2019)

Conversation

@easwars easwars (Contributor) commented Oct 15, 2019

This commit makes the following changes:

  • Keep track of the time of the last read in the transport.
  • Use this in the keepalive implementation to decide when to send out
    keepalives.
  • Address the issue of keepalives being sent every [Time+Timeout] period
    instead of every [Time] period, as mandated by proposal A8.

Proposal A8 is here:
https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md
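To illustrate the first item above, here is a minimal sketch (illustrative names; not the actual http2Client code) of recording the last-read time atomically so the keepalive logic can consult it:

```go
// Sketch only: in grpc-go the real field lives on http2Client and is read by
// the keepalive goroutine.
package transport

import (
	"sync/atomic"
	"time"
)

type readTracker struct {
	lastRead int64 // UnixNano of the most recent read; accessed atomically
}

// recordRead is called by the reader goroutine after every successful read.
func (r *readTracker) recordRead() {
	atomic.StoreInt64(&r.lastRead, time.Now().UnixNano())
}

// lastReadTime is consulted by the keepalive goroutine to decide whether a
// ping is actually needed.
func (r *readTracker) lastReadTime() time.Time {
	return time.Unix(0, atomic.LoadInt64(&r.lastRead))
}
```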

@easwars easwars (Contributor Author) commented Oct 15, 2019

This PR supersedes #2790, where resolving the merge conflicts became unmanageable given how long it had been open.

@easwars easwars (Contributor Author) commented Oct 15, 2019

I ran the benchmarks with the following options:

Network: Local
EnableKeepalive: true
BenchTime: 1m0s
Trace: false
Latency: 0s
Kbps: 0
MTU: 0
Callers: 64
ReqSize: 1B
RespSize: 1B
Compressor: nop
Channelz: false
Preloader: false

and it returned the following results:

           Title       Before        After Percentage
        TotalOps      4722025      4782617     1.28%
         SendOps            0            0      NaN%
         RecvOps            0            0      NaN%
        Bytes/op      5357.89      5356.93    -0.02%
       Allocs/op        42.69        42.67     0.00%
         ReqT/op    629603.33    637682.27     1.28%
        RespT/op    629603.33    637682.27     1.28%
        50th-Lat    757.185µs    748.542µs    -1.14%
        90th-Lat   1.060623ms   1.043457ms    -1.62%
        99th-Lat   2.384094ms   2.359822ms    -1.02%
         Avg-Lat    812.112µs    802.012µs    -1.24%

It is quite surprising that this shows a performance improvement, but I ran the benchmarks multiple times and the results seem consistent.

@easwars easwars (Contributor Author) commented Oct 15, 2019

The 32-bit tests on Travis seem to be unhappy, even though I have put the timeNano field in a struct to guarantee 64-bit alignment. I'm able to easily reproduce the segfault locally with: GOARCH=386 go test -cpu 4 -timeout 7m google.golang.org/grpc/test/...

Am I missing something here?

@easwars easwars (Contributor Author) commented Oct 15, 2019

Moving the lr field to the top of http2Client does seem to fix the issue, but I don't understand why the earlier approach didn't work. It used to work in #2790, but at that time the lastRead struct also had another field.

So maybe the compiler just optimized the whole struct away and inlined the field into http2Client?

Should I just get rid of the lastRead struct completely and just put the int64 field at the top of http2Client?
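For reference: the sync/atomic documentation says that on 32-bit platforms (386, ARM, 32-bit MIPS) it is the caller's responsibility to arrange 64-bit alignment of 64-bit words accessed atomically, and that the first word in an allocated struct can be relied upon to be 64-bit aligned. So keeping the int64 as the very first field of http2Client satisfies the requirement. A sketch with illustrative field names:

```go
// Sketch only: why the atomically accessed int64 is kept as the first field.
// On 386/ARM/32-bit MIPS a 64-bit atomic operation on a misaligned value can
// crash, while the first word of an allocated struct is guaranteed to be
// 64-bit aligned, so putting the field first is the simplest way to be safe.
package transport

type http2ClientLayout struct {
	lastRead int64 // accessed with sync/atomic; MUST stay the first field

	// The remaining fields can be reordered freely without affecting the
	// alignment of lastRead.
	userAgent string
	md        map[string][]string
}
```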

@dfawley dfawley (Member) left a comment

This takes care of the client - there should be similar changes necessary on the server too, right?

@@ -1322,36 +1351,37 @@ func (t *http2Client) keepalive() {
return
}
if len(t.activeStreams) < 1 && !t.kp.PermitWithoutStream {
// If a ping was sent out previously (because there were active
// streams at that point) which wasn't acked and it's timeout
[Member]

Language nit: it's -> its

[Contributor Author]

Done.

// acked).
sleepDuration := minTime(t.kp.Time, timeoutLeft)
timeoutLeft -= sleepDuration
prevNano = time.Now().UnixNano()
[Member]

Why do we do this here? prevNano is used to compare against lastRead. If lastRead is updated between line 1330 and here, we will fail to count that activity.

Consider a crazy race where the ping sent on 1373 is ack'd and that activity is recorded before the code gets here. Yes, this is extremely unlikely, but it's theoretically possible, which means something is wrong.

Should prevNano be set to the lastRead time (always, including line 1333)?

[Contributor Author]

I think it makes sense to set prevNano to lastRead all the time.

On 1336: where the timer has fired but we have had read activity since the last time we were here, it definitely makes sense to set prevNano to lastRead, because we are effectively setting the timer to fire kp.Time seconds after lastRead. So, the next time we get here, we should check whether there has been read activity since then. This also eliminates the extremely unlikely race of read activity happening after our atomic read of t.lr.timeNano and before prevNano is set to time.Now here (with no read activity until the timer fires the next time).

Here around 1384 it also makes sense to set it to lastRead. It took me some time to convince myself that this also works for the case where we have just come out of dormancy and sent out a ping, but I'm convinced now. Please let me know if you feel otherwise.

Thanks for this comment.
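To make the above concrete, here is a rough sketch of the loop shape being discussed (illustrative names and simplified flow, not the actual keepalive() implementation): prevNano is always taken from lastRead, and the timer is re-armed relative to lastRead when there has been activity, so a ping goes out kp.Time after the last read rather than every [Time+Timeout]:

```go
package transport

import (
	"sync/atomic"
	"time"
)

// kpSketch is a simplified stand-in for the transport; names are illustrative.
type kpSketch struct {
	lastRead  int64         // updated atomically by the reader goroutine
	kpTime    time.Duration // keepalive.ClientParameters.Time
	kpTimeout time.Duration // keepalive.ClientParameters.Timeout
}

// keepaliveLoop sketches the pattern discussed above: prevNano always tracks
// lastRead, and the timer is re-armed kp.Time after the most recent read.
func (t *kpSketch) keepaliveLoop(sendPing func(), closeConn func()) {
	prevNano := time.Now().UnixNano()
	outstandingPing := false
	timer := time.NewTimer(t.kpTime)
	for {
		<-timer.C
		lastRead := atomic.LoadInt64(&t.lastRead)
		if lastRead > prevNano {
			// There was read activity since the last check, so no ping is
			// needed (and any outstanding ping has effectively been answered).
			// Fire again kp.Time after the most recent read.
			outstandingPing = false
			prevNano = lastRead
			timer.Reset(time.Duration(lastRead-time.Now().UnixNano()) + t.kpTime)
			continue
		}
		if outstandingPing {
			// No read activity for kp.Timeout after sending the ping.
			closeConn()
			return
		}
		// No read activity for kp.Time: send a ping and allow kp.Timeout for
		// the peer to respond.
		sendPing()
		outstandingPing = true
		prevNano = lastRead
		timer.Reset(t.kpTimeout)
	}
}
```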

func TestKeepaliveClientFrequency(t *testing.T) {
serverConfig := &ServerConfig{
KeepalivePolicy: keepalive.EnforcementPolicy{
MinTime: 2100 * time.Millisecond, // 2.1 seconds
[Member]

Can this be even more aggressive, like 1.1s? Or will it be flaky (because Travis)?

[Contributor Author]

I had mistakenly thought that the server waits for 2 pingStrikes within the configured MinTime before sending a GOAWAY.
Changed it to 1.2 seconds, and the test passed with the race detector for 50 runs.
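For context, this is the kind of configuration the test exercises: the server's enforcement policy sets a minimum allowed ping interval, and a client that pings faster eventually gets a GOAWAY (too_many_pings). A hedged example using grpc-go's public keepalive options (the durations below are illustrative; the test uses much smaller values):

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Illustrative durations only; not the values used by the test in this PR.
func serverOptions() []grpc.ServerOption {
	return []grpc.ServerOption{
		// Clients pinging more often than once per MinTime are considered
		// misbehaving and eventually receive a GOAWAY (too_many_pings).
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             5 * time.Minute,
			PermitWithoutStream: false,
		}),
	}
}

func dialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		// Send a keepalive ping after Time without read activity and wait up
		// to Timeout for an acknowledgement before closing the connection.
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    10 * time.Minute,
			Timeout: 20 * time.Second,
		}),
	}
}
```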

@dfawley dfawley assigned easwars and unassigned dfawley Oct 16, 2019
@dfawley dfawley added this to the 1.25 Release milestone Oct 16, 2019
@easwars easwars (Contributor Author) left a comment

PTAL.

@easwars easwars (Contributor Author) commented Oct 17, 2019

This takes care of the client - there should be similar changes necessary on the server too, right?

That is true. I could do it in a follow-up PR.

@easwars easwars changed the title Keepalive pings should be sent every [Time] period client: Keepalive pings should be sent every [Time] period Oct 17, 2019
@easwars easwars assigned dfawley and unassigned easwars Oct 17, 2019
@dfawley dfawley assigned dfawley and unassigned dfawley Oct 17, 2019
@@ -47,6 +47,7 @@ import (

// http2Client implements the ClientTransport interface with HTTP2.
type http2Client struct {
lr lastRead // keep this field 64-bit aligned
[Member]

Make this int64 and delete the struct?

[Contributor Author]

Done.

@@ -350,6 +350,53 @@ func TestKeepaliveClientStaysHealthyWithResponsiveServer(t *testing.T) {
}
}

// TestKeepaliveClientFrequency creates a server which expects at most 2 client
// pings for every 2.1 seconds, while the client is configured to send a ping
[Member]

This text needs to be updated in light of the number changes.

[Contributor Author]

Done.

@easwars easwars (Contributor Author) left a comment

Sorry, didn't realize there was one more comment needing my attention and that I had not updated the thread. PTAL.

@dfawley dfawley (Member) commented Oct 23, 2019

Sorry, didn't realize there was one more comment needing my attention and that I had not updated the thread. PTAL.

Apparently it was my fault because I changed the assignee from me to me instead of me to you. 🤦‍♂️

@easwars easwars deleted the keepalive_new branch November 11, 2019 23:02
easwars added a commit to easwars/grpc-go that referenced this pull request Nov 12, 2019
This PR contains the server side changes corresponding to the client
side changes made in grpc#3102.

Apart from the fix for the issue mentioned in
grpc#2638, this PR also makes some
minor code cleanup and fixes the channelz test for keepalives count.
easwars added a commit that referenced this pull request Nov 19, 2019
RealBar pushed a commit to RealBar/grpc-go that referenced this pull request Nov 21, 2019
@lock lock bot locked as resolved and limited conversation to collaborators May 20, 2020
@easwars easwars changed the title client: Keepalive pings should be sent every [Time] period transport: client keepalives should be sent every [Time] period Mar 4, 2021