Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vendor: upgrade grpc from 1.13.0 to 1.21.2 #39041

Merged
merged 2 commits into from Jul 30, 2019

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Jul 22, 2019

This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this
upgrade is to eliminate the disconnections caused by
grpc/grpc-go#1882. These failures manifest themselves
as the following set of errors:

ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE

Which then lead to tripped breakers and general badness. I suspect that there
are several other good bug fixes in here, including some purported leaks and
correctness fixes on shutdown.

I have verified that with this upgrade I no longer see connections break in
overload scenarios which reliably reproduced the situation in the above log.

This commit removes one condition from grpcutil.IsClosedConnection which should
be subsumed by the status check above. The transport subpackage has not been
around for many releases.

This does not upgrade to the current release 1.22.0 because the maintainer
mentions that it contains a bug
(grpc/grpc-go#2663 (comment)).

Release note (bug fix): Upgrade grpc library to fix connection state management
bug.

@ajwerner ajwerner requested review from jordanlewis and a team July 22, 2019 22:24
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/upgrade_grpc_to_1_21 branch from a0a9752 to 60225a3 Compare July 22, 2019 23:18
@ajwerner
Copy link
Contributor Author

Grr the following PR grpc/grpc-go#2642 breaks TestGRPCKeepaliveFailureFailsInflightRPCs which expects to be able to control the Keepalive duration. Here the gRPC spec is working against us. I've updated the test to not time out until 30s but it makes it a 10s test. I'm open to skipping it.

This test however under stress hit grpc/grpc-go#2818 which is only fixed in 1.22 which is unfortunately subject to grpc/grpc-go#2862.

In summary there's not published release which is thought to be bug free. I'm tempted to pull master as it purports to contain bug fixes for all of these bugs and then to watch for the next release. Thoughts?

@jordanlewis
Copy link
Member

I don't feel particularly great about pulling master. What other bugs might be lurking?

What's the version released before 2642?

@ajwerner
Copy link
Contributor Author

1.18.1, I've verified that it passes, though seems somewhat flakey under stress on a different test. Digging in to that now.

Master also seems to fail the test unreliably due to what appears to be requests waiting on writeQuota. I can't tell if it's working as expected but it feels like a bug. The connection should be closed. I'll dig a little deeper and potentially file a grpc bug.

@ajwerner ajwerner force-pushed the ajwerner/upgrade_grpc_to_1_21 branch from 60225a3 to d213a78 Compare July 23, 2019 14:04
@ajwerner ajwerner requested a review from a team July 23, 2019 14:04
@ajwerner ajwerner changed the title vendor: upgrade grpc from 1.13.0 to 1.21.1 vendor: upgrade grpc from 1.13.0 to 1.18.1 Jul 23, 2019
@ajwerner
Copy link
Contributor Author

I saw this after 4 minutes of stress:

--- FAIL: TestStatsHandlerWithHeartbeats (45.10s)
    soon.go:35: condition failed to evaluate within 45s: expected server.incoming > client.outgoing; got 110, 134   

Digging. Otherwise this feels a bit better. Will also verify the disconnection behavior.

@ajwerner
Copy link
Contributor Author

There's no winning here. 1.18.1 contains grpc/grpc-go#2692 which causes the flakiness in TestStatsHandlerWithHeartbeats and also would just totally in general muck with our stats.

I think I'm going to go back to trying to pick up 1.21.1 and adjust the tests appropriately.

@ajwerner ajwerner force-pushed the ajwerner/upgrade_grpc_to_1_21 branch from 6efb47a to 8bd17e4 Compare July 24, 2019 13:53
@ajwerner ajwerner changed the title vendor: upgrade grpc from 1.13.0 to 1.18.1 vendor: upgrade grpc from 1.13.0 to 1.20.1 Jul 24, 2019
@ajwerner
Copy link
Contributor Author

Okay this is updated to use 1.21.1 and roughly RFAL.

This bump brings us to really long gRPC keepalives. It will mean that detecting network partitions from the client to the server will take ~3x longer but that's probably okay. The more annoying thing is that the test now takes much longer. The test also needed to be updated because it used to rely on having enough quota in flow control to send without blocking. It passes now reliably but I added a second commit to skip it because of its crazy long duration.

Another issue this brings up are that the stats in master (since 1.17) are totally broken (see grpc/grpc-go#2932). This process has been rather frustrating. I'm wondering if it's worth having a fork of grpc for little things like this. I can try to get this change backported but the gRPC maintainers seem to have a tendency to let PRs sit for a long time.

@ajwerner ajwerner changed the title vendor: upgrade grpc from 1.13.0 to 1.20.1 vendor: upgrade grpc from 1.13.0 to 1.21.1 Jul 24, 2019
@ajwerner ajwerner force-pushed the ajwerner/upgrade_grpc_to_1_21 branch 3 times, most recently from 7130a95 to 4817f3e Compare July 25, 2019 18:32
@ajwerner ajwerner changed the title vendor: upgrade grpc from 1.13.0 to 1.21.1 vendor: upgrade grpc from 1.13.0 to 1.21.2 Jul 25, 2019
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this
upgrade is to eliminate the disconnections caused by
grpc/grpc-go#1882. These failures manifest themselves
as the following set of errors:

```
ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE
```
Which then lead to tripped breakers and general badness. I suspect that there
are several other good bug fixes in here, including some purported leaks and
correctness fixes on shutdown.

I have verified that with this upgrade I no longer see connections break in
overload scenarios which reliably reproduced the situation in the above log.

This commit removes one condition from grpcutil.IsClosedConnection which should
be subsumed by the status check above. The `transport` subpackage has not been
around for many releases.

This does not upgrade to the current release 1.22.0 because the maintainer
mentions that it contains a bug
(grpc/grpc-go#2663 (comment)).

This change also unfortunately updates the keepalive behavior to be more spec
compliant (grpc/grpc-go#2642). This change mandates
a minimum ping time of 10s to the client. Given grpc/grpc-go#2638
this means that the rpc test for keepalives now takes over 20s.

I would be okay skipping it but leave that discussion for review.

Also updated the acceptance test to look out for an HTTP/2.0 header because
grpc now does not send RPCs until after the HTTP handshake has completed
(see grpc/grpc-go#2406).

Release note (bug fix): Upgrade grpc library to fix connection state management
bug.
This test now takes 21 seconds which is too long.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/upgrade_grpc_to_1_21 branch from 4817f3e to 03aac64 Compare July 25, 2019 18:53
@ajwerner
Copy link
Contributor Author

Okay, good news, I got grpc/grpc-go#2932 merged, backported, and released as 1.21.2. I also defeated the linter. This is now truly RFAL. @jordanlewis if you want to pass this off to another reviewer feel free.

@ajwerner
Copy link
Contributor Author

@bdarnell or @asubiotto interested in having a look?

Copy link
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: this seems like it closes #31361, right?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

Copy link
Member

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 11 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

@ajwerner
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Jul 30, 2019
39041: vendor: upgrade grpc from 1.13.0 to 1.21.2 r=ajwerner a=ajwerner

This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this
upgrade is to eliminate the disconnections caused by
grpc/grpc-go#1882. These failures manifest themselves
as the following set of errors:

```
ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE
```
Which then lead to tripped breakers and general badness. I suspect that there
are several other good bug fixes in here, including some purported leaks and
correctness fixes on shutdown.

I have verified that with this upgrade I no longer see connections break in
overload scenarios which reliably reproduced the situation in the above log.

This commit removes one condition from grpcutil.IsClosedConnection which should
be subsumed by the status check above. The `transport` subpackage has not been
around for many releases.

This does not upgrade to the current release 1.22.0 because the maintainer
mentions that it contains a bug
(grpc/grpc-go#2663 (comment)).

Release note (bug fix): Upgrade grpc library to fix connection state management
bug.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@ajwerner
Copy link
Contributor Author

TFTR! I'm going to close #31361 in favor of #34229 and #38602.

@craig
Copy link
Contributor

craig bot commented Jul 30, 2019

Build succeeded

@craig craig bot merged commit 03aac64 into cockroachdb:master Jul 30, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 23, 2019
A test failure has been observed with the following error:

```
503 Service Unavailable, content-type: application/json, body: {
                  "error": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"",
                  "message": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"",
                  "code": 14,
                  "details": [
                  ]
                }, error: <nil>
```

That error makes some sense if the DefaultClass connection has failed
for some reason. In the fullness of time we should get to the bottom
of why these gRPC connections still close when overloaded. We had
hoped that cockroachdb#39041 would be the end of that but unfortunately it
still seems to happen sometimes. That being said, the situation
resolves itself rapidly when the load stops. This PR adds a retry
loop to make the system robust to these transient failures.

It's also possible that we should run timeseries queries as
SystemClass operations but I'll also leave that for another
change.

Release note: None
nvanbenschoten pushed a commit to ajwerner/cockroach that referenced this pull request Aug 26, 2019
A test failure has been observed with the following error:

```
503 Service Unavailable, content-type: application/json, body: {
                  "error": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"",
                  "message": "all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: failed to write client preface: io: read/write on closed pipe\"",
                  "code": 14,
                  "details": [
                  ]
                }, error: <nil>
```

That error makes some sense if the DefaultClass connection has failed
for some reason. In the fullness of time we should get to the bottom
of why these gRPC connections still close when overloaded. We had
hoped that cockroachdb#39041 would be the end of that but unfortunately it
still seems to happen sometimes. That being said, the situation
resolves itself rapidly when the load stops. This PR adds a retry
loop to make the system robust to these transient failures.

It's also possible that we should run timeseries queries as
SystemClass operations but I'll also leave that for another
change.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants