
roachtest: tpcc/w=100/nodes=3/chaos=true failed #80689

Closed
cockroach-teamcity opened this issue Apr 28, 2022 · 9 comments · Fixed by #80878
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). deprecated-branch-release-22.1.0 O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 28, 2022

roachtest.tpcc/w=100/nodes=3/chaos=true failed with artifacts on release-22.1.0 @ e21edbd526db0c8e7ee8e52e6083af626918f2ca:

The test failed on branch=release-22.1.0, cloud=gce:
test artifacts and logs in: /artifacts/tpcc/w=100/nodes=3/chaos=true/run_1
	monitor.go:127,tpcc.go:259,tpcc.go:657,test_runner.go:875: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 134)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCC
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:259
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func10
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:657
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 3: dead (exit status 134)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-15953

@cockroach-teamcity cockroach-teamcity added deprecated-branch-release-22.1.0 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 28, 2022
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV Apr 28, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Apr 28, 2022
@tbg tbg removed this from roachtest/unit test backlog in KV Apr 28, 2022
@tbg tbg added this to To do in Observability Infrastructure via automation Apr 28, 2022
@tbg
Member

tbg commented Apr 28, 2022

I220428 05:48:44.487000 1 util/log/flags.go:202  [-] 1  stderr capture started
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xab47ce]

goroutine 597421 [running]:
panic({0x474a820, 0x9023810})
	GOROOT/src/runtime/panic.go:1147 +0x3a8 fp=0xc019faff00 sp=0xc019fafe40 pc=0x48a768
runtime.panicmem(...)
	GOROOT/src/runtime/panic.go:221
runtime.sigpanic()
	GOROOT/src/runtime/signal_unix.go:735 +0x327 fp=0xc019faff50 sp=0xc019faff00 pc=0x4a1187
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
	google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:403
google.golang.org/grpc.(*clientStream).Context(0xc017018500)
	google.golang.org/grpc/external/org_golang_google_grpc/stream.go:668 +0x2e fp=0xc019faff68 sp=0xc019faff50 pc=0xab47ce
github.com/cockroachdb/cockroach/pkg/util/tracing.newTracingClientStream.func2()
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:389 +0x38 fp=0xc019faffe0 sp=0xc019faff68 pc=0xb8e258
runtime.goexit()
	GOROOT/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc019faffe8 sp=0xc019faffe0 pc=0x4bf181
created by github.com/cockroachdb/cockroach/pkg/util/tracing.newTracingClientStream
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:384 +0x151

cc @cockroachdb/obs-inf-prs @andreimatei

@andreimatei
Contributor

We've seen this before in #80306.
It's an NPE inside gRPC; I'm looking at whether anything around it changed.

@andreimatei
Contributor

The crash comes from gRPC, here.
I think what may be going on is that we're breaking gRPC's Context() contract: we're calling ClientStream.Context() here, perhaps in violation of its documented contract, which says:

// It should not be called until after Header or RecvMsg has returned. Once
// called, subsequent client-side retries are disabled.

But I can't tell exactly why that contract is spelled out the way it is, and it strikes me as odd that the crash we're seeing here is very rare. If the problem were as simple as calling cs.Context() before RecvMsg(), one would expect it to crash every time.
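For concreteness, a hypothetical illustration of the pattern that contract warns against (the package, connection handling, service name, and stream descriptor are all made up for the example, not taken from our code):

package contextsketch

import (
	"context"

	"google.golang.org/grpc"
)

// earlyContext opens a stream and immediately asks for its Context(), before
// Header() or RecvMsg() has returned, which is exactly the call the contract
// quoted above says to avoid. Besides skirting the contract, this pins the
// RPC attempt and disables gRPC's transparent retries.
func earlyContext(ctx context.Context, conn *grpc.ClientConn) error {
	desc := &grpc.StreamDesc{StreamName: "SomeStream", ServerStreams: true}
	cs, err := conn.NewStream(ctx, desc, "/some.Service/SomeStream")
	if err != nil {
		return err
	}
	_ = cs.Context() // called "too early" per the contract quoted above
	// ... a real caller would now send/receive on cs and eventually drain it ...
	return cs.CloseSend()
}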
Digging one level down, the Context() impl is:

cs.attempt.s.Context()

The attempt field has an interesting comment:

So, attempt can be nil only inside newClientStream function when clientStream is first created.

The point where we call Context() is after newClientStream(), so attempt should be fine. attempt.s also seems to be initialized early, as does s.ctx. So I don't know whether we're really at fault.
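For what it's worth, a nil pointer anywhere along that cs.attempt.s chain would produce exactly the signature in the stack trace above: Go happily invokes a method on a nil pointer receiver and only faults when a field is read through it, so the crash surfaces as a SIGSEGV at a small address inside Context() rather than as an explicit nil check. A minimal, hypothetical reduction (the struct layout is invented, not the real transport.Stream):

package main

import "context"

// stream stands in for the internal transport stream; the blank padding field
// exists only so that ctx lands at a small, non-zero offset, loosely mimicking
// the addr=0x20 in the report above.
type stream struct {
	_   [4]uintptr
	ctx context.Context
}

// Context is callable even when s == nil (Go allows method calls on nil
// pointer receivers); it is the field read that faults, which is why the
// crash shows up as a SIGSEGV inside Context() rather than a nil-check error.
func (s *stream) Context() context.Context { return s.ctx }

func main() {
	var s *stream
	_ = s.Context() // panic: runtime error: invalid memory address or nil pointer dereference
}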

Stepping back, this whole StreamClientInterceptor code sure seems dubious. It was copied years ago from grpc-opentracing, and I can't find a good history of the code over there. The code clearly struggles to figure out when the stream "is done" so that it can close the client tracing span, and it gets downright desperate when it installs a finalizer as some sort of catch-all, below.
More modern gRPC OpenTracing code does not have any of this dubious stuff.

The OpenTelemetry equivalent interceptor code does have a similar context-cancellation provision, added explicitly when someone complained about a goroutine leak when the stream is canceled, but that code uses the user's ctx, not the stream's internal ctx. That seems like the right thing to do to me, but I'm not sure.

I might ask the grpc-go people how this code is supposed to be written.
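For illustration, a minimal sketch of the caller-ctx approach discussed above, with a placeholder finishRPC callback standing in for our actual tracing span plumbing: the interceptor watches the caller's ctx and the stream's own termination, and never calls ClientStream.Context().

package tracingsketch

import (
	"context"
	"io"

	"google.golang.org/grpc"
)

// NewStreamClientInterceptor is a sketch, not the real interceptor: it decides
// that the RPC is "done" either when RecvMsg observes the stream's end or when
// the caller's ctx is canceled, and it never touches ClientStream.Context().
// finishRPC is a placeholder for finishing the client tracing span.
func NewStreamClientInterceptor(finishRPC func(method string, err error)) grpc.StreamClientInterceptor {
	return func(
		ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
		method string, streamer grpc.Streamer, opts ...grpc.CallOption,
	) (grpc.ClientStream, error) {
		cs, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			finishRPC(method, err)
			return nil, err
		}
		finished := make(chan error, 1)
		go func() {
			select {
			case err := <-finished:
				finishRPC(method, err)
			case <-ctx.Done():
				// The caller gave up on the RPC; its ctx is the signal we
				// watch, instead of the stream's internal ctx.
				finishRPC(method, ctx.Err())
			}
		}()
		return &tracedClientStream{ClientStream: cs, finished: finished}, nil
	}
}

// tracedClientStream signals completion when RecvMsg observes the end of the
// stream (io.EOF) or an error.
type tracedClientStream struct {
	grpc.ClientStream
	finished chan error
}

func (s *tracedClientStream) RecvMsg(m interface{}) error {
	err := s.ClientStream.RecvMsg(m)
	if err == io.EOF {
		s.signal(nil)
	} else if err != nil {
		s.signal(err)
	}
	return err
}

// signal delivers the terminal error at most once, without blocking.
func (s *tracedClientStream) signal(err error) {
	select {
	case s.finished <- err:
	default: // already signaled
	}
}

An interceptor along these lines would be installed on the client connection via grpc.WithStreamInterceptor.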

@andreimatei
Contributor

I've asked for grpc-go guidance in grpc/grpc-go#5324

@andreimatei
Contributor

Seems like
a) there is indeed a grpc-go bug that they're fixing
b) our code is suboptimal, and we don't need to be doing the thing that's triggering the bug

I will have a patch.

andreimatei added a commit to andreimatei/cockroach that referenced this issue May 2, 2022
Before this patch, our client-side tracing interceptor for streaming RPC
calls was exposed to gRPC bugs currently being fixed in
github.com/grpc/grpc-go/pull/5323. This had to do with calls to
clientStream.Context() panicking with an NPE under certain races with
RPCs failing. We've recently gotten two crashes seemingly because of
this. It's unclear why this hasn't shown up before, as nothing seems new
(either on our side or on the grpc side). In 22.2 we do use more
streaming RPCs than before (for example for span configs), so maybe that
does it.

This patch eliminates the problem by removing the
problematic call into ClientStream.Context(). The background is that
our interceptor needs to watch for ctx cancellation and consider the RPC
done at that point. But there was no reason for that call; we can more
simply use the RPC caller's ctx to figure out whether the
caller cancels the RPC. In fact, calling ClientStream.Context() is bad
for other reasons, beyond exposing us to the bug:
1) ClientStream.Context() pins the RPC attempt to a lower-level
connection, and inhibits gRPC's ability to sometimes transparently
retry failed calls. In fact, there's a comment on ClientStream.Context()
that tells you not to call it before using the stream through other
methods like Recv(), which imply that the RPC is already "pinned" and
transparent retries are no longer possible anyway. We were breaking
this.
2) One of the grpc-go maintainers suggested that, due to the bugs
referenced above, this call could actually sometimes give us "the
wrong context", although how wrong exactly is not clear to me (i.e.
could we have gotten a ctx that doesn't inherit from the caller's ctx?
Or a ctx that's canceled independently from the caller?)

This patch also removes a paranoid catch-all finalizer in the
interceptor that ensured that the RPC client span is always eventually
closed (at a random GC time), regardless of what the caller does - i.e.
even if the caller forgets about interacting with the response stream or
canceling the ctx.  This kind of paranoia is not needed. The code in
question was copied from grpc-opentracing[1], which quoted a
StackOverflow answer[2] about whether or not a client is allowed to
simply walk away from a streaming call. As a result of conversations
triggered by this patch [3], that SO answer was updated to reflect the fact
that it is not, in fact, OK for a client to do so, as it will leak gRPC
resources. The client's contract is specified in [4] (although arguably
that place is not the easiest to find by a casual gRPC user). In any
case, this patch gets rid of the finalizer. This could in theory result
in leaked spans if our own code is buggy in the way that the paranoia
prevented, but all our TestServers check that spans don't leak like that
so we are pretty protected. FWIW, a newer re-implementation of the
OpenTracing interceptor[5] doesn't have the finalizer (although it also
doesn't listen for ctx cancellation, so I think it's buggy), and neither
does the equivalent OpenTelemetry interceptor[6].

Fixes cockroachdb#80689

[1] https://github.com/grpc-ecosystem/grpc-opentracing/blob/8e809c8a86450a29b90dcc9efbf062d0fe6d9746/go/otgrpc/client.go#L174
[2] https://stackoverflow.com/questions/42915337/are-you-required-to-call-recv-until-you-get-io-eof-when-interacting-with-grpc-cl
[3] grpc/grpc-go#5324
[4] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
[5] https://github.com/grpc-ecosystem/go-grpc-middleware/blob/master/tracing/opentracing/client_interceptors.go#L37-L52
[6] https://github.com/open-telemetry/opentelemetry-go-contrib/blame/main/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L193

Release note: A rare crash indicating a nil-pointer dereference in
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
was fixed.
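For reference, the client-side contract cited in [4] above boils down to: either keep receiving until the stream returns a non-nil error, or cancel the ctx the stream was created with. A hedged sketch, with cancel and recv as placeholders for the caller's context.CancelFunc and the generated client's Recv method:

package streamsketch

import (
	"context"
	"io"
)

// DrainOrCancel shows the two permitted ways to finish with a client stream:
// drain it until recv returns a non-nil error (io.EOF on clean termination),
// or cancel the ctx it was created with. Simply walking away leaks gRPC
// resources. cancel is assumed to come from context.WithCancel around the ctx
// passed to NewStream; recv is assumed to wrap the generated client's Recv.
func DrainOrCancel(cancel context.CancelFunc, recv func() error) error {
	// Canceling is harmless if the stream has already terminated, and it
	// guarantees cleanup on every other return path.
	defer cancel()
	for {
		if err := recv(); err != nil {
			if err == io.EOF {
				return nil // server closed the stream cleanly
			}
			return err
		}
	}
}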
craig bot pushed a commit that referenced this issue May 3, 2022
80476: security: update the TLS cipher suite list r=bdarnell a=knz

Fixes #80483 

This does not really change the list; it merely explains more clearly
how it was built.

Release note: None

80635: release: use preflight tool for RedHat images r=celiala a=rail

RedHat Connect has introduced a new certification workflow, which
requires running some tests locally and submitting the results back to
RedHat.

This patch runs the `preflight` tool to address the new workflow
changes.

Fixes #80633

Release note: None

80878: util/tracing: fix crash in StreamClientInterceptor r=andreimatei a=andreimatei

Fixes #80689

Release note: A rare crash indicating a nil-pointer dereference in
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
was fixed.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@craig craig bot closed this as completed in 13a65d3 May 3, 2022
Observability Infrastructure automation moved this from To do to Done May 3, 2022
@celiala
Collaborator

celiala commented May 3, 2022

Hi @andreimatei @dhartunian -

I see the backport to release-22.1 (for v22.1.1, #80911), but I don't see a backport of this to release-22.1.0 (for v22.1.0). Just want to check in to see if we need another backport for v22.1.0? Or if this is slated for v22.1.1 only?

Thanks!

@celiala
Collaborator

celiala commented May 3, 2022

I'll re-open this issue so we can keep track of the open 22.1 backports.

@celiala celiala reopened this May 3, 2022
Observability Infrastructure automation moved this from Done to In progress May 3, 2022
@andreimatei
Contributor

I see the backport to release-22.1 (for v22.1.1, #80911), but I don't see a backport of this to release-22.1.0 (for v22.1.0). Just want to check in to see if we need another backport for v22.1.0? Or if this is slated for v22.1.1 only?

I forgot about this other branch. I'll do another backport.

@celiala
Collaborator

celiala commented May 4, 2022

Closing issue, as #80936 backports have been merged. Thank you!

@celiala celiala closed this as completed May 4, 2022
Observability Infrastructure automation moved this from In progress to Done May 4, 2022