Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in WriteStatus seems to lead to invalid responses from server #5511

Closed
howardjohn opened this issue Jul 18, 2022 · 5 comments · Fixed by #5513
Closed

Race condition in WriteStatus seems to lead to invalid responses from server #5511

howardjohn opened this issue Jul 18, 2022 · 5 comments · Fixed by #5513
Assignees

Comments

@howardjohn
Copy link
Contributor

What version of gRPC are you using?

I ran into this on v1.47.0. The same issue was reported spiffe/spire#3229 on v1.48.0 though.

What version of Go are you using (go version)?

1.18

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I don't yet have a trivial reproducer - I think it should be possible to setup but I wanted to get this written up before I forget.

We have a bi-di stream. It happens to be serving XDS, but I don't think that is relevant. Our server code is essentially:

func StreamSecrets(stream sds.SecretDiscoveryService_StreamSecretsServer) {
go func() {
  stream.Recv()
}()
return fmt.Errorf("bad!")
}

The real logic is a lot more complicated, but basically we send an error back and are calling Recv() concurrently seem to be the two important aspects.

What did you expect to see?

We expect the client to get the error "bad!"

What did you see instead?

Clients get Status code="Internal" msg="malformed header: missing HTTP status; malformed header: missing HTTP content-type"


I did some debugging here, and I believe I understand what is going on here. However, every time I think there is a gRPC bug its usually a bug in our app, so take it with a grain of salt...

I annotated some key points in the grpc code with some extra logs. The results are below:

2022-07-18T23:43:44.734089Z     info    ads     All caches have been synced up in 232.572506ms, marking server ready
2022-07-18T23:43:44.737805Z     info    sds     Starting SDS grpc server
2022-07-18T23:43:44.738124Z     info    ads     XDS: Incremental Pushing:0 ConnectedEndpoints:0 Version:
howardjohn start process loop (*transport.incomingSettings)
howardjohn process loop &{ss:[]} (*transport.incomingSettings): <nil>
howardjohn frame  [FrameHeader SETTINGS flags=ACK len=0]
howardjohn start process loop (*transport.incomingSettings)
howardjohn process loop &{ss:[{ID:5 Val:16384}]} (*transport.incomingSettings): <nil>
howardjohn start process loop (*transport.headerFrame)
howardjohn process loop &{streamID:1 hf:[{Name::method Value:POST Sensitive:false} {Name::scheme Value:http Sensitive:false} {Name::path Value:/envoy.service.secret.v3.SecretDiscoveryService/StreamSecrets Sensitive:false} {Name::authority Value:./var/run/secrets/workload-spiffe-uds/socket Sensitive:false} {Name:content-type Value:application/grpc Sensitive:false} {Name:user-agent Value:grpc-go/1.47.0 Sensitive:false} {Name:te Value:trailers Sensitive:false}] endStream:false initStream:0x229b8c0 onWrite:<nil> wq:0xc000c50200 cleanup:<nil> onOrphaned:0x229bbc0} (*transport.headerFrame): <nil>
howardjohn start process loop (*transport.dataFrame)
howardjohn process loop &{streamID:1 endStream:false h:[0 0 0 0 139] d:[18 58 10 54 115 105 100 101 99 97 114 126 49 46 49 46 49 46 49 126 116 101 115 116 46 100 101 102 97 117 108 116 126 100 101 102 97 117 108 116 46 115 118 99 46 99 108 117 115 116 101 114 46 108 111 99 97 108 26 0 26 7 100 101 102 97 117 108 116 34 68 116 121 112 101 46 103 111 111 103 108 101 97 112 105 115 46 99 111 109 47 101 110 118 111 121 46 101 120 116 101 110 115 105 111 110 115 46 116 114 97 110 115 112 111 114 116 95 115 111 99 107 101 116 115 46 116 108 115 46 118 51 46 83 101 99 114 101 116] onEachWrite:<nil>} (*transport.dataFrame): <nil>
howardjohn start process loop (*transport.registerStream)
howardjohn process loop &{streamID:1 wq:0xc000d06260} (*transport.registerStream): <nil>
howardjohn start process loop (*transport.outgoingWindowUpdate)
2022-07-18T23:43:44.747016Z     debug   ads     Unauthenticated XDS: @
howardjohn process loop &{streamID:0 increment:144} (*transport.outgoingWindowUpdate): <nil>
howardjohn start process loop (*transport.ping)
howardjohn process loop &{ack:false data:[2 4 16 16 9 14 7 7]} (*transport.ping): <nil>
howardjohn frame  [FrameHeader WINDOW_UPDATE len=4]
howardjohn frame  [FrameHeader PING len=8]
howardjohn start process loop (*transport.incomingWindowUpdate)
howardjohn process loop &{streamID:0 increment:144} (*transport.incomingWindowUpdate): <nil>
howardjohn start process loop (*transport.ping)
howardjohn process loop &{ack:true data:[2 4 16 16 9 14 7 7]} (*transport.ping): <nil>
2022-07-18T23:43:44.748969Z     warn    model   Istio Version is not found in metadata for test.default, which may have undesirable side effects
2022-07-18T23:43:44.749159Z     info    ads     ADS: new connection for node:test.default-1
2022-07-18T23:43:44.749275Z     debug   ads     ADS:SDS: REQ test.default-1 resources:1 nonce: version:
2022-07-18T23:43:44.749375Z     debug   ads     ADS:SDS: INIT/RECONNECT test.default-1
2022-07-18T23:43:44.749493Z     debug   ads     SDS: SKIP request for node:test.default resource:default
2022-07-18T23:43:44.751622Z     error   howardjohn: complete stream: failed to generate secret for default: resource default not found
howardjohn header set
howardjohn write status [header field ":status" = "200" header field "content-type" = "application/grpc" header field "grpc-status" = "2" header field "grpc-message" = "failed to generate secret for default: resource default not found"], streamID:1, sent:false
goroutine 65 [running]:
runtime/debug.Stack()
        /usr/local/google/home/howardjohn/sdk/go1.18/src/runtime/debug/stack.go:24 +0x72
google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0xc0010a8000, 0xc000b2e000, 0xc000d0c0d8)
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/internal/transport/http2_server.go:1051 +0xa8e
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000bdf6c0, {0x61315e0, 0xc0010a8000}, 0xc000b2e000, 0xc000ff58f0, 0x7d93500, 0x0)
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1565 +0x1f0b
google.golang.org/grpc.(*Server).handleStream(0xc000bdf6c0, {0x61315e0, 0xc0010a8000}, 0xc000b2e000, 0x0)
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1624 +0xf8a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922 +0xed
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:920 +0x4be

howardjohn start write status [header field ":status" = "200" header field "content-type" = "application/grpc" header field "grpc-status" = "2" header field "grpc-message" = "failed to generate secret for default: resource default not found"], streamID:1, sent:false
howardjohn header set
howardjohn write status [header field "grpc-status" = "1" header field "grpc-message" = "context canceled"], streamID:1, sent:true
goroutine 114 [running]:
runtime/debug.Stack()
        /usr/local/google/home/howardjohn/sdk/go1.18/src/runtime/debug/stack.go:24 +0x72
google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0xc0010a8000, 0xc000b2e000, 0xc000e6e008)
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/internal/transport/http2_server.go:1051 +0xa8e
google.golang.org/grpc.(*serverStream).RecvMsg.func1()
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/stream.go:1580 +0x539
google.golang.org/grpc.(*serverStream).RecvMsg(0xc0003d3450, {0x598db40?, 0xc000eca640})
        /usr/local/google/home/howardjohn/go/pkg/mod/google.golang.org/grpc@v1.47.0/stream.go:1606 +0x3cb
github.com/envoyproxy/go-control-plane/envoy/service/secret/v3.(*secretDiscoveryServiceStreamSecretsServer).Recv(0xc000a4e9c0)
        /usr/local/google/home/howardjohn/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.10.3-0.20220630171008-2ce288a42860/envoy/service/secret/v3/sds.pb.go:373 +0x68
istio.io/istio/pilot/pkg/xds.(*DiscoveryServer).receive(0xc000e8db80, 0xc000eca460, {0x0, 0x0, 0x0})
        /usr/local/google/home/howardjohn/go/src/istio.io/istio/pilot/pkg/xds/ads.go:147 +0x112
created by istio.io/istio/pilot/pkg/xds.(*DiscoveryServer).Stream
        /usr/local/google/home/howardjohn/go/src/istio.io/istio/pilot/pkg/xds/ads.go:298 +0xb76

howardjohn start write status [header field "grpc-status" = "1" header field "grpc-message" = "context canceled"], streamID:1, sent:true
howardjohn executed write status [header field "grpc-status" = "1" header field "grpc-message" = "context canceled"], streamID:1, sent:true, success:true/<nil>
2022-07-18T23:43:44.752983Z     info    ads     ADS: "@" test.default-1 terminated
howardjohn executed write status [header field ":status" = "200" header field "content-type" = "application/grpc" header field "grpc-status" = "2" header field "grpc-message" = "failed to generate secret for default: resource default not found"], streamID:1, sent:false, success:true/<nil>
howardjohn start process loop (*transport.headerFrame)
howardjohn process loop &{streamID:1 hf:[{Name:grpc-status Value:1 Sensitive:false} {Name:grpc-message Value:context canceled Sensitive:false}] endStream:true initStream:<nil> onWrite:0x22c0d00 wq:<nil> cleanup:0xc000d015c0 onOrphaned:<nil>} (*transport.headerFrame): <nil>
howardjohn frame  [FrameHeader HEADERS flags=END_STREAM|END_HEADERS stream=1 len=35]
howardjohn header field "grpc-status" = "1"
howardjohn header field "grpc-message" = "context canceled"
howardjohn frame  [FrameHeader RST_STREAM stream=1 len=4]
2022-07-18T23:43:44.761797Z     warn    ads     ads received error: rpc error: code = Internal desc = malformed header: missing HTTP status; malformed header: missing HTTP content-type
howardjohn start process loop (*transport.cleanupStream)
howardjohn process loop &{streamID:1 rst:true rstCode:1 onWrite:0x229c400} (*transport.cleanupStream): <nil>
    sdsservice_test.go:219: didn't get expected error; got rpc error: code = Internal desc = malformed header: missing HTTP status; malformed header: missing HTTP content-type
howardjohn start process loop (*transport.cleanupStream)
howardjohn process loop &{streamID:1 rst:false rstCode:0 onWrite:0x22af7a0} (*transport.cleanupStream): <nil>
howardjohn loopy err connection error: desc = "transport is closing"
howardjohn loopy err connection error: desc = "transport is closing"
2022-07-18T23:43:45.266704Z     info    sds     SDS server for workload certificates started, listening on "./var/run/secrets/workload-spiffe-uds/socket"

What we see is we get two calls to WriteStatus. One is from our application returning the error, and then a later from Recv (

grpc-go/stream.go

Line 1588 in f601dfa

ss.t.WriteStatus(ss.s, st)
). These are done in the correct order. However, they are not completed in the correct order.

We have a mutex here:

s.hdrMu.Unlock()
, but its unlocked before we actually execute the send headers. Inside this mutex, we check if we have to send the initial :status header:
if !s.updateHeaderSent() { // No headers have been sent.
.

The order of events that happens is (t1/t2 are different goroutines):

* t1: WriteStatus(userError)
* t1: lock header mutex
* t1: updateHeaderSent=false, so add `:status`
* t1: compute headerFields with `:status`
* t1: unlock header mutex
* t2: WriteStatus(Recv error)
* t2: lock header mutex
* t2: updateHeaderSent=true, so do NOT add `:status`
* t2: unlock header mutex
* t2: finishStream() called

At this point, we sent an initial header frame without status. This ends up being rejected on the client side:

httpStatusErr = "malformed header: missing HTTP status"
.

To validate this idea, I added a mutex covering finishStream. This is probably a bad idea for real world performance, but was just for testing this. With this, I was unable to reproduce

@howardjohn
Copy link
Contributor Author

Actually was able to put together a minimal reproducer:

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"strings"
	"testing"

	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	sds "github.com/envoyproxy/go-control-plane/envoy/service/secret/v3"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

type service struct {
}

func (s *service) DeltaSecrets(server sds.SecretDiscoveryService_DeltaSecretsServer) error {
	//TODO implement me
	panic("implement me")
}

func (s *service) FetchSecrets(ctx context.Context, request *discovery.DiscoveryRequest) (*discovery.DiscoveryResponse, error) {
	//TODO implement me
	panic("implement me")
}

func (s *service) StreamSecrets(stream sds.SecretDiscoveryService_StreamSecretsServer) error {
	go func() {
		_, err := stream.Recv()
		log.Println(err)
	}()
	return fmt.Errorf("test error")
}

func TestGRPCRace(t *testing.T) {
	srv := &service{}
	rpcs := grpc.NewServer()
	sds.RegisterSecretDiscoveryServiceServer(rpcs, srv)
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	go func() {
		rpcs.Serve(l)
	}()
	con, err := grpc.Dial(l.Addr().String(), grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatal(err)
	}
	xds := sds.NewSecretDiscoveryServiceClient(con)
	c, err := xds.StreamSecrets(context.Background())

	if err != nil {
		t.Fatal(err)
	}
	_, err = c.Recv()
	if !strings.Contains(err.Error(), "test error") {
		t.Fatalf("unxpected error %v", err)
	}
}

go test -c . -race; stress ./main.test should reproduce it pretty quickly

@howardjohn
Copy link
Contributor Author

bisect leads to #5292 as the regression cc @idiamond-stripe

@dfawley
Copy link
Member

dfawley commented Jul 19, 2022

Thank you for the report and the reproduction case @howardjohn. I'll work on a fix for this today.

@dfawley
Copy link
Member

dfawley commented Jul 19, 2022

#5513 sent to fix it. It should be fine to hold the stream's header mutex during these entire functions; they are very infrequently called and the mutex is very limited in scope.

@dfawley
Copy link
Member

dfawley commented Jul 19, 2022

I wonder if we need a client-side change here, too... E.g. to ignore the extra headers on a stream after END_STREAM was already sent, but possibly to kill the connection or send a RST_STREAM or both since the server seems to be violating the HTTP/2 protocol in this case.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants