
Surface TLS errors to RPC errors #4163

Closed

avalchev94 opened this issue Jan 19, 2021 · 23 comments

@avalchev94

Hello, I was trying to check what error is returned from the server when the client's certificate is expired. The error I got was the generic transport is closing. At first I thought the error was deliberately not exposed, for security reasons. However, after some research, I found that the standard library tls package sends alerts to the client; in this case the alert was alertBadCertificate. So far so good, but it turned out that the gRPC HTTP/2 client disregards that alert and does not expose it to the end user. This is the code where the error/alert is discarded:

// .../google.golang.org/grpc@v1.27.0/internal/transport/http2_client.go:1242

func (t *http2Client) reader() {
	defer close(t.readerDone)
	// Check the validity of server preface.
	frame, err := t.framer.fr.ReadFrame()
	if err != nil {
		t.Close() // this kicks off resetTransport, so must be last before return
		return
	}
	t.conn.SetReadDeadline(time.Time{}) // reset deadline once we get the settings frame (we didn't time out, yay!)
	if t.keepaliveEnabled {
		atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
	}
	sf, ok := frame.(*http2.SettingsFrame)
	if !ok {
		t.Close() // this kicks off resetTransport, so must be last before return
		return
	}
// ....

After debugging, I found that at the line frame, err := t.framer.fr.ReadFrame() the alert is received as an error but never used; the transport is simply closed without any further info. Is that on purpose? If not, I am willing to contribute to the project and fix it.

@menghanl
Contributor

Thanks for filing the issue. It would be better to return this error to the users.

when the client's certificate is expired

When this happens, I would think the connection creation would fail due to a handshake error:

conn, authInfo, err = transportCreds.ClientHandshake(connectCtx, addr.ServerName, conn)

If the error happens here, it will be surfaced to the users.
Can you find out what happened at this line in your test? Did this fail? And if not, why?
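
(For reference, the client-side handshake hook is the transport credentials interface, abridged here from google.golang.org/grpc/credentials:)

type TransportCredentials interface {
	// An error returned here fails connection creation and is
	// surfaced to the caller.
	ClientHandshake(ctx context.Context, authority string, rawConn net.Conn) (net.Conn, AuthInfo, error)
	ServerHandshake(rawConn net.Conn) (net.Conn, AuthInfo, error)
	Info() ProtocolInfo
	Clone() TransportCredentials
	OverrideServerName(string) error
}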

@avalchev94
Author

It does not fail in ClientHandshake, because the server certificate is good and code execution reaches the start of the reader.

It fails on the server, in ServerHandshake, during the client's certificate verification: https://github.com/golang/go/blob/824f2d635ca96a7bded6ed039f8a9c0f06ee1443/src/crypto/tls/handshake_server.go#L799-L803

As you can see there, when the client certificate is bad, the server sends an alert to the client: alertBadCertificate. I was debugging why that alert is not at least logged on the client, and found that the http2Client reader receives the alert (as an error), but it is never shown to the end user:

frame, err := t.framer.fr.ReadFrame()
if err != nil {
	t.Close() // this kicks off resetTransport, so must be last before return
	return
}

@dfawley
Member

dfawley commented Jan 20, 2021

Yes, it would be good if we could surface this error (and other non-generic connection errors) to users via RPC errors (instead of just logs, as we've mostly done historically).

Possibly by passing an error to *http2Client.Close() and using it here:

t.closeStream(s, ErrConnClosing, false, http2.ErrCodeNo, status.New(codes.Unavailable, ErrConnClosing.Desc), nil, false)
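
A rough sketch of that direction (hypothetical, not an actual patch): Close would accept the reason for closure, and closeStream would carry it in the status instead of the generic ErrConnClosing description:

func (t *http2Client) Close(err error) {
	// ... existing teardown ...
	// Carry the real reason (e.g. the TLS alert) into the status that
	// pending RPCs will see, instead of ErrConnClosing.Desc.
	st := status.New(codes.Unavailable, err.Error())
	for _, s := range streams { // streams: snapshot of active streams
		t.closeStream(s, ErrConnClosing, false, http2.ErrCodeNo, st, nil, false)
	}
}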

However, for this particular situation, I don't think this will be enough - RPCs may not have been dispatched to this transport yet.

To propagate this to future RPCs (assuming this is the only connection), we would need to make sure the error is sent to the LB policy. For that, updateConnectivityState must be called with the TLS error:

func (ac *addrConn) updateConnectivityState(s connectivity.State, lastErr error) {

We may be able to do that by making the onClose callback, given to the transport, accept an error:

onClose := func() {

There's a little more to this, still: if a connection error happens after we had a READY connection, we will immediately transition back to CONNECTING, so this will be lost. In this situation, we don't have a READY connection yet as the server preface was never received. So, we need to make sure the error given to onClose is passed along and returned from createTransport here, probably by changing onCloseCalled into a chan error instead of just a signaling channel:

grpc-go/clientconn.go (lines 1337 to 1339 at 504caa9):

	case <-onCloseCalled:
		// The transport has already closed - noop.
		return nil, nil, errors.New("connection closed")
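
Putting those pieces together, a hypothetical sketch of the plumbing (identifiers borrowed from the code as it stood then; not the actual change):

// The transport's close callback now carries the reason:
onCloseCalled := make(chan error, 1) // was a bare signal channel
onClose := func(err error) {
	onCloseCalled <- err
	// ... existing cleanup ...
}

// And createTransport returns the real reason instead of a canned message:
select {
case err := <-onCloseCalled:
	// The transport has already closed - surface the actual error.
	return nil, nil, err
	// ... other cases elided ...
}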

All in all, this is a pretty extensive change touching many layers, but should be mostly straightforward.

Also, a reasonable end-to-end test (in grpc/test) for this shouldn't be too hard - we just need fake client-side credentials that complete their handshake without traffic but return an error from the first Read (a sketch follows below).
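
A minimal sketch of such fake credentials, with hypothetical names (readFailConn, failingCreds) rather than the actual grpc/test fixture:

package creds_test // illustrative only

import (
	"context"
	"errors"
	"net"

	"google.golang.org/grpc/credentials"
)

// readFailConn lets the handshake "succeed" but fails the first Read,
// mimicking a server that accepts the connection and then sends a fatal
// TLS alert.
type readFailConn struct {
	net.Conn
}

func (c *readFailConn) Read(b []byte) (int, error) {
	return 0, errors.New("remote error: tls: bad certificate")
}

type fakeAuthInfo struct{}

func (fakeAuthInfo) AuthType() string { return "fake" }

// failingCreds completes its client-side handshake without any traffic.
type failingCreds struct{}

func (failingCreds) ClientHandshake(_ context.Context, _ string, rawConn net.Conn) (net.Conn, credentials.AuthInfo, error) {
	return &readFailConn{Conn: rawConn}, fakeAuthInfo{}, nil
}

func (failingCreds) ServerHandshake(rawConn net.Conn) (net.Conn, credentials.AuthInfo, error) {
	return rawConn, fakeAuthInfo{}, nil
}

func (failingCreds) Info() credentials.ProtocolInfo            { return credentials.ProtocolInfo{} }
func (c failingCreds) Clone() credentials.TransportCredentials { return c }
func (failingCreds) OverrideServerName(string) error           { return nil }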

@menghanl does that all sound sensible to you? @avalchev94 would you still be interested in attempting something like this, if so?

@avalchev94
Author

@dfawley I am in. However, it will take time, because I need to get familiar with the project. Do you have a group chat where I can ask questions if I get stuck in the code? If not, is this an appropriate place for such communication?

In addition, I would like to implement that small feature as well, because I believe it's really important to be able to check the error's type instead of comparing strings.

@dfawley
Member

dfawley commented Jan 21, 2021

@avalchev94 we don't have a chat for grpc-go that we use regularly; this thread should be fine. If you would like to chat in real time, one option would be to use our gitter channel. We don't usually poll that, so if you want to chat with us there, let us know first so we know to be on the lookout for messages.

I'm not sure about the other feature request. I'll have to take a closer look. I'm not sure if that will be feasible, and while these two things are related, they are not dependent on one another.

@easwars
Contributor

easwars commented May 3, 2021

@avalchev94 : Are you still interested in working on this?
I do see that the other issue you referred to has been fixed.

@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label May 25, 2021
@dfawley dfawley changed the title Why does HTTP2 client disregard tls alerts sent from the server? Surface TLS errors to RPC errors May 25, 2021
@dfawley dfawley removed the stale label May 25, 2021
@dfawley
Member

dfawley commented May 25, 2021

With #4311, I don't think there is too much left to do here.

@anitgandhi

Hey folks; we use mTLS quite heavily in our organization, so we've run into this issue quite a bit as of late. Namely, instead of getting the TLS alert code, we just get connection closed before server preface received as the error message, both with and without WithBlock(). We're using v1.43.x.

It seems like the last piece after #4311 is to propagate the error further to ClientConn and/or LB policy.

Is anyone working on this already? If not, I could take a stab at it - though I admit it may take me a while to get used to the guts of the transport/conn code 😄

@easwars
Contributor

easwars commented May 9, 2022

None of us are working on this actively, and there is little chance that someone will pick it up this quarter. So please go ahead and take a shot at it. If you get stuck somewhere, feel free to reach out to us on this thread.

@srimaln91

Hi everyone. As @anitgandhi reported, we need the same feature so that we can display the TLS error on the client side. Is anyone working on this? I would also like to contribute, but it will take me a while to go through the code and get used to it.

@easwars
Contributor

easwars commented Nov 12, 2022

@anitgandhi @srimaln91
I will take another look at this issue, summarize what still needs to be done given the current state of affairs, and report back soon. Thanks.

@easwars easwars self-assigned this Nov 12, 2022
@easwars easwars removed the fixit label Nov 16, 2022
@easwars
Contributor

easwars commented Nov 17, 2022

I tried running our mTLS example where the client establishes an mTLS connection with the server. I used faketime on the server to manipulate the system time and make it appear as though the certificate presented by the client is expired.

This is how the client logs look:

easwars@easwars-macbookpro3:client $ GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=2 go run main.go
2022/11/16 17:40:05 INFO: [core] [Channel #1] Channel created
2022/11/16 17:40:05 INFO: [core] [Channel #1] original dial target is: "localhost:50051"
2022/11/16 17:40:05 INFO: [core] [Channel #1] parsed dial target is: {Scheme:localhost Authority: Endpoint:50051 URL:{Scheme:localhost Opaque:50051 User: Host: Path: RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/11/16 17:40:05 INFO: [core] [Channel #1] fallback to scheme "passthrough"
2022/11/16 17:40:05 INFO: [core] [Channel #1] parsed dial target is: {Scheme:passthrough Authority: Endpoint:localhost:50051 URL:{Scheme:passthrough Opaque: User: Host: Path:/localhost:50051 RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
2022/11/16 17:40:05 INFO: [core] [Channel #1] Channel authority set to "x.test.example.com"
2022/11/16 17:40:05 INFO: [core] [Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "localhost:50051",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Type": 0,
      "Metadata": null
    } 
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2022/11/16 17:40:05 INFO: [core] [Channel #1] Channel switches to new LB policy "pick_first"
2022/11/16 17:40:05 INFO: [core] [Channel #1 SubChannel #2] Subchannel created
2022/11/16 17:40:05 INFO: [core] blockingPicker: the picked transport is not ready, loop back to repick
2022/11/16 17:40:05 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2022/11/16 17:40:05 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "localhost:50051" to connect
2022/11/16 17:40:05 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc0001929f0, {CONNECTING <nil>}
2022/11/16 17:40:05 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2022/11/16 17:40:05 INFO: [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2022/11/16 17:40:05 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "localhost:50051",
  "ServerName": "x.test.example.com",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "error reading server preface: remote error: tls: bad certificate"
2022/11/16 17:40:05 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE
2022/11/16 17:40:05 INFO: [core] pickfirstBalancer: UpdateSubConnState: 0xc0001929f0, {TRANSIENT_FAILURE connection error: desc = "error reading server preface: remote error: tls: bad certificate"}
2022/11/16 17:40:05 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE
2022/11/16 17:40:05 client.UnaryEcho(_) = _, rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: remote error: tls: bad certificate":
exit status 1

You can see the following WARNING log on the client:

2022/11/16 17:40:05 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "localhost:50051",
  "ServerName": "x.test.example.com",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "error reading server preface: remote error: tls: bad certificate"

And the actual error seen by the RPC:

2022/11/16 17:40:05 client.UnaryEcho(_) = _, rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: remote error: tls: bad certificate":

And this is how I run the server and these are the logs:

easwars@easwars-macbookpro3:server $
easwars@easwars-macbookpro3:server $ GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=2 faketime '2052-11-11 00:00:00' go run main.go
2052/11/11 00:00:05 server starting on port 50051...
2052/11/11 00:00:05 INFO: [core] [Server #1] Server created
2052/11/11 00:00:05 INFO: [core] [Server #1 ListenSocket #2] ListenSocket created
2052/11/11 00:00:25 INFO: [core] [Server #1] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"127.0.0.1:54930\") failed: tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2052-11-11T00:00:25-08:00 is after 2032-03-15T21:44:59Z"

Looks like we are surfacing the error on the client side and logging it on both ends.

@easwars
Contributor

easwars commented Nov 17, 2022

As @dfawley mentioned in the other comment, these changes on the client were made possible by #4311.

@anitgandhi @srimaln91 @avalchev94
At this point, I'm tempted to say that things are working as expected now. Please let us know if you think otherwise.

@anitgandhi

Thanks @easwars; admittedly maybe I should make this a separate issue, but in our org the error we see is typically connection closed before server preface received, on Dial, especially with grpc.WithReturnConnectionError. It's there that we'd prefer to see error reading server preface: remote error: tls: bad certificate surfaced.
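
For context, the dial pattern in question looks roughly like this (a sketch; target and creds are placeholders):

package main // illustrative sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// dial blocks until the first connection attempt resolves.
// WithReturnConnectionError implies WithBlock and makes Dial return the
// underlying transport error (ideally the TLS alert) on failure, instead
// of only a generic deadline error.
func dial(target string, creds credentials.TransportCredentials) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(creds),
		grpc.WithReturnConnectionError(),
	)
}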

@easwars
Contributor

easwars commented Nov 17, 2022

What version of gRPC are you running?

Can you provide us with a way to reproduce the error that you are seeing?

@anitgandhi

It's happening for me on both v1.48.0 and v1.50.1

I'll see if I can make a minimal reproducer program

@srimaln91

@easwars @anitgandhi I already have a simple server and client program written to reproduce the issue. Expired certificates are included and configured in the code. Please find the attached .zip file.

grpc-test.zip

Run the server

go run cmd/server/main.go

Run the client

go run cmd/client/main.go

@easwars
Contributor

easwars commented Nov 17, 2022

Thank you @srimaln91 for the repro.

What is happening is a little weird and I haven't gotten to the bottom of it yet. In our examples' go.mod, we use a replace directive to pick up the local copy of grpc-go. When I remove the replace directive and make the examples use the latest released grpc version, I see the same behavior you do, with an error that looks like this:

2022/11/17 15:39:12 client.UnaryEcho(_) = _, rpc error: code = Unavailable desc = connection closed before server preface received: 

@easwars
Contributor

easwars commented Nov 18, 2022

We made some recent changes in how we handle transport creation failures. See: #5731

I tested with your repro, and did the following to update to the most recent commit on the grpc-go repo:

go get -u google.golang.org/grpc@817c1e8c417e41bc4e697dae9c94645cbd059639

Now, I do see error reading server preface: remote error: tls: bad certificate. Please try this and let me know what you are seeing.


@easwars easwars assigned srimaln91 and unassigned easwars Nov 18, 2022
@srimaln91

@easwars Thanks for the prompt update. I can see the error now.

panic: context deadline exceeded: connection error: desc = "error reading server preface: remote error: tls: bad certificate"

@anitgandhi

I can confirm that with v1.51.0 the issue is resolved for us as well. Thank you so much!

@easwars
Contributor

easwars commented Nov 21, 2022

Wohoo !! Thank you all !!

@easwars easwars closed this as completed Nov 21, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 21, 2023