Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: Poor RPC error messages when unable to get listener resource #5020

Closed
ejona86 opened this issue Dec 1, 2021 · 14 comments · Fixed by #5032
Closed

xds: Poor RPC error messages when unable to get listener resource #5020

ejona86 opened this issue Dec 1, 2021 · 14 comments · Fixed by #5032
Assignees
Labels
P2 Type: Feature New features or improvements in behavior

Comments

@ejona86
Copy link
Member

ejona86 commented Dec 1, 2021

This is related to #4939 (comment).

I created a trash bootstrap (td-grpc-bootstrap --gcp-project-number 5) and ran the xds client on my local machine without any credentials.

examples/features/xds/client$ time ./client 
2021/11/30 16:19:13 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded

real	0m10.157s
user	0m0.087s
sys	0m0.193s

I'd have hoped the RPC wouldn't need to timeout, because xds client is unable to do a successful RPC. I thought that was supposed to produce an error quickly from xds client. It looks like Java does produce an error quickly ("Permission denied on resource project #5"). So that seems like a bug, but let's work around it.

Changing the RPC's timeout in main.go to 30s, let's try again:

examples/features/xds/client$ time ./client 
2021/11/30 16:20:31 could not greet: rpc error: code = Unavailable desc = name resolver error: xds: ListenerResource target localhost:50051 not found, watcher timeout

real	0m15.147s
user	0m0.095s
sys	0m0.161s

That's... not that helpful. You have to be a grpc developer to understand that error message properly (others might think they understand it, but they'd be wrong). I could accept that error if the server was slow or something. But that's not the case.

examples/features/xds/client$ GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info ./client
...
2021/11/30 16:21:20 WARNING: xds: ADS stream creation failed: rpc error: code = Unauthenticated desc = transport: context canceled
2021/11/30 16:21:22 WARNING: xds: ADS stream creation failed: rpc error: code = Unauthenticated desc = transport: context canceled
2021/11/30 16:21:24 WARNING: xds: ADS stream creation failed: rpc error: code = Unauthenticated desc = transport: context canceled
2021/11/30 16:21:29 WARNING: xds: ADS stream creation failed: rpc error: code = Unauthenticated desc = transport: context canceled
2021/11/30 16:21:35 WARNING: xds: ADS stream creation failed: rpc error: code = Unauthenticated desc = transport: context canceled
2021/11/30 16:21:35 INFO: [xds] [xds-resolver 0xc00040d560] received LDS update: {
  "RouteConfigName": "",
  "InlineRouteConfig": null,
  "MaxStreamDuration": 0,
  "HTTPFilters": null,
  "InboundListenerCfg": null,
  "Raw": null
}, err: xds: ListenerResource target localhost:50051 not found, watcher timeout
2021/11/30 16:21:35 WARNING: [xds] [xds-resolver 0xc00040d560] Watch error on resource localhost:50051 from xds-client 0x173e760, xds: ListenerResource target localhost:50051 not found, watcher timeout
2021/11/30 16:21:35 WARNING: [core] ccResolverWrapper: reporting error to cc: xds: ListenerResource target localhost:50051 not found, watcher timeout
2021/11/30 16:21:35 INFO: [core] ClientConn switching balancer to "pick_first"
2021/11/30 16:21:35 INFO: [core] Channel switches to new LB policy "pick_first"
2021/11/30 16:21:35 INFO: [core] Channel Connectivity change to TRANSIENT_FAILURE
2021/11/30 16:21:35 INFO: [core] pickfirstBalancer: ResolverError called with error xds: ListenerResource target localhost:50051 not found, watcher timeout
2021/11/30 16:21:35 could not greet: rpc error: code = Unavailable desc = name resolver error: xds: ListenerResource target localhost:50051 not found, watcher timeout

Given that this situation is super-likely for new users, it is very important that we promptly provide helpful error messages in situations like this one.

@dfawley
Copy link
Member

dfawley commented Dec 22, 2021

@menghanl was #5054 enough to resolve this issue or is there more to be done?

@ejona86
Copy link
Member Author

ejona86 commented Dec 22, 2021

It didn't change the state of this issue. I had re-tested after that PR and it was realized to be an accident to close it.

@dfawley dfawley added the P2 label Mar 2, 2022
@menghanl
Copy link
Contributor

This is now failing with

2022/05/20 10:58:08 could not greet: rpc error: code = Unavailable desc = name resolver error: xds: error received from xDS stream: rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: context canceled, xDS client nodeID: id:"projects/5/networks/default/nodes/5de01256-7137-4488-8f80-5cdd836ce049" cluster:"cluster" metadata:{fields:{key:"INSTANCE_IP" value:{string_value:"127.0.1.1"}} fields:{key:"TRAFFICDIRECTOR_GCP_PROJECT_NUMBER" value:{string_value:"5"}} fields:{key:"TRAFFICDIRECTOR_NETWORK_NAME" value:{string_value:"default"}}} locality:{} user_agent_name:"gRPC Go" user_agent_version:"1.47.0-dev" client_features:"envoy.lb.does_not_support_overprovisioning" client_features:"xds.config.resource-in-sotw"

In xds: error received from xDS stream: rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: context canceled, it's not clear where context canceled comes from (probably from OAuth2)

@menghanl
Copy link
Contributor

I think it's OK to close this issue now.

@ejona86
Copy link
Member Author

ejona86 commented May 20, 2022

@menghanl, "per-RPC creds failed due to error: context canceled" makes it seem like a timeout. How is it clear there is a permission issue? We shouldn't be losing the ADS failure status. It seems even low-level debugging doesn't include it.

It looks like in Java we deliver ADS stream closures as errors to all the watchers. We might want to discuss if that is appropriate cross-language.

@ejona86
Copy link
Member Author

ejona86 commented May 20, 2022

@menghanl, in my test I had local credentials available (for the wrong project). In your test it is probably failing to talk to the metadata server or something.

@ejona86 ejona86 reopened this May 20, 2022
@ejona86
Copy link
Member Author

ejona86 commented May 20, 2022

Using td-grpc-bootstrap --gcp-project-number 5 and changing the channel creds to "insecure", I get:

grpc-go/examples/features/xds/client$ time ./client
2022/05/20 12:40:32 could not greet: rpc error: code = Unavailable desc = name resolver error: xds: ListenerResource target localhost:50051 not found, watcher timeout, xDS client nodeID: id:"projects/5/networks/default/nodes/fcee1e8b-301c-4cbe-adfd-44a694972a27"  cluster:"cluster"  metadata:{fields:{key:"INSTANCE_IP"  value:{string_value:"10.0.2.100"}}  fields:{key:"TRAFFICDIRECTOR_GCP_PROJECT_NUMBER"  value:{string_value:"5"}}  fields:{key:"TRAFFICDIRECTOR_NETWORK_NAME"  value:{string_value:"default"}}}  locality:{}  user_agent_name:"gRPC Go"  user_agent_version:"1.47.0-dev"  client_features:"envoy.lb.does_not_support_overprovisioning"  client_features:"xds.config.resource-in-sotw"

real	0m15.045s
user	0m0.058s
sys	0m0.042s

So nothing has changed. (Changing to insecure should have similar results to running on GCE or setting GOOGLE_APPLICATION_CREDENTIALS)

@easwars
Copy link
Contributor

easwars commented Nov 28, 2022

I will circle back to this once the xdsClient API changes are fully merged.

@easwars
Copy link
Contributor

easwars commented Feb 22, 2023

I created a trash bootstrap config with server_uri set to trafficdirector.googleapis.com:443, creds set to insecure and a project number of 5. Then I ran the unmodified xds client (which sets an RPC deadline of 10s) and saw the following:

  • When the http2 transport is being created, the first step is the transport credentials handshake. The Go implementation of insecure creds does nothing as part of the handshake and always returns a success. See:
    func (insecureTC) ClientHandshake(ctx context.Context, _ string, conn net.Conn) (net.Conn, credentials.AuthInfo, error) {
  • This means that the transport creation process moves forward and spawns the reader and writer goroutines. The reader goroutine fails to read the server preface (because the server is expecting a TLS client hello instead), and the transport creation fails with a temporary error. And we see the following logs:
2023/02/22 10:33:23 WARNING: [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {
  "Addr": "trafficdirector.googleapis.com:443",
  "ServerName": "trafficdirector.googleapis.com:443",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "error reading server preface: read tcp [2601:646:c200:fe:e196:dda4:4cb7:733f]:54300->[2607:f8b0:4005:802::200a]:443: read: connection reset by peer"
  • Since the returned error is temporary, the channel attempts to recreate the transport (with an exponential backoff) until the RPC deadline exceeds, at which point the RPC fails with a not so helpful error.
2023/02/22 10:33:33 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded

When I increase the RPC deadline to something like 30s (to ensure that it is greater than the default resource watch deadline of 15s), we see the same thing happening with the transport creation until the watch timer fires, at which point, we see the following logs:

2023/02/22 10:52:19 WARNING: [xds] [xds-client 0xc000382550] Watch for resource "localhost:50051" of type ListenerResource timed out
2023/02/22 10:52:19 INFO: [xds] [xds-resolver 0xc00032aaf0] received LDS update: {
  "RouteConfigName": "",
  "InlineRouteConfig": null,
  "MaxStreamDuration": 0,
  "HTTPFilters": null,
  "InboundListenerCfg": null,
  "Raw": null
}, err: resource name "localhost:50051" of type Listener not found in received response
2023/02/22 10:52:19 WARNING: [xds] [xds-resolver 0xc00032aaf0] Watch error on resource localhost:50051 from xds-client 0xc000373b60, resource name "localhost:50051" of type Listener not found in received response

At this point, the xDS resolver sends an empty service config on the channel, which results in the pick_first LB policy being selected, and we see the following logs:

2023/02/22 10:52:19 INFO: [core] [Channel #1] Resolver state updated: {
  "Addresses": null,
  "ServiceConfig": {
    "Config": {
      "Config": null,
      "LB": null,
      "Methods": {}
    },
    "Err": null
  },
  "Attributes": null
} (service config updated)
2023/02/22 10:52:19 INFO: [core] [Channel #1] Channel switches to new LB policy "pick_first"
2023/02/22 10:52:19 INFO: [core] pickfirstBalancer: ResolverError called with error: produced zero addresses
2023/02/22 10:52:19 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE

This then immediately leads to RPC failure:

2023/02/22 10:52:19 could not greet: rpc error: code = Unavailable desc = name resolver error: produced zero addresses

@easwars
Copy link
Contributor

easwars commented Feb 22, 2023

@ejona86 @dfawley
I have a couple of questions here:

  • Should the client handshake have failed leading to a quick RPC failure? Is that the behavior of the insecure credentials in Java?
  • We have a WARNING log line from the xds-client saying the watch timed out. And a WARNING log line from the xds-resolver saying the request resource was not found in received response. Do you think this is good enough? I agree that the eventual RPC error is a bit vague here.

What do you think?

@ejona86
Copy link
Member Author

ejona86 commented Feb 22, 2023

Since the returned error is temporary, the channel attempts to recreate the transport (with an exponential backoff) until the RPC deadline exceeds, at which point the RPC fails with a not so helpful error.

That goes into a previous conversation about how grpc-go should not be using wait-on-ready for the ADS stream because it can't fail with useful error messages (nor promptly).

We have a WARNING log line from the xds-client saying the watch timed out. And a WARNING log line from the xds-resolver saying the request resource was not found in received response. Do you think this is good enough?

Timeouts and "not found" aren't helpful and are really misleading. The "error reading server preface" is the error that is useful.

@easwars
Copy link
Contributor

easwars commented Feb 23, 2023

That goes into a previous conversation about how grpc-go should not be using wait-on-ready for the ADS stream because it can't fail with useful error messages (nor promptly).

A57 changes are currently in review here: #5996, and there we are getting rid of the wait-for-ready call option.

Timeouts and "not found" aren't helpful and are really misleading. The "error reading server preface" is the error that is useful.

Are you of the opinion that we should not be logging the timeout and "not found" errors?

@ejona86
Copy link
Member Author

ejona86 commented Feb 23, 2023

Are you of the opinion that we should not be logging the timeout and "not found" errors?

My main opinion is those errors are wrong today. Once you fix the wait-for-ready you should no longer time out and convert that to not found. It isn't so much "log or don't log" but "the code should stop going through those flows."

@easwars
Copy link
Contributor

easwars commented Mar 2, 2023

After the A57 changes were merged, the client does not time out, and instead throws the following error:

2023/03/02 08:33:51 could not greet: rpc error: code = Unavailable desc = name resolver error: xds: error received from xDS stream: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: unexpected EOF"

Thanks @ejona86 !!

I'm closing this now since I feel like things are working as expected. Please let us know if you feel otherwise.

FYI @arvindbr8

@easwars easwars closed this as completed Mar 2, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P2 Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants