
New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool #10603

Closed · shellfu opened this issue Jul 12, 2021 · 67 comments · Fixed by #15701
Labels
theme/streaming (Related to Streaming connections between server and client) · type/bug (Feature does not function as expected)

Comments

shellfu commented Jul 12, 2021

Note from @lkysow: I'm moving this to hashicorp/consul because the discuss post shows a user on EC2 also saw this error.

Overview of the Issue

New 1.10.0 on New K8s Cluster results in [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.200.65.16:8300 0 consul-server-2.primary <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.200.65.16:8300: operation was canceled". Reconnecting...

These WARNs appear in the logs of both the servers and the clients.

Reproduction Steps

  1. When running helm install with the following values.yml:
client:
  enabled: true
  grpc: true
connectInject:
  aclBindingRuleSelector: serviceaccount.name!=default
  default: false
  enabled: true
  metrics:
    defaultEnableMerging: true
    defaultEnabled: true
    defaultMergedMetricsPort: 20100
    defaultPrometheusScrapePath: /metrics
    defaultPrometheusScrapePort: 20200
  transparentProxy:
    defaultEnabled: true
    defaultOverwriteProbes: true
controller:
  enabled: true
dns:
  enabled: true
global:
  acls:
    createReplicationToken: true
    manageSystemACLs: true
  datacenter: primary
  enabled: true
  federation:
    createFederationSecret: true
    enabled: true
  gossipEncryption:
    secretKey: key
    secretName: consul-gossip-encryption-key
  image: hashicorp/consul:1.10.0
  imageEnvoy: envoyproxy/envoy-alpine:v1.18.3
  imageK8S: hashicorp/consul-k8s:0.26.0
  logJSON: true
  metrics:
    agentMetricsRetentionTime: 1m
    enableAgentMetrics: false
    enableGatewayMetrics: true
    enabled: true
  name: consul
  tls:
    enableAutoEncrypt: true
    enabled: true
    httpsOnly: true
    serverAdditionalDNSSANs:
    - '*.consul'
    - '*.svc.cluster.local'
    - '*.my.customdomain.com'
    verify: false
meshGateway:
  enabled: true
  service:
    enabled: true
    port: 443
    type: LoadBalancer
  wanAddress:
    port: 443
    source: Service
server:
  bootstrapExpect: 5
  connect: true
  disruptionBudget:
    enabled: true
    maxUnavailable: 2
  enabled: true
  extraConfig: "{\n  \"primary_datacenter\": \"primary\",\n  \"performance\": {\n
    \     \"raft_multiplier\": 3\n  },\n  \"dns_config\": {\n    \"allow_stale\":
    true,\n    \"cache_max_age\": \"10s\",\n    \"enable_additional_node_meta_txt\":
    false,\n    \"node_ttl\": \"1m\",\n    \"soa\": {\n        \"expire\": 86400,
    \n        \"min_ttl\": 30,\n        \"refresh\": 3600,\n        \"retry\": 600\n
    \   },\n    \"use_cache\": true\n}}"
  replicas: 5
  resources:
    limits:
      cpu: 500m
      memory: 10Gi
    requests:
      cpu: 500m
      memory: 10Gi
  storage: 10Gi
  updatePartition: 0
syncCatalog:
  default: true
  enabled: true
  nodePortSyncType: ExternalFirst
  syncClusterIPServices: true
  toConsul: true
  toK8S: true
ui:
  enabled: true
  metrics:
    baseURL: http://mon-kube-prometheus-stack-prometheus.monitoring.svc.cluster.local
    enabled: true
    provider: prometheus
  service:
    enabled: true
    type: NodePort

Expected behavior

WARNs should not be flooding the log, and connections should be made over port 8301, not 8300.

Environment details

If not already included, please provide the following:

  • consul-k8s version: 0.26.0
  • consul-helm version: 0.32.1
  • values.yaml used to deploy the helm chart: see above

Additional Context

It seems others are experiencing the same problem.
https://discuss.hashicorp.com/t/grpc-warning-on-consul-1-10-0/26237

lkysow (Member) commented Jul 12, 2021

Hi, based on https://discuss.hashicorp.com/t/grpc-warning-on-consul-1-10-0/26237 it sounds like this issue is not specific to Kubernetes. I'm going to move this to hashicorp/consul.

@lkysow lkysow transferred this issue from hashicorp/consul-helm Jul 12, 2021
@dnephin dnephin added theme/streaming Related to Streaming connections between server and client type/bug Feature does not function as expected labels Jul 13, 2021
dnephin (Contributor) commented Jul 13, 2021

Thank you for reporting this issue!

I was just running a Consul agent locally to debug a different issue and I noticed this problem happens at the same time as these 2 debug lines:

2021-07-13T20:26:20.707Z [DEBUG] agent.router.manager: Rebalanced servers, new active server: number_of_servers=2 active_server="a19bd98836ec.dc1 (Addr: tcp/172.20.0.2:8300) (DC: dc1)"
2021-07-13T20:26:20.707Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {172.20.0.3:8300 0 0cc9dd0254a2.dc1 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.20.0.3:8300: operation was canceled". Reconnecting...
2021-07-13T20:26:20.707Z [DEBUG] agent.router.manager: Rebalanced servers, new active server: number_of_servers=1 active_server="a19bd98836ec (Addr: tcp/172.20.0.2:8300) (DC: dc1)"

The problem seems to be that when we rebalance the servers the active transport is cancelled, which causes this error to be printed.
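
For anyone who wants to see that mechanism in isolation, here is a minimal, illustrative Go sketch (not Consul code; the scheme name and addresses are invented) showing that replacing the address list in a gRPC resolver while a transport is still being dialed cancels that dial, which is what makes grpc-go emit the "addrConn.createTransport failed to connect ... operation was canceled" warning. Whether the warning actually reaches stderr depends on how grpclog is configured; Consul routes grpclog into its own logger.

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func main() {
	// Manual resolver with a single (unreachable) server address, so the
	// dial stays in flight long enough to be interrupted.
	r := manual.NewBuilderWithScheme("demo")
	r.InitialState(resolver.State{Addresses: []resolver.Address{{Addr: "10.0.0.1:8300"}}})

	conn, err := grpc.Dial(
		"demo:///servers",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithResolvers(r),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.Connect() // start dialing the current address

	// Simulate a "rebalance": pushing a new address list tears down the
	// subchannel that is still dialing, producing the WARN described above.
	time.Sleep(100 * time.Millisecond)
	r.UpdateState(resolver.State{Addresses: []resolver.Address{{Addr: "10.0.0.2:8300"}}})

	time.Sleep(time.Second)
}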

@mikemorris (Contributor):

Is the issue here that the behavior is potentially incorrect, or that a common occurrence is erroneously categorized at WARN log level?

shellfu (Author) commented Jul 20, 2021

I've installed in a couple of other locations with the same chart/values as above, and within the same datacenter the WARN messages reference the other consul-servers in the cluster. This occurs whether or not the cluster is WAN federated; that doesn't appear to have an impact.

Currently, I'm trying to track down a couple of network issues I have been experiencing in Consul 1.10.

I am trying to obtain more evidence, but I deleted the 1.10 cluster and went back to 1.8.4, which did not appear to emit the WARN. Can this be ignored? Not sure yet.

ikonia commented Jul 21, 2021

Seeing an exact mirror of this problem on a small development cluster running on Raspberry Pi 4s in a very basic configuration, all running Consul 1.10.1.

The errors in my case are the server talking to itself.

E.g. my 3 Raft servers are made up of 3 nodes, called:

nog
wesley
jake

Node                 ID                                    Address             State     Voter  RaftProtocol
wesley.no-dns.co.uk  5e8a186b-adb5-ebba-eeb4-e10656568adf  10.11.216.81:8300   leader    true   3
nog.no-dns.co.uk     086a7491-bf09-c7e2-9151-74c817ffb74c  10.11.216.182:8300  follower  true   3
jake.no-dns.co.uk    aa37ec78-d438-8726-77e5-c5619dfb054a  10.11.216.234:8300  follower  true   3

In nog's log:
Jul 21 10:21:35 nog consul[5613]: 2021-07-21T10:21:35.048Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.11.216.182:8300 0 nog.no-dns.co.uk.bathstable }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.11.216.182:8300: operation was canceled". Reconnecting...

The IP address 10.11.216.182 is actually the IP address of the host 'nog', so the error is about the server talking to itself.

On the host 'jake' the log shows the same failure to connect to the host 'nog':
Jul 21 10:17:17 jake consul[7781]: 2021-07-21T10:17:17.833Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.11.216.182:8300 0 nog.no-dns.co.uk.bathstable }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.11.216.182:8300: operation was canceled". Reconnecting...

On the host 'wesley' (the leader):

Jul 21 10:19:19 wesley consul[15814]: 2021-07-21T10:19:19.779Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.11.216.81:8300 0 wesley.no-dns.co.uk.bathstable }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.11.216.81:8300: operation was canceled". Reconnecting...

The 10.11.216.81 IP address it's failing to talk to is wesley, i.e. itself in this case.

shellfu (Author) commented Jul 21, 2021

Yeah, that is what I am seeing. It's the local cluster that is emitting these messages and FLOODING the logs.

@kornface13:

I see the same issue. Three-node cluster running on VMs (CentOS 8).

Consul v1.10.0
Revision 27de64d
Running on Cent 8 VMs.

All three nodes sporadically spit out an error about connecting to one of the other master nodes.

agent: grpc: addrConn.createTransport failed to connect to {10.248.14.54:8300 0 consul02.c.blah.internal <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.248.14.54:8300: operation was canceled". Reconnecting...

shellfu (Author) commented Jul 26, 2021

I went ahead and did more installation tests.

I installed Consul 1.10.1 with chart 0.32.1, and then backed the Consul and chart versions all the way down to 1.8.4 and 0.26.0, as I am also experiencing other problems that are not related to this issue.

The WARNs appear in the latest 1.10.x versions, and they are emitted both in the local datacenter and in a federated environment using mesh gateways.

@sri4kanne:

We are also seeing similar errors, but there seems to be no issue with the cluster itself. We are running Consul v1.10.1 on OL8 VMs.

@therealhanlin:

@ikonia I'm running Consul on two Raspberry Pis and ran into this issue a couple of days ago as well. In case you haven't found a solution, I seem to have found the cause of my issue.

I first noticed an inconsistency in the member status shown on each Pi. As you can see from the snippets below, Pi 01 seems to think that 02 is leaving the cluster, while 02 thinks it's still in it. So I restarted the consul service on 02 and that fixed the issue.

I think the problem was caused by starting the services on both Pis at the same time; the nodes didn't negotiate properly and somehow that caused this weird bug. I use Ansible to stage the nodes and manage their configs, and whenever I changed the configs it restarted the services on all nodes at the same time, which is not a good idea (duh...).

I'm not sure what your setup is, but maybe try spinning up the nodes one by one, which solved the problem for me.

raspi01:
Node           Address            Status   Type    Build   Protocol  DC     Segment
raspi01.raspi  192.168.1.10:8302  alive    server  1.10.1  2         raspi  <all>
raspi02.raspi  192.168.1.11:8302  leaving  server  1.10.1  2         raspi  <all>

raspi02:
Node           Address            Status  Type    Build   Protocol  DC     Segment
raspi01.raspi  192.168.1.10:8302  alive   server  1.10.1  2         raspi  <all>
raspi02.raspi  192.168.1.11:8302  alive   server  1.10.1  2         raspi  <all>

shellfu (Author) commented Aug 9, 2021

Restarts in the way you describe unfortunately do not solve the problem here. The clusters appear to be healthy otherwise, but this is flooding the logs, and I don't think we have received a response as to whether the messages are indicative of an issue or are something that can be ignored while waiting for a patch.

dnephin (Contributor) commented Aug 9, 2021

I believe these messages can be ignored. We periodically rebalance servers in the connect pool, and it looks like doing so is causing gRPC to emit these warnings. It seems like gRPC is reconnecting after the rebalance, so likely we can move these messages to INFO instead of WARN, but we'll need to do more investigation to be sure.

isality commented Aug 14, 2021

+1. This does not seem OK.

consul v1.10.1

avoidik commented Sep 6, 2021

We're having the same issue on the Enterprise version; we will try to raise a support ticket there.

drawks (Contributor) commented Sep 17, 2021

💯 this should be moved to an INFO-level log; normal system behavior that doesn't result in any degradation and self-heals should not be something we are warned about.

Peter-consul commented Oct 1, 2021

I have the same issue after upgrading to 1.10.2.

weastur commented Oct 4, 2021

@dnephin Is there any chance to fix this in an upcoming release?

@dnephin dnephin changed the title New 1.10.0 on New K8s Cluster results in [WARN]: New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool Oct 13, 2021
@jkirschner-hashicorp (Contributor):

To clarify our current understanding of this: this is not a bug, but instead a misclassified log message (that shouldn't be WARN).

Per @dnephin:

I believe these messages can be ignored. We periodically rebalance servers in the connect pool, and it looks like doing so is causing gRPC to emit these warnings. It seems like gRPC is reconnecting after the rebalance, so likely we can move these messages to INFO instead of WARN, but we'll need to do more investigation to be sure.

In this case, the aforementioned "need to do more investigation" is about how to make the change to reduce verbosity, not about the cause or whether there's a bug. The change requires some investigation because the message is emitted by gRPC, not Consul.

@adamw-linadm:

Exactly the same error on 1.11.1, bare metal / CentOS 7.

dnephin (Contributor) commented Jan 11, 2022

If this log message was coming directly from Consul this would be much easier to fix. Unfortunately the log message is coming from a library (gRPC), which makes it a bit harder to fix.

I think we have two options for addressing this:

  1. change all gRPC WARN messages to INFO
  2. change how we modify the grpc ClientConnPool so that it does not warn

Option 1 is pretty safe, but I'm not sure if it fixes much. There will still be an INFO log message that is printed periodically. I guess it is slightly better to print this as an INFO than a WARN. The downside is that other gRPC WARN messages may not be visible enough in logs at INFO level.

Option 2 is much more involved, but is likely a safer long term fix. I believe the cause of this warning is this code:

r.clientConn.UpdateState(resolver.State{Addresses: firstAddr})
// Call UpdateState again with the entire list of addrs in case we need them
// for failover.
r.clientConn.UpdateState(resolver.State{Addresses: addrs})

If we trace those UpdateState calls, we'll see we can end up in the code that logs that warning. My rough understanding is that by updating the client conn state we are cancelling the dial operation, which prints this message. It may be that calling UpdateState twice like this is what triggers the warning. Under normal operation I guess the dial would complete and we wouldn't see a warning. This also shows why we can't just hide the message: if the dial operation were failing for some other reason, we'd want to know about it. The "further investigation" would be to see if we can remove the need to call UpdateState twice, and to confirm that is sufficient to prevent the warning.
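
To make option 2 concrete, here is a rough sketch of what it could look like (a hypothetical serverResolver type standing in for the resolver that owns the ClientConn; this is not necessarily how the eventual fix in #15701 was implemented): build a single address list with the newly selected server first and push it in one UpdateState call, so the balancer dials the preferred server and still has the others for failover, without a second update cancelling the in-flight dial.

package sketch

import "google.golang.org/grpc/resolver"

// serverResolver is a hypothetical stand-in for the resolver that owns the
// gRPC ClientConn; only the clientConn field matters for this sketch.
type serverResolver struct {
	clientConn resolver.ClientConn
}

// updateAddrs sends the full server list in a single UpdateState call, with
// the newly selected server moved to the front.
func (r *serverResolver) updateAddrs(addrs []resolver.Address, active resolver.Address) {
	ordered := make([]resolver.Address, 0, len(addrs))
	ordered = append(ordered, active)
	for _, a := range addrs {
		if a.Addr != active.Addr {
			ordered = append(ordered, a)
		}
	}
	// One call instead of two: pick-first dials ordered[0] and can fail over
	// to the remaining addresses if needed.
	r.clientConn.UpdateState(resolver.State{Addresses: ordered})
}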

@idrennanvmware:

We're seeing a pretty healthy amount of these messages as well across our clusters. Keeping our eyes on this.

Given the above, option 2 is definitely our preference. Not sure we even want to get a message in this case unless it's something to be concerned about.

@danlsgiga:

Still happening on 1.11.2 running on my homelab.

+1 to Option 2

@NagenderPulluri:

Can we ignore this error if everything is working as expected? Or do we need to be concerned about this warning/error?

Amier3 (Contributor) commented Feb 28, 2022

Hey @nagender1005

Yes, you can ignore this warning if everything is working as expected. Per earlier in this thread:

I believe these messages can be ignored. We periodically rebalance servers in the connect pool, and it looks like doing so is causing gRPC to emit these warnings. It seems like gRPC is reconnecting after the rebalance, so likely we can move these messages to INFO instead of WARN, but we'll need to do more investigation to be sure.

Hope this helps!

@chrisvanmeer:

Still happening in Consul v1.11.3 as well.

@scottnemes:

Still happening in Consul v1.12.0.

kisunji (Contributor) commented Dec 19, 2022

PR #15701 has been merged and should land in the next patch versions for 1.14.x, 1.13.x and 1.12.x. Keeping this issue open for now to wait for feedback from the community.

My PR should fix the periodic WARN logs during server shuffling, which occurs every ~2 minutes by default.

Note that you may continue to encounter some WARNs on agent startup and on very infrequent occasions. This is a related but separate issue #15821

@kisunji kisunji reopened this Dec 19, 2022
kong62 commented Jan 6, 2023

same on consul 1.14.2

2023-01-06T05:49:31.453Z [WARN]  agent: [core][Channel #1 SubChannel #1017602] grpc: addrConn.createTransport failed to connect to {
  "Addr": "dc1-192.168.48.74:8300",
  "ServerName": "consul-server-2.dc1",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->192.168.48.74:8300: operation was canceled"

@jkirschner-hashicorp (Contributor):

Hi @kong62,

The issue is expected to still be present in 1.14.2. It will be fixed as of the next set of patch releases (1.14.4, 1.13.6, and 1.12.9).

kong62 commented Jan 9, 2023

Hi @kong62,

The issue is expected to still be present in 1.14.2. It will be fixed as of the next set of patch releases (1.14.4, 1.13.6, and 1.12.9).

thank you

@chrisvanmeer:

Can confirm that after upgrading to v1.14.4 I have not experienced these messages during normal operation for one whole day now. Will continue monitoring.

david-yu (Contributor) commented Feb 6, 2023

Thanks, will go ahead and close this issue as it looks like this is fixed in 1.14.4 and later. Note that you may continue to encounter some WARNs on agent startup and on very infrequent occasions. This is a related but separate issue: #15821

@david-yu david-yu closed this as completed Feb 6, 2023
@urosgruber:

I've installed 1.12.9 and I can still see these messages. Not that frequent, but still:

2023-02-06T05:50:55.844+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.5:8300 consul-3 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.5:8300: operation was canceled". Reconnecting...
2023-02-06T05:57:43.781+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-06T05:58:10.466+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T06:50:09.474+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T07:13:16.970+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T07:18:54.830+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-06T07:36:53.408+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.5:8300 consul-3 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.5:8300: operation was canceled". Reconnecting...

webant commented Feb 12, 2023

On v1.14.4 this one is still reproducible:
agent: [core][Channel #1 SubChannel #21] grpc: addrConn.createTransport failed to connect to {
"Addr": "xx.xx.xx.xx",
"ServerName": "xx.xx.xx.xx",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp ->xx.xx.xx.xx:8300: operation was canceled"

@marianatkach:

I have the same problem on my Consul cluster, v1.14.4.

[WARN] agent: [core][Channel #1 SubChannel #15] grpc: addrConn.createTransport failed to connect to {
"Addr": "xx.xx.xx.xx",
"ServerName": "xx.xx",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp ->xx.xx.xx.xx: operation was canceled"

kisunji (Contributor) commented Feb 21, 2023

@urosgruber @webant @mexdevops Do those WARNs occur regularly? We still expect to see these logs as an agent starts up, but they should become infrequent over time. See #15821 for reference.

@marianatkach:

@urosgruber @webant @mexdevops Do those WARNs occur regularly? We still expect to see these logs as an agent starts up, but they should become infrequent over time. See #15821 for reference.

Yep, I get these messages from time to time and I can't fix this.
Also, I don't know anything more about this problem...

@nhenneaux:

It appears every 2-3 days for each agent. It is not related to any startup; with several hundred agents there is a fairly constant flow of these, roughly ~100 of these errors every day.

@urosgruber:

@urosgruber @webant @mexdevops Do those WARNs occur regularly? We still expect to see these logs as an agent starts up, but they should become infrequent over time. See #15821 for reference.

With 3 servers, no agents connected, and a fresh install: about 10 messages a day at random.

kisunji (Contributor) commented Feb 21, 2023

Could you provide log fragments with lines before and after you see these WARNs?

I'd like to confirm that they are not caused by this issue (server shuffling every ~2 mins) and would like to move the investigation to either #15821 or a new issue if necessary.

@urosgruber:

Do you need INFO level? I believe there is nothing else if I set it to WARN.

kisunji (Contributor) commented Feb 21, 2023

Do you need INFO level? I believe there is nothing else if I set it to WARN.

Yes, INFO would be appreciated.

@urosgruber:

2023-02-22T11:14:22.693+0100 [INFO]  agent: (LAN) joining: lan_addresses=[10.192.0.3, 10.192.0.4, 10.192.0.5]
2023-02-22T11:14:22.698+0100 [INFO]  agent: started state syncer
2023-02-22T11:14:22.699+0100 [INFO]  agent: Consul agent running!
2023-02-22T11:14:22.699+0100 [INFO]  agent: (LAN) joined: number_of_nodes=3
2023-02-22T11:14:22.699+0100 [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2023-02-22T11:14:22.720+0100 [WARN]  agent.server.raft: failed to get previous log: previous-index=48302 last-index=48300 error="log not found"
2023-02-22T11:14:23.024+0100 [INFO]  agent: Synced node info
2023-02-22T11:16:22.675+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-22T12:21:01.609+0100 [WARN]  agent.server.raft: skipping application of old log: index=48795
2023-02-22T12:21:47.875+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-22T13:09:58.024+0100 [INFO]  agent.fsm: snapshot created: duration=236.985µs
2023-02-22T13:09:58.025+0100 [INFO]  agent.server.raft: starting snapshot up to: index=49157
2023-02-22T13:09:58.025+0100 [INFO]  agent.server.snapshot: creating new snapshot: path=/var/db/consul/raft/snapshots/2-49157-1677067798025.tmp
2023-02-22T13:09:58.030+0100 [INFO]  agent.server.snapshot: reaping snapshot: path=/var/db/consul/raft/snapshots/2-16387-1676800266639
2023-02-22T13:09:58.031+0100 [INFO]  agent.server.raft: compacting logs: from=22534 to=38917
2023-02-22T13:09:58.056+0100 [INFO]  agent.server.raft: snapshot complete up to: index=49157
2023-02-22T13:46:17.188+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...
2023-02-22T14:15:21.288+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.3:8300 consul-1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.3:8300: operation was canceled". Reconnecting...
2023-02-22T15:04:57.369+0100 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-10.192.0.4:8300 consul-2.dc1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->10.192.0.4:8300: operation was canceled". Reconnecting...

@nhenneaux:

From my side, there are no INFO logs around those, across the whole cluster. A single agent is impacted at a time.

ikonia commented Jun 22, 2023

Can we confirm whether this is fixed? It's marked as fixed, but there are still reports of the problem in this thread.

kisunji (Contributor) commented Jun 22, 2023

Hello everyone,

I created issue #17842 to track the problem which I consider separate from this one. Although they are related, it is helpful for us to triage the issue separately given that the WARN logs are no longer frequent or regular.

If you are still seeing WARN logs ending with "operation was canceled", please help us investigate by providing DEBUG logs in #17842.
