
Upgrading from 2.10.2 -> 2.11.0 - Kube Client "error trying to connect: tls handshake eof" in Policy Controller #7098

Closed
sigurdfalk opened this issue Oct 15, 2021 · 17 comments


sigurdfalk commented Oct 15, 2021

Bug Report

What is the issue?

We are having issues upgrading from 2.10.2 -> 2.11.0, which seem to be related to the new Policy Controller. The controller fails during startup, apparently when the Kube client tries to watch resources: "kube::client: failed with error error trying to connect: tls handshake eof". Logs:

2021-10-06T11:36:02.462254Z  INFO grpc{addr=0.0.0.0:8090 cluster_networks=[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]}: linkerd_policy_controller: gRPC server listening addr=0.0.0.0:8090
2021-10-06T11:36:02.462275Z  INFO serve{addr=0.0.0.0:9990}: linkerd_policy_controller::admin: HTTP admin server listening addr=0.0.0.0:9990
2021-10-06T11:36:02.462913Z  INFO linkerd_policy_controller: Admission controller server listening addr=0.0.0.0:9443
2021-10-06T11:36:02.469200Z ERROR servers: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469221Z  INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469455Z ERROR pods: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469476Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469513Z ERROR serverauthorizations: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469525Z  INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:03.470663Z  INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470763Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470804Z  INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting

We are able to trace the request to our API Server, which seems to respond with status 200 but indicates "Connection closed early":

{
   "kind":"Event",
   "apiVersion":"audit.k8s.io/v1",
   "level":"Request",
   "auditID":"78cdd15d-3197-4437-99ae-c44504bea33b",
   "stage":"ResponseStarted",
   "requestURI":"/api/v1/endpoints?allowWatchBookmarks=true\u0026resourceVersion=477527\u0026timeout=5m14s\u0026timeoutSeconds=314\u0026watch=true",
   "verb":"watch",
   "user":{
      "username":"system:serviceaccount:linkerd:linkerd-destination",
      "uid":"edc0ceeb-f8df-4e52-b73b-65b2a420ee21",
      "groups":[
         "system:serviceaccounts",
         "system:serviceaccounts:linkerd",
         "system:authenticated"
      ],
      "extra":{
         "authentication.kubernetes.io/pod-name":[
            "linkerd-destination-9b7f77f5f-7dbj8"
         ],
         "authentication.kubernetes.io/pod-uid":[
            "51ee23ea-5b2c-46dd-bda6-2e998b8fb78c"
         ]
      }""
   },
   "sourceIPs":[
      "***"
   ],
   "userAgent":"controller/v0.0.0 (linux/amd64) kubernetes/$Format",
   "objectRef":{
      "resource":"endpoints",
      "apiVersion":"v1"
   },
   "responseStatus":{
      "metadata":{
         
      },
      "status":"Success",
      "message":"Connection closed early",
      "code":200
   },
   "requestReceivedTimestamp":"2021-10-06T11:30:55.194103Z",
   "stageTimestamp":"2021-10-06T11:36:09.196098Z",
   "annotations":{
      "authorization.k8s.io/decision":"allow",
      "authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"linkerd-linkerd-destination\" of ClusterRole \"linkerd-linkerd-destination\" to ServiceAccount \"linkerd-destination/linkerd\""
   }""
}

linkerd check output

We didn't save the output from linkerd check and have rolled back to 2.10.2 now. However, when we ran the check while the issue was occurring, it reported all ✅

Environment

  • Kubernetes Version: 1.21.2
  • Cluster Environment: AKS
  • Host OS: Linux
  • Linkerd version: 2.11.0

Additional context

Linkerd 2.10.2 has been running in the same cluster without any issues. We had some trouble creating a certificate for the policy controller using cert-manager, as the Rust TLS HTTP client it uses apparently doesn't support ECDSA. We solved this by switching to RSA, as discussed in this thread on Linkerd Slack.

sigurdfalk changed the title Upgrading from 2.10.2 -> 2.11.0 - Kub Client "error trying to connect: tls handshake eof" in Policy Controller Upgrading from 2.10.2 -> 2.11.0 - Kube Client "error trying to connect: tls handshake eof" in Policy Controller Oct 15, 2021

olix0r commented Oct 18, 2021

Thanks @sigurdfalk. This sounds similar to #7011. In that issue, we observed that the policy controller only works with a strict subset of the ECDSA algorithms specified in the TLSv1.3 RFC:

          /* ECDSA algorithms */
          ecdsa_secp256r1_sha256(0x0403),
          ecdsa_secp384r1_sha384(0x0503),
          ecdsa_secp521r1_sha512(0x0603),
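
For orientation, a minimal sketch (not taken from the Linkerd source; the constant name is illustrative) of how those code points are spelled in rustls, assuming rustls 0.20's SignatureScheme enum:

// Illustrative only; not Linkerd's actual configuration.
use rustls::SignatureScheme;

pub const WEBHOOK_ECDSA_SCHEMES: &[SignatureScheme] = &[
    SignatureScheme::ECDSA_NISTP256_SHA256, // ecdsa_secp256r1_sha256 (0x0403)
    SignatureScheme::ECDSA_NISTP384_SHA384, // ecdsa_secp384r1_sha384 (0x0503)
    SignatureScheme::ECDSA_NISTP521_SHA512, // ecdsa_secp521r1_sha512 (0x0603)
];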

Can you share an example of a certificate that did not work?

@sigurdfalk (Author)

@olix0r Thanks for the response.

We figured that the Rust TLS HTTP client used by the Policy Controller doesn't support ECDSA, as the following cert did not work (see the thread on Linkerd Slack):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-policy-validator
  namespace: linkerd
spec:
  secretName: linkerd-policy-validator-k8s-tls
  duration: 24h
  renewBefore: 1h
  issuerRef:
    name: webhook-issuer
    kind: Issuer
  commonName: linkerd-policy-validator.linkerd.svc
  dnsNames:
  - linkerd-policy-validator.linkerd.svc
  isCA: false
  privateKey:
    algorithm: ECDSA
  usages:
  - server auth

We then tried RSA, which was accepted by the Policy Controller, but then we started experiencing the error described in this issue. We created the cert like this (this is the one that did not work):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-policy-validator
  namespace: ${local.linkerd_namespace}
spec:
  secretName: linkerd-policy-validator-k8s-tls
  duration: 24h
  renewBefore: 1h
  issuerRef:
    name: webhook-issuer
    kind: Issuer
  commonName: linkerd-policy-validator.linkerd.svc
  dnsNames:
  - linkerd-policy-validator.linkerd.svc
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
  - server auth


olix0r commented Oct 19, 2021

Interesting!

I think this issue is related to rustls/rustls#332 and kube-rs/kube#542. We see:

:; k get secrets test-cert -o json | jq -r '.data["tls.key"] | @base64d'
-----BEGIN EC PRIVATE KEY-----
...
-----END EC PRIVATE KEY-----

But Rust's crypto libraries do not currently support PEM-formatted ECDSA private keys. See more discussion here.

For the time being we'll have to require that webhook credentials are RSA (or that ECDSA keys are provided in DER format, though I doubt cert-manager supports that out of the box). We'll probably want to follow up on djc's suggestion to implement a standalone PEM decoder for these cases.
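
As a minimal sketch of the failure mode (assuming the rustls-pemfile 1.x API; this is not Linkerd's actual key loader), a PKCS#8-only loader simply skips a SEC1 "EC PRIVATE KEY" block, leaving the server without a usable key:

// Minimal sketch, assuming rustls-pemfile 1.x; not Linkerd's actual code.
use std::{fs::File, io::BufReader};

fn load_pkcs8_keys(path: &str) -> std::io::Result<Vec<Vec<u8>>> {
    let mut reader = BufReader::new(File::open(path)?);
    // Collects only `-----BEGIN PRIVATE KEY-----` (PKCS#8) sections.
    // A key wrapped as `-----BEGIN EC PRIVATE KEY-----` (SEC1) is skipped,
    // so for cert-manager's default ECDSA output this returns an empty Vec.
    rustls_pemfile::pkcs8_private_keys(&mut reader)
}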


olix0r commented Oct 19, 2021

Actually...

If I configure my cert-manager certificate with:

  privateKey:
    algorithm: ECDSA
    encoding: PKCS8

I get credentials with:

:; k get secrets test-cert -o json | jq -r '.data["tls.key"] | @base64d'  |head -n 1
-----BEGIN PRIVATE KEY-----

Which might work. We'll give this a try later, or if you can try it and report back, that might save us some time ;)


olix0r commented Oct 19, 2021

From some very brief testing, this seems to work. We should update the docs at https://linkerd.io/2.11/tasks/automatically-rotating-webhook-tls-credentials/ (https://github.com/linkerd/website) -- which doesn't even document the policy controller webhook config at the moment -- to reflect this.


olix0r commented Oct 20, 2021

Docs updated linkerd/website#1221


olix0r commented Oct 20, 2021

@sigurdfalk The updated docs at https://linkerd.io/2.11/tasks/automatically-rotating-webhook-tls-credentials/#issuing-certificates-and-writing-them-to-secrets should get you a working cluster with 2.11. It would be great if you could confirm that this all works as expected (i.e., by adding encoding: PKCS8); after that, I think we can close this issue :)

@sigurdfalk (Author)

@olix0r That's great, thank you! I'm gonna try verifying this tomorrow 🙏🏻


sigurdfalk commented Oct 21, 2021

@olix0r The policy controller now accepts the cert with ECDSA; however, we are still seeing the same errors 🧐

I did a fresh install of v2.10.2 and then tried to upgrade to v2.11.0 with encoding: PKCS8 in the certificate. I still think this is really strange, as all other components of Linkerd seem to work as expected and are able to communicate with the API Server.

New dump of errors:

{"timestamp":"2021-10-21T08:08:47.991233Z","level":"INFO","fields":{"message":"gRPC server listening","addr":"0.0.0.0:8090"},"target":"linkerd_policy_controller","spans":[{"addr":"0.0.0.0:8090","cluster_networks":"[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]","name":"grpc"}]}
{"timestamp":"2021-10-21T08:08:47.991257Z","level":"INFO","fields":{"message":"HTTP admin server listening","addr":"0.0.0.0:9990"},"target":"linkerd_policy_controller::admin","spans":[{"addr":"0.0.0.0:9990","name":"serve"}]}
{"timestamp":"2021-10-21T08:08:47.991365Z","level":"INFO","fields":{"message":"Admission controller server listening","addr":"0.0.0.0:9443"},"target":"linkerd_policy_controller"}
{"timestamp":"2021-10-21T08:08:48.000102Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"pods"}]}
{"timestamp":"2021-10-21T08:08:48.000235Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:48.000384Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"servers"}]}
{"timestamp":"2021-10-21T08:08:48.000448Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:48.000550Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"serverauthorizations"}]}
{"timestamp":"2021-10-21T08:08:48.000576Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.000943Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.001056Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.002210Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}

And we still see the requests coming in to our API Server, which responds with status 200 OK. Here is a line from the API server log tracing back to the policy controller requesting /apis/split.smi-spec.io/v1alpha1/trafficsplits:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"359d181f-47bc-4bc9-ba3a-d2b52c9099bb","stage":"ResponseComplete","requestURI":"/apis/split.smi-spec.io/v1alpha1/trafficsplits?limit=500\u0026resourceVersion=0","verb":"list","user":{"username":"system:serviceaccount:linkerd:linkerd-destination","uid":"4820cbb9-bb93-4315-8121-1a83dd678e0d","groups":["system:serviceaccounts","system:serviceaccounts:linkerd","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["linkerd-destination-756758f5b-xfn6f"],"authentication.kubernetes.io/pod-uid":["0fe069f2-6624-445d-82de-28720cc0b957"]}​},"sourceIPs":["***"],"userAgent":"controller/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"trafficsplits","apiGroup":"split.smi-spec.io","apiVersion":"v1alpha1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-10-21T08:08:09.650095Z","stageTimestamp":"2021-10-21T08:08:09.651179Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"linkerd-linkerd-destination\" of ClusterRole \"linkerd-linkerd-destination\" to ServiceAccount \"linkerd-destination/linkerd\""}​}

Not sure if relevant, but we have an Azure Firewall between the cluster and the API Server. But this has never been an issue before.


olix0r commented Oct 21, 2021

@sigurdfalk the new policy controller is written in Rust using kube-rs, whereas the older components are implemented in Go with client-go. So we're still working through some quirks of using the new library. We really appreciate the helpful feedback, though. I'm sorry that this upgrade hasn't been seamless -- hopefully we'll figure out how to catch this class of issue better.
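
To illustrate what the policy controller is doing when it hits "tls handshake eof", here is a minimal kube-rs sketch (not the controller's actual code; it assumes kube built with the rustls-tls feature, plus tokio and anyhow):

// Minimal sketch of the initial list that precedes each watch.
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, Api, Client};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Infers in-cluster or kubeconfig settings; TLS goes through rustls here.
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::all(client);
    // This initial LIST is where the watch loop reports
    // "failed to perform initial object list: ... tls handshake eof".
    let pod_list = pods.list(&ListParams::default()).await?;
    println!("saw {} pods", pod_list.items.len());
    Ok(())
}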

Can you share the output of:

:; kubectl get secret $(kubectl get sa default -o json |jq  -r '.secrets[0].name') -o json |jq -r '.data["ca.crt"] | @base64d'   

I'm curious about the parameters used in the API server's CA certificate... We've definitely seen Linkerd 2.11 working fine in AKS...

And we still see the requests coming in to our API Server, which responds with status 200 OK. Here is a line from the API server log tracing back to the policy controller requesting /apis/split.smi-spec.io/v1alpha1/trafficsplits:

This is probably pedantic, but this is actually from the destination controller, not the policy controller (though both run in the same pod). So, I agree that probably confirms that it's not a firewall issue.

You could try installing 2.11 with policyController.logLevel=trace -- that should dump a lot more diagnostics in your logs, at least. Though, I'm not sure if that will give us a clearer error...

@sigurdfalk (Author)

@olix0r We really appreciate Linkerd and are happy to keep debugging this 🙂

Cert output is:

-----BEGIN CERTIFICATE-----
MIIE6TCCAtGgAwIBAgIRALsWkwnXbn6R2S/lNa+u6b4wDQYJKoZIhvcNAQELBQAw
DTELMAkGA1UEAxMCY2EwIBcNMjExMDIwMTIxNzMyWhgPMjA1MTEwMjAxMjI3MzJa
MA0xCzAJBgNVBAMTAmNhMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA
pumk/WT2PBil2FpdmJfeOnLVEJ7JbnvRsBNkgH9gJYX3PenWYw2C21GeayfxaKEl
UEq8n2hPIma2oCXyiDocFElALdx/Rerb19Kq7myVGY/AVhMOUBRxHnAcNoT7DWJ8
NZTYRn+BZ/LPp/RfGTTqffhUM/c+p/wy07ylfrrZZoE4fPOT8h9QV8vMR7AuGaQW
Gfl1qS4srYBpdssHCPzcadiuiZZnZPLyObBUA5BxOoqEZ7ZMRa0WSJ1Fcxy8DwYF
79NSCLDu+fHy4rpynW6Ps4xjS0xqvMD9A6cmdXgtEbPsSiX7+275+omB54QA8KCi
dBfqYjx7PvL4awTFwLwXQwiUpDvUnbdx/5LESsazjuDuJT3aTknhrzUToBfGfpk9
uYdWzV5dh0K69tth1v+bLRrOU/QgbzObKwHOnalp7GnWtkdtgBDA7eYWCdDHUwSd
ke0Q+aOlIh4P/gy3PpviJHJ3WHwfOrl9mSZxjRg6zPDpQLsuxwm9k0WOBv77nJYO
KK00cqYxC7QXvDn0Ymi2FkSzNE+NJZm6YGSx45ukEQvhiGrd3LL8gYQR6/i1f7NW
01u8djXBXJf+d4zqIky92AQyc3m0chck6IGv7DNzqSTs1g9U9cgwX+kCtYpwSR4R
gwezWANMpv/9Pt0V41PnHzeRuS+hsigbTdYfaUykWScCAwEAAaNCMEAwDgYDVR0P
AQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFCdfVS0vRvZJdJ9L
issw2JdFtfCpMA0GCSqGSIb3DQEBCwUAA4ICAQACi+vkUX/SA4wiZJBs2NKbIY9E
Vw9gG0iKyq28gQUZU1pnPPC1YbZEDuOq49MS4Mj8OrxwwGCwPfEh4vDm/PodPukT
Fn6KVhmct5/6xa7PdM8gcjUqRDc7awd9cSoxbwXkdDSBl2wTjoXcpxP4wQ9zqpdX
Sejo4+R2EeNkImZ0Xsyj2wk2benT1lbbWt9UjYFIfTaWOPPlCEGYpAbPsWFRS5YQ
W990JfzXPmx10j7xeOK2Am8RFzpMxvS0nZeUo7leR6L28r0MQvWKwQmkHAuv4Eej
NrunHdbjLWD6Rfn1lQ9yAJ6PJrxJPbFtQ3iSyowBzP6RQrX1Q1E+mRSJESaQg9dl
s1MThhRojC5TLmKKELqAOvAXGHRm1MhceQdrd5MS2XJ8Qedjjs7BecP+b0+AwggY
qLXkT4GRdEaUZRjRWCeQBmIYeVyKRpvr84O+p2x8JytH4RL7iqs7tC4ESf76xdG1
ZIiAEIYP1SklzE2Elw0v3uLMqu7USw3AyPdIMC39+uI6drLIfs1pV+txIalZ99sB
CokFYvQZoSWJPhB4mOqMjqRykvVKRRiw0dsl7TwuT3IGgclj+pDJ9lk4cwHoLJqH
PMP88tZijYTDnF7WQN8ztdw1XttwepdrCNB18X+Hqy1g9FUEDzo+Tfh8eBqO9bwH
C0jYoxWTnYTI9oC7vw==
-----END CERTIFICATE-----

The logs got really big when enabling TRACE, so, lacking knowledge of the Rust application, it's a bit hard for me to make much sense of them. But I'll keep digging. I dumped all logs from application start in the attached file (a 1-second window):

1634894398_118505.txt


olix0r commented Oct 22, 2021

@sigurdfalk Thanks! That certificate looks basically normal, as do the logs. We'll do some digging on a working cluster and see if we can come up with any differences.

@sigurdfalk (Author)

@olix0r did you have some luck in your testing?


olix0r commented Nov 2, 2021

@sigurdfalk Sorry, I don't think we have any leads on this issue yet. It's still on our radar, though.


stale bot commented Jan 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.


sigurdfalk commented May 2, 2022

@olix0r we no longer have this issue with 2.11.2. It's running perfectly in our clusters as we speak :) So you can close this issue from our side.


olix0r commented May 2, 2022

@sigurdfalk Excellent. Thanks for confirming!

olix0r closed this as completed May 2, 2022
github-actions bot locked as resolved and limited conversation to collaborators Jun 2, 2022