
Upgrading from 2.10.2 -> 2.11.0 - Kube Client "error trying to connect: tls handshake eof" in Policy Controller #7098

Closed
sigurdfalk opened this issue Oct 15, 2021 · 17 comments


sigurdfalk commented Oct 15, 2021

Bug Report

What is the issue?

We are having issues upgrading from 2.10.2 -> 2.11.0, which seem to be related to the new Policy Controller. The controller fails during startup, apparently when the Kube client tries to watch resources: "kube::client: failed with error error trying to connect: tls handshake eof". Logs:

2021-10-06T11:36:02.462254Z  INFO grpc{addr=0.0.0.0:8090 cluster_networks=[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]}: linkerd_policy_controller: gRPC server listening addr=0.0.0.0:8090
2021-10-06T11:36:02.462275Z  INFO serve{addr=0.0.0.0:9990}: linkerd_policy_controller::admin: HTTP admin server listening addr=0.0.0.0:9990
2021-10-06T11:36:02.462913Z  INFO linkerd_policy_controller: Admission controller server listening addr=0.0.0.0:9443
2021-10-06T11:36:02.469200Z ERROR servers: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469221Z  INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469455Z ERROR pods: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469476Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469513Z ERROR serverauthorizations: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469525Z  INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:03.470663Z  INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470763Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470804Z  INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting

We are able to trace the request to our API Server, which seems to respond with status 200 but indicates "Connection closed early":

{
   "kind":"Event",
   "apiVersion":"audit.k8s.io/v1",
   "level":"Request",
   "auditID":"78cdd15d-3197-4437-99ae-c44504bea33b",
   "stage":"ResponseStarted",
   "requestURI":"/api/v1/endpoints?allowWatchBookmarks=true\u0026resourceVersion=477527\u0026timeout=5m14s\u0026timeoutSeconds=314\u0026watch=true",
   "verb":"watch",
   "user":{
      "username":"system:serviceaccount:linkerd:linkerd-destination",
      "uid":"edc0ceeb-f8df-4e52-b73b-65b2a420ee21",
      "groups":[
         "system:serviceaccounts",
         "system:serviceaccounts:linkerd",
         "system:authenticated"
      ],
      "extra":{
         "authentication.kubernetes.io/pod-name":[
            "linkerd-destination-9b7f77f5f-7dbj8"
         ],
         "authentication.kubernetes.io/pod-uid":[
            "51ee23ea-5b2c-46dd-bda6-2e998b8fb78c"
         ]
      }""
   },
   "sourceIPs":[
      "***"
   ],
   "userAgent":"controller/v0.0.0 (linux/amd64) kubernetes/$Format",
   "objectRef":{
      "resource":"endpoints",
      "apiVersion":"v1"
   },
   "responseStatus":{
      "metadata":{
         
      },
      "status":"Success",
      "message":"Connection closed early",
      "code":200
   },
   "requestReceivedTimestamp":"2021-10-06T11:30:55.194103Z",
   "stageTimestamp":"2021-10-06T11:36:09.196098Z",
   "annotations":{
      "authorization.k8s.io/decision":"allow",
      "authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"linkerd-linkerd-destination\" of ClusterRole \"linkerd-linkerd-destination\" to ServiceAccount \"linkerd-destination/linkerd\""
   }""
}

linkerd check output

We didn't save the output from linkerd check and have rolled back to 2.10.2 now. However, when we ran the check while the issue was occurring, it reported all ✅

Environment

  • Kubernetes Version: 1.21.2
  • Cluster Environment: AKS
  • Host OS: Linux
  • Linkerd version: 2.11.0

Additional context

Linkerd 2.10.2 has been running in the same cluster without any issues. We had some trouble creating a certificate for the policy controller using cert-manager, as the Rust TLS HTTP client it uses apparently doesn't support ECDSA. We solved this by switching to RSA, as discussed in this thread on Linkerd Slack.

sigurdfalk changed the title Upgrading from 2.10.2 -> 2.11.0 - Kub Client "error trying to connect: tls handshake eof" in Policy Controller Upgrading from 2.10.2 -> 2.11.0 - Kube Client "error trying to connect: tls handshake eof" in Policy Controller Oct 15, 2021

olix0r commented Oct 18, 2021

Thanks @sigurdfalk. This sounds similar to #7011. In that issue, we observed that the policy controller only works with a strict subset of the ECDSA algorithms specified in the TLSv1.3 RFC:

          /* ECDSA algorithms */
          ecdsa_secp256r1_sha256(0x0403),
          ecdsa_secp384r1_sha384(0x0503),
          ecdsa_secp521r1_sha512(0x0603),
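
For orientation, a minimal sketch (not taken from the Linkerd source; the constant name is illustrative) of how those code points are spelled in rustls, assuming rustls 0.20's SignatureScheme enum:

// Illustrative only; not Linkerd's actual configuration.
use rustls::SignatureScheme;

pub const WEBHOOK_ECDSA_SCHEMES: &[SignatureScheme] = &[
    SignatureScheme::ECDSA_NISTP256_SHA256, // ecdsa_secp256r1_sha256 (0x0403)
    SignatureScheme::ECDSA_NISTP384_SHA384, // ecdsa_secp384r1_sha384 (0x0503)
    SignatureScheme::ECDSA_NISTP521_SHA512, // ecdsa_secp521r1_sha512 (0x0603)
];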

Can you share an example of a certificate that did not work?

@sigurdfalk (Author)

@olix0r Thanks for the response.

We figured that the Rust TLS HTTP client used by the Policy Controller doesn't support ECDSA, as the following cert did not work (see the thread on Linkerd Slack):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-policy-validator
  namespace: linkerd
spec:
  secretName: linkerd-policy-validator-k8s-tls
  duration: 24h
  renewBefore: 1h
  issuerRef:
    name: webhook-issuer
    kind: Issuer
  commonName: linkerd-policy-validator.linkerd.svc
  dnsNames:
  - linkerd-policy-validator.linkerd.svc
  isCA: false
  privateKey:
    algorithm: ECDSA
  usages:
  - server auth

We then tried RSA, which was accepted by the Policy Controller, but then we started experiencing the error described in this issue. We created the cert like this (this is the one that did not work):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-policy-validator
  namespace: ${local.linkerd_namespace}
spec:
  secretName: linkerd-policy-validator-k8s-tls
  duration: 24h
  renewBefore: 1h
  issuerRef:
    name: webhook-issuer
    kind: Issuer
  commonName: linkerd-policy-validator.linkerd.svc
  dnsNames:
  - linkerd-policy-validator.linkerd.svc
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
  - server auth


olix0r commented Oct 19, 2021

Interesting!

I think this issue is related to rustls/rustls#332 and kube-rs/kube#542. We see:

:; k get secrets test-cert -o json | jq -r '.data["tls.key"] | @base64d'
-----BEGIN EC PRIVATE KEY-----
...
-----END EC PRIVATE KEY-----

But Rust's crypto libraries do not currently support PEM-formatted ECDSA private keys. See more discussion here.

For the time being we'll have to require that webhook credentials are RSA (or that ECDSA keys are provided in DER format, though I doubt cert-manager supports that out of the box). We'll probably want to follow up on djc's suggestion to implement a standalone PEM decoder for these cases.
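
As a minimal sketch of the failure mode (assuming the rustls-pemfile 1.x API; this is not Linkerd's actual key loader), a PKCS#8-only loader simply skips a SEC1 "EC PRIVATE KEY" block, leaving the server without a usable key:

// Minimal sketch, assuming rustls-pemfile 1.x; not Linkerd's actual code.
use std::{fs::File, io::BufReader};

fn load_pkcs8_keys(path: &str) -> std::io::Result<Vec<Vec<u8>>> {
    let mut reader = BufReader::new(File::open(path)?);
    // Collects only `-----BEGIN PRIVATE KEY-----` (PKCS#8) sections.
    // A key wrapped as `-----BEGIN EC PRIVATE KEY-----` (SEC1) is skipped,
    // so for cert-manager's default ECDSA output this returns an empty Vec.
    rustls_pemfile::pkcs8_private_keys(&mut reader)
}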


olix0r commented Oct 19, 2021

Actually...

If I configure my cert-manager certificate with:

  privateKey:
    algorithm: ECDSA
    encoding: PKCS8

I get credentials with:

:; k get secrets test-cert -o json | jq -r '.data["tls.key"] | @base64d'  |head -n 1
-----BEGIN PRIVATE KEY-----

Which might work. We'll give this a try later, or if you can try it and report back, that might save us some time ;)


olix0r commented Oct 19, 2021

From some very brief testing, this seems to work. We should update the docs at https://linkerd.io/2.11/tasks/automatically-rotating-webhook-tls-credentials/ (https://github.com/linkerd/website) -- which doesn't even document the policy controller webhook config at the moment -- to reflect this.


olix0r commented Oct 20, 2021

Docs updated linkerd/website#1221


olix0r commented Oct 20, 2021

@sigurdfalk The updated docs at https://linkerd.io/2.11/tasks/automatically-rotating-webhook-tls-credentials/#issuing-certificates-and-writing-them-to-secrets should get you a working cluster with 2.11. It would be great if you could confirm that this all works as expected (i.e., by adding encoding: PKCS8); after that, I think we can close this issue :)

@sigurdfalk (Author)

@olix0r That's great, thank you! I'm gonna try verifying this tomorrow 🙏🏻


sigurdfalk commented Oct 21, 2021

@olix0r The policy controller now accepts the cert with ECDSA; however, we are still seeing the same errors 🧐

I did a fresh install of v2.10.2 and then tried to upgrade to v2.11.0 with encoding: PKCS8 in the certificate. I still think this is really strange, as all other components of Linkerd seem to work as expected and are able to communicate with the API Server.

New dump of errors:

{"timestamp":"2021-10-21T08:08:47.991233Z","level":"INFO","fields":{"message":"gRPC server listening","addr":"0.0.0.0:8090"},"target":"linkerd_policy_controller","spans":[{"addr":"0.0.0.0:8090","cluster_networks":"[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]","name":"grpc"}]}
{"timestamp":"2021-10-21T08:08:47.991257Z","level":"INFO","fields":{"message":"HTTP admin server listening","addr":"0.0.0.0:9990"},"target":"linkerd_policy_controller::admin","spans":[{"addr":"0.0.0.0:9990","name":"serve"}]}
{"timestamp":"2021-10-21T08:08:47.991365Z","level":"INFO","fields":{"message":"Admission controller server listening","addr":"0.0.0.0:9443"},"target":"linkerd_policy_controller"}
{"timestamp":"2021-10-21T08:08:48.000102Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"pods"}]}
{"timestamp":"2021-10-21T08:08:48.000235Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:48.000384Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"servers"}]}
{"timestamp":"2021-10-21T08:08:48.000448Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:48.000550Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"serverauthorizations"}]}
{"timestamp":"2021-10-21T08:08:48.000576Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.000943Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.001056Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.002210Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}

And we still see the requests coming in to our API Server, which responds with status 200 OK. Here is a line from the API server log tracing back to the policy controller requesting /apis/split.smi-spec.io/v1alpha1/trafficsplits:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"359d181f-47bc-4bc9-ba3a-d2b52c9099bb","stage":"ResponseComplete","requestURI":"/apis/split.smi-spec.io/v1alpha1/trafficsplits?limit=500\u0026resourceVersion=0","verb":"list","user":{"username":"system:serviceaccount:linkerd:linkerd-destination","uid":"4820cbb9-bb93-4315-8121-1a83dd678e0d","groups":["system:serviceaccounts","system:serviceaccounts:linkerd","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["linkerd-destination-756758f5b-xfn6f"],"authentication.kubernetes.io/pod-uid":["0fe069f2-6624-445d-82de-28720cc0b957"]}​},"sourceIPs":["***"],"userAgent":"controller/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"trafficsplits","apiGroup":"split.smi-spec.io","apiVersion":"v1alpha1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-10-21T08:08:09.650095Z","stageTimestamp":"2021-10-21T08:08:09.651179Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"linkerd-linkerd-destination\" of ClusterRole \"linkerd-linkerd-destination\" to ServiceAccount \"linkerd-destination/linkerd\""}​}

Not sure if relevant, but we have an Azure Firewall between the cluster and the API Server. But this has never been an issue before.


olix0r commented Oct 21, 2021

@sigurdfalk the new policy controller is written in Rust using kube-rs, whereas the older components are implemented in Go with client-go. So we're still working through some quirks of using the new library. We really appreciate the helpful feedback, though. I'm sorry that this upgrade hasn't been seamless -- hopefully we'll figure out how to catch this class of issue better.
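
To illustrate what the policy controller is doing when it hits "tls handshake eof", here is a minimal kube-rs sketch (not the controller's actual code; it assumes kube built with the rustls-tls feature, plus tokio and anyhow):

// Minimal sketch of the initial list that precedes each watch.
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, Api, Client};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Infers in-cluster or kubeconfig settings; TLS goes through rustls here.
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::all(client);
    // This initial LIST is where the watch loop reports
    // "failed to perform initial object list: ... tls handshake eof".
    let pod_list = pods.list(&ListParams::default()).await?;
    println!("saw {} pods", pod_list.items.len());
    Ok(())
}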

Can you share the output of:

:; kubectl get secret $(kubectl get sa default -o json |jq  -r '.secrets[0].name') -o json |jq -r '.data["ca.crt"] | @base64d'   

I'm curious about the parameters used in the API server's CA certificate... We've definitely seen Linkerd 2.11 working fine in AKS...

And we still see the requests coming in to our API Server, which responds with status 200 OK. Here is a line from the API server log tracing back to the policy controller requesting /apis/split.smi-spec.io/v1alpha1/trafficsplits:

This is probably pedantic, but this is actually from the destination controller, not the policy controller (though both run in the same pod). So, I agree that probably confirms that it's not a firewall issue.

You could try installing 2.11 with policyController.logLevel=trace -- that should dump a lot more diagnostics in your logs, at least. Though, I'm not sure if that will give us a clearer error...

@sigurdfalk (Author)

@olix0r We really appreciate Linkerd and are happy to keep debugging this 🙂

Cert output is:

-----BEGIN CERTIFICATE-----
MIIE6TCCAtGgAwIBAgIRALsWkwnXbn6R2S/lNa+u6b4wDQYJKoZIhvcNAQELBQAw
DTELMAkGA1UEAxMCY2EwIBcNMjExMDIwMTIxNzMyWhgPMjA1MTEwMjAxMjI3MzJa
MA0xCzAJBgNVBAMTAmNhMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA
pumk/WT2PBil2FpdmJfeOnLVEJ7JbnvRsBNkgH9gJYX3PenWYw2C21GeayfxaKEl
UEq8n2hPIma2oCXyiDocFElALdx/Rerb19Kq7myVGY/AVhMOUBRxHnAcNoT7DWJ8
NZTYRn+BZ/LPp/RfGTTqffhUM/c+p/wy07ylfrrZZoE4fPOT8h9QV8vMR7AuGaQW
Gfl1qS4srYBpdssHCPzcadiuiZZnZPLyObBUA5BxOoqEZ7ZMRa0WSJ1Fcxy8DwYF
79NSCLDu+fHy4rpynW6Ps4xjS0xqvMD9A6cmdXgtEbPsSiX7+275+omB54QA8KCi
dBfqYjx7PvL4awTFwLwXQwiUpDvUnbdx/5LESsazjuDuJT3aTknhrzUToBfGfpk9
uYdWzV5dh0K69tth1v+bLRrOU/QgbzObKwHOnalp7GnWtkdtgBDA7eYWCdDHUwSd
ke0Q+aOlIh4P/gy3PpviJHJ3WHwfOrl9mSZxjRg6zPDpQLsuxwm9k0WOBv77nJYO
KK00cqYxC7QXvDn0Ymi2FkSzNE+NJZm6YGSx45ukEQvhiGrd3LL8gYQR6/i1f7NW
01u8djXBXJf+d4zqIky92AQyc3m0chck6IGv7DNzqSTs1g9U9cgwX+kCtYpwSR4R
gwezWANMpv/9Pt0V41PnHzeRuS+hsigbTdYfaUykWScCAwEAAaNCMEAwDgYDVR0P
AQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFCdfVS0vRvZJdJ9L
issw2JdFtfCpMA0GCSqGSIb3DQEBCwUAA4ICAQACi+vkUX/SA4wiZJBs2NKbIY9E
Vw9gG0iKyq28gQUZU1pnPPC1YbZEDuOq49MS4Mj8OrxwwGCwPfEh4vDm/PodPukT
Fn6KVhmct5/6xa7PdM8gcjUqRDc7awd9cSoxbwXkdDSBl2wTjoXcpxP4wQ9zqpdX
Sejo4+R2EeNkImZ0Xsyj2wk2benT1lbbWt9UjYFIfTaWOPPlCEGYpAbPsWFRS5YQ
W990JfzXPmx10j7xeOK2Am8RFzpMxvS0nZeUo7leR6L28r0MQvWKwQmkHAuv4Eej
NrunHdbjLWD6Rfn1lQ9yAJ6PJrxJPbFtQ3iSyowBzP6RQrX1Q1E+mRSJESaQg9dl
s1MThhRojC5TLmKKELqAOvAXGHRm1MhceQdrd5MS2XJ8Qedjjs7BecP+b0+AwggY
qLXkT4GRdEaUZRjRWCeQBmIYeVyKRpvr84O+p2x8JytH4RL7iqs7tC4ESf76xdG1
ZIiAEIYP1SklzE2Elw0v3uLMqu7USw3AyPdIMC39+uI6drLIfs1pV+txIalZ99sB
CokFYvQZoSWJPhB4mOqMjqRykvVKRRiw0dsl7TwuT3IGgclj+pDJ9lk4cwHoLJqH
PMP88tZijYTDnF7WQN8ztdw1XttwepdrCNB18X+Hqy1g9FUEDzo+Tfh8eBqO9bwH
C0jYoxWTnYTI9oC7vw==
-----END CERTIFICATE-----

The logs got really big when enabling TRACE, so, lacking knowledge of the Rust application, it's a bit hard for me to make much sense of them. But I'll keep digging. I dumped all logs from application start in the attached file (a 1-second window):

1634894398_118505.txt


olix0r commented Oct 22, 2021

@sigurdfalk Thanks! That certificate looks basically normal, as do the logs. We'll do some digging on a working cluster and see if we can come up with any differences.

@sigurdfalk (Author)

@olix0r did you have some luck in your testing?


olix0r commented Nov 2, 2021

@sigurdfalk Sorry, I don't think we have any leads on this issue yet. It's still on our radar, though.


stale bot commented Jan 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.


sigurdfalk commented May 2, 2022

@olix0r we no longer have this issue with 2.11.2. It's running perfectly in our clusters as we speak :) So you can close this issue from our side.


olix0r commented May 2, 2022

@sigurdfalk Excellent. Thanks for confirming!

olix0r closed this as completed May 2, 2022
github-actions bot locked as resolved and limited conversation to collaborators Jun 2, 2022