
CSI plugins sent incorrect authority headers during registration with kubelet #108254

Closed

EricRnR opened this issue Feb 21, 2022 · 6 comments · Fixed by #112597

Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

EricRnR commented Feb 21, 2022

What happened?

When using the CSI node-driver-registrar sidecar container, the kubelet-registration-path parameter is set to either unix:///path/to/unix.sock or /path/to/unix.sock. At least one of these forms should result in kubelet sending a valid :authority header over the socket. In the former case, kubelet fails to find the socket file because it passes the full unix:-prefixed string as the net.Dialer target. In the latter, the dialer successfully connects to the container over the socket but sends an invalid :authority pseudo-header (the raw /path/to/unix.sock).
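
A minimal sketch, assuming grpc-go (this is not kubelet's actual code), of how a custom dialer produces the two failure modes, given that the kubelet logs below show the target reaching net.Dial verbatim:

package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialPlugin illustrates both failure modes; the target values mirror the
// two kubelet-registration-path forms above.
func dialPlugin(target string) (*grpc.ClientConn, error) {
	dialer := func(ctx context.Context, addr string) (net.Conn, error) {
		// Form 1: addr == "unix:///path/to/unix.sock". The scheme-prefixed
		// string is handed to net.Dial as if it were a file path, which
		// fails with "no such file or directory".
		// Form 2: addr == "/path/to/unix.sock". The dial succeeds, but with
		// no recognized scheme grpc-go derives the :authority pseudo-header
		// from the raw path, which is not a valid authority.
		return (&net.Dialer{}).DialContext(ctx, "unix", addr)
	}
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(dialer))
}

func main() {
	// Neither form yields a working registration; see the logs below.
	conn, err := dialPlugin("/path/to/unix.sock")
	log.Printf("conn=%v err=%v", conn, err)
}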

What did you expect to happen?

A call into the CSI container with a valid :authority header, using either form of the kubelet-registration-path parameter.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a CSI plugin example, setting kubelet-registration-path on the node-driver-registrar sidecar container to either unix:///path/to/unix.sock or /path/to/unix.sock. Note that a plugin may not fail in the latter case if it is written in a language whose HTTP/2 library does not strictly validate the authority header; the header will still be incorrect. In Rust, the h2 library validates the authority strictly and returns a protocol error. Go does not appear to reject the invalid authority header, which is perhaps why most plugins do not notice the issue.
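
For plugins written in Go (which, as noted, will not reject the bad header), one way to make the repro visible is to log the :authority value each call arrives with. A small diagnostic sketch, assuming grpc-go surfaces :authority through the incoming metadata and using a hypothetical socket path:

package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// logAuthority prints the :authority pseudo-header for every unary call.
func logAuthority(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (interface{}, error) {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		log.Printf("%s arrived with :authority=%v", info.FullMethod, md.Get(":authority"))
	}
	return handler(ctx, req)
}

func main() {
	lis, err := net.Listen("unix", "/tmp/diag.sock") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer(grpc.UnaryInterceptor(logAuthority))
	// Register the plugin's gRPC services here before serving.
	log.Fatal(s.Serve(lis))
}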

Anything else we need to know?

Using /path/to/unix.sock means the target has no unix: prefix. Kubelet checks for that prefix before substituting 'localhost' as the authority here, so the substitution does not happen in this case.

Using unix:///path/to/unix.sock does get the authority substituted, but the full unix:-prefixed string is passed to the dialer as the file path. Related code can be seen here (non-nil custom dialer) and here (newGrpcConn appears to expect no unix: prefix, based on the log entry and the externally supplied DialContext). This work may have overlapped with related work in grpc-go here and plans here; both libraries now seem to be taking responsibility for managing the authority header for unix sockets.
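
A hedged sketch of the direction the grpc-go work points toward, assuming grpc-go >= 1.34 (which added a built-in unix resolver that strips the scheme and substitutes a valid authority); option B with grpc.WithAuthority is an alternative workaround, not necessarily what the linked kubelet code does:

package main

import (
	"context"
	"log"
	"net"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	target := "unix:///var/lib/kubelet/plugins/csi-thinsync/csi.sock"

	// Option A: pass the scheme-prefixed target with no custom dialer and
	// let grpc-go's unix resolver dial the socket and set the authority.
	connA, err := grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer connA.Close()

	// Option B: keep a custom dialer, but trim the scheme before dialing
	// and pin a valid authority so the raw path never becomes :authority.
	unixDialer := func(ctx context.Context, addr string) (net.Conn, error) {
		return (&net.Dialer{}).DialContext(ctx, "unix", addr)
	}
	connB, err := grpc.Dial(strings.TrimPrefix(target, "unix://"),
		grpc.WithAuthority("localhost"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(unixDialer))
	if err != nil {
		log.Fatal(err)
	}
	defer connB.Close()
}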

Kubelet logs when using the unix: prefix:

Feb 19 14:41:11 server001.ga.racksandrails.net kubelet[238934]: I0219 14:41:11.454702  238934 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: thin-sync-csi.racksandrails.com endpoint: unix:///var/lib/kubelet/plugins/csi-thinsync/csi.sock versions: 1.0.0
Feb 19 14:41:11 server001.ga.racksandrails.net kubelet[238934]: I0219 14:41:11.454787  238934 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: thin-sync-csi.racksandrails.com at endpoint: unix:///var/lib/kubelet/plugins/csi-thinsync/csi.sock
Feb 19 14:41:11 server001.ga.racksandrails.net kubelet[238934]: W0219 14:41:11.455279  238934 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/csi-thinsync/csi.sock localhost 0xc0044710c0 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix unix:///var/lib/kubelet/plugins/csi-thinsync/csi.sock: connect: no such file or directory". Reconnecting...
Feb 19 14:41:11 server001.ga.racksandrails.net kubelet[238934]: W0219 14:41:11.455476  238934 csi_client.go:184] Error calling CSI NodeGetInfo(): rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix unix:///var/lib/kubelet/plugins/csi-thinsync/csi.sock: connect: no such file or directory"
Feb 19 14:41:11 server001.ga.racksandrails.net kubelet[238934]: E0219 14:41:11.478062  238934 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/thin-sync-csi.racksandrails.com-reg.sock" failed. No retries permitted until 2022-02-19 14:41:11.978009921 -0500 EST m=+3309.907550017 (durationBeforeRetry 500ms). Error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix unix:///var/lib/kubelet/plugins/csi-thinsync/csi.sock: connect: no such file or directory": rpc error: code = Unavailable desc = error reading from server: EOF
Feb 19 14:41:11 server001.ga.racksandrails.net kubelet[238934]: W0219 14:41:11.478162  238934 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/thin-sync-csi.racksandrails.com-reg.sock /var/lib/kubelet/plugins_registry/thin-sync-csi.racksandrails.com-reg.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/thin-sync-csi.racksandrails.com-reg.sock: connect: connection refused". Reconnecting...

Of note are the inconsistent 'Error while dialing dial unix' entries: one shows the unix: prefix on the path, while the registration socket shows it without. The 'localhost' substitution is also visible in the logs for the first (unix:-prefixed) call, but not for the second (non-prefixed) registration notification call.

Node registration sidecar logs when not using the unix: prefix:

I0221 13:07:20.377752 1 main.go:167] Running node-driver-registrar in mode=registration
I0221 13:07:20.378223 1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0221 13:07:20.378237 1 connection.go:154] Connecting to unix:///csi/csi.sock
I0221 13:07:20.378554 1 main.go:198] Calling CSI driver to discover driver name
I0221 13:07:20.378565 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0221 13:07:20.378568 1 connection.go:184] GRPC request: {}
I0221 13:07:20.380870 1 connection.go:186] GRPC response: {"name":"thin-sync-csi.racksandrails.com","vendor_version":"0.1"}
I0221 13:07:20.380918 1 connection.go:187] GRPC error: <nil>
I0221 13:07:20.380924 1 main.go:208] CSI driver name: "thin-sync-csi.racksandrails.com"
I0221 13:07:20.380954 1 node_register.go:53] Starting Registration Server at: /registration/thin-sync-csi.racksandrails.com-reg.sock
I0221 13:07:20.381066 1 node_register.go:62] Registration Server started at: /registration/thin-sync-csi.racksandrails.com-reg.sock
I0221 13:07:20.381106 1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0221 13:07:21.953021 1 main.go:102] Received GetInfo call: &InfoRequest{}
I0221 13:07:21.953276 1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi-thinsync/registration"
I0221 13:07:21.961349 1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: PROTOCOL_ERROR,}
E0221 13:07:21.961377 1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: PROTOCOL_ERROR, restarting registration container.

Of note, kubelet notifies the container that it received a protocol error. The CSI container's Rust logs show the matching protocol error and the offending :authority header value:

[2022-02-21T15:05:56Z DEBUG h2::server] malformed headers: malformed authority (b"/var/lib/kubelet/plugins/csi-thinsync/csi.sock"): invalid uri character
[2022-02-21T15:05:56Z DEBUG h2::codec::framed_read] received frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) }
[2022-02-21T15:05:56Z DEBUG h2::codec::framed_write] send frame=Reset { stream_id: StreamId(1), error_code: PROTOCOL_ERROR }

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:30:48Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:32:02Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Bare metal

OS version

# On Linux:
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="35 (Server Edition)"
ID=fedora
VERSION_ID=35
VERSION_CODENAME=""
PLATFORM_ID="platform:f35"
PRETTY_NAME="Fedora Linux 35 (Server Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:35"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f35/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=35
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=35
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Server Edition"
VARIANT_ID=server

$ uname -a
Linux server001.ga.racksandrails.net 5.15.14-200.fc35.x86_64 #1 SMP Tue Jan 11 16:49:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

cri-o

Related plugins (CNI, CSI, ...) and versions (if applicable)

Calico, MetalLB, BGP; applicable 1.23 versions.
@EricRnR added the kind/bug label Feb 21, 2022
@k8s-ci-robot added the needs-sig label Feb 21, 2022
@k8s-ci-robot (Contributor)

@EricRnR: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label Feb 21, 2022
EricRnR (Author) commented Feb 21, 2022

/sig storage

@k8s-ci-robot added sig/storage and removed needs-sig labels Feb 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jun 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
