
Daemonset crashloopback in openshift #404

Open

itmwiw opened this issue Apr 11, 2023 · 9 comments

itmwiw commented Apr 11, 2023

Hello,
I have an OpenShift cluster and I am trying to use the hetznercloud csi-driver. However, all of the DaemonSet's pods are in CrashLoopBackOff state. Here are the logs:

[pod/hcloud-csi-node-45xqq/hcloud-csi-driver] level=error ts=2023-04-11T14:33:12.085976239Z msg="failed to fetch server ID from metadata service" err="Get \"http://169.254.169.254/hetzner/v1/metadata/instance-id\": dial tcp 169.254.169.254:80: connect: connection refused"

I guess this is related to what is described in #143.
That issue was closed because version 1.6.0 tried the environment variable HCLOUD_SERVER_ID, or KUBE_NODE_NAME together with a call to the HCloudClient, before falling back to the MetadataClient.
However, v2.2.0 no longer does that, so I guess the issue is back.
Can you help me with this?
Regards,
Tarik

@apricote (Member)

Hey, this was changed in #269 so that we could remove access to the Hetzner Cloud API from the DaemonSet. We would prefer to keep the DaemonSet (the "node" binary) as small as possible, so adding back access to the API is not what we want.

@samcday Do you have an idea how we can solve this for OpenShift where access to the metadata service is blocked?

@apricote (Member)

Oh, I forgot to mention: the server ID and location, the two fields retrieved from the metadata service, are used in the response to NodeGetInfo:

csi-driver/driver/node.go

Lines 194 to 205 in cbb7750

func (s *NodeService) NodeGetInfo(context.Context, *proto.NodeGetInfoRequest) (*proto.NodeGetInfoResponse, error) {
    resp := &proto.NodeGetInfoResponse{
        NodeId:            s.serverID,
        MaxVolumesPerNode: MaxVolumesPerNode,
        AccessibleTopology: &proto.Topology{
            Segments: map[string]string{
                TopologySegmentLocation: s.serverLocation,
            },
        },
    }
    return resp, nil
}

@samcday (Contributor) commented Apr 12, 2023

Hm. Tricky one. My original hope was to use k8s Node metadata as source of truth for this, thus tying csi-driver to hccm. But of course that violates the CSI abstraction and won't work for other container orchestrators.

Ultimately, without assuming any access to a control plane / orchestrator API of any kind, the only ways for us to determine this information from a particular node are to fetch it from the metadata service or to fall back to statically provided information.

... Or we just add back the HCLOUD_TOKEN requirement for the node binary, so that it can fetch this info from the API. That would be a bummer from a purist technical point of view, but maybe it's the only way we can keep the CSI driver running reliably (and reasonably ergonomically!) across multiple orchestrators.

@samcday (Contributor) commented Apr 12, 2023

One other somewhat hacky idea: we could do the metadata API lookup in a small initContainer that uses hostNetwork: true and then pass that information along to the main (not host-networking) process.
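
A rough, purely illustrative sketch of that idea (not the chart's actual manifest; container names and images are placeholders). Since hostNetwork is a pod-level field in Kubernetes, the sketch sets it on the pod spec; the init container fetches the instance ID from the metadata service and hands it to the main container via a shared emptyDir:

# Hypothetical sketch of the init-container lookup, not the real deployment.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hcloud-csi-node
spec:
  selector:
    matchLabels:
      app: hcloud-csi-node
  template:
    metadata:
      labels:
        app: hcloud-csi-node
    spec:
      # Note: hostNetwork applies to every container in the pod.
      hostNetwork: true
      initContainers:
        - name: fetch-metadata            # hypothetical name
          image: curlimages/curl          # illustrative image choice
          command:
            - sh
            - -c
            # Write the instance ID from the metadata service to the shared volume.
            - curl -sf http://169.254.169.254/hetzner/v1/metadata/instance-id > /metadata/instance-id
          volumeMounts:
            - name: metadata
              mountPath: /metadata
      containers:
        - name: hcloud-csi-driver
          image: hetznercloud/hcloud-csi-driver   # placeholder, tag omitted
          # The main process would read /metadata/instance-id instead of
          # querying the metadata service itself.
          volumeMounts:
            - name: metadata
              mountPath: /metadata
      volumes:
        - name: metadata
          emptyDir: {}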

@apricote (Member)

> One other somewhat hacky idea: we could do the metadata API lookup in a small initContainer that uses hostNetwork: true and then pass that information along to the main (not host-networking) process.

Perhaps this is something that can be done only for OpenShift through the Helm Chart?

@samcday (Contributor) commented Apr 12, 2023

> Perhaps this is something that can be done only for OpenShift through the Helm Chart?

Yes, that sounds good 👍 Or even more generally: just a thing that you can opt into through values.yaml: helm install csi-driver --set initMetadataLookup=true or somesuch.


That said, it might just be better to always do it that way and keep the number of different deployment modes to a minimum. With such an approach, the node binary could drop all notion of the HC API or the metadata service and require that all necessary metadata/topology info is injected through env. Some of this env would come from the downward API, and the rest from the proposed init container.
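
Purely as a sketch of the env-injection part (initMetadataLookup above is a hypothetical value name, not an existing chart option): KUBE_NODE_NAME can come from the downward API, while the server ID/location would have to be read from a file written by the proposed init container, since an init container cannot set another container's environment directly.

# Illustrative container env snippet using the downward API.
env:
  - name: KUBE_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName   # injects the Kubernetes node name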

@alrf commented Apr 19, 2023

I have the same issue in OpenShift.

@alrf commented May 5, 2023

I solved it in v2.3.2 by changing the feature gate here from

- --feature-gates=Topology=true

to Topology=false, and by adding hostNetwork: true to the DaemonSet on line 298 (see the sketch below).
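
A minimal sketch of the hostNetwork part of that workaround, assuming the DaemonSet is named hcloud-csi-node (matching the pod names in the logs above) and deployed to kube-system; running the node pods on the host network makes the metadata service at 169.254.169.254 reachable again:

# hostnetwork-patch.yaml -- strategic merge patch (name and namespace are assumptions)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hcloud-csi-node
  namespace: kube-system
spec:
  template:
    spec:
      hostNetwork: true

Applied with, for example, kubectl -n kube-system patch daemonset hcloud-csi-node --patch-file hostnetwork-patch.yaml; the --feature-gates=Topology=false change still has to be made in the manifest itself.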

@github-actions bot commented Aug 4, 2023

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions bot added the Stale label on Aug 4, 2023
@jooola added the pinned label and removed the Stale label on Aug 4, 2023