
Calico and HCC #641

Open
medicol69 opened this issue May 8, 2024 · 13 comments
Labels: enhancement (New feature or request)

Comments

@medicol69

TL;DR

This is more of an inquiry, since it's not clear from the documentation: does the Hetzner cloud controller work with the Calico CNI when using the private interfaces on Hetzner? Thanks

Expected behavior

This is an inquiry about the documentation.

medicol69 added the enhancement label May 8, 2024
@apricote
Member

If you use the private networks from Hetzner Cloud with hcloud-cloud-controller-manager and enable the routes-controller (the default), you should be able to use Calico without any additional overlay network. You can configure this in Calico with CALICO_NETWORKING_BACKEND=none.

I have never personally tested this configuration though.
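For reference, with a manifest-based Calico install this setting is an environment variable on the calico-node container; a minimal sketch (image tag is a placeholder, and only the relevant env var is shown):

# Excerpt from the calico-node DaemonSet spec
containers:
  - name: calico-node
    image: docker.io/calico/node:v3.28.0   # placeholder version
    env:
      # Disable Calico's own overlay/BGP networking; pod routes are
      # programmed by hcloud-cloud-controller-manager's routes-controller.
      - name: CALICO_NETWORKING_BACKEND
        value: "none"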

@simonostendorf
Contributor

I am also interested in this topic. If you have any knowledge @medicol69, please let me know :)

@DeprecatedLuke

DeprecatedLuke commented Jun 2, 2024

Yes, it works fine with Calico. To run a quick test, use hetzner-k3s.

An important warning when running cloud servers together with bare-metal servers on private networking: Calico requires a /24 address block per node, so when you create the subnet, make sure the vLAN subnet is at minimum a /23 (1 node max per half) or ideally a /17 (127 nodes max; a /17 holds 2^(24-17) = 128 /24 blocks), allocating the first half to cloud instances and the second half to bare-metal instances.

@medicol69
Author

Thanks, but I don't think the Hetzner private network interfaces are stable enough to use in production. If anyone has gotten them to work and can share an example of how to use them in prod, I'm all ears.

@DeprecatedLuke

I am currently running it just fine with Calico and even have Ceph working over the vLAN with pretty good performance. You cannot advertise the node IP on the internal interface, so define a HostEndpoint instead so that metrics and etcd are protected. Load balancers also require you to use the public network in this case.
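A minimal HostEndpoint sketch for that setup (node name, interface name, IP, and label are placeholders):

apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: node1-enp7s0
  labels:
    host-endpoint: "true"      # example label to select in (Global)NetworkPolicy
spec:
  node: node1                  # placeholder node name
  interfaceName: enp7s0        # placeholder private interface
  expectedIPs:
    - 10.0.0.2                 # placeholder private IP of the node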

@simonostendorf
Contributor

simonostendorf commented Jun 3, 2024

I am using Calico without encapsulation and HCCM with routes enabled. Calico uses the BPF dataplane and replaces kube-proxy.

I think this works well, but I haven't tested it enough to be 100% sure.

If you have any feedback on this configuration, I would love to discuss it :)

calico-tigera-operator-values.yaml

installation:
  cni:
    type: Calico
    ipam:
      type: HostLocal # use podCIDR assigned by kube-controller-manager, that is also used by route-controller in hcloud-cloud-controller-manager
  calicoNetwork:
    bgp: Enabled
    linuxDataplane: BPF
    hostPorts: Disabled
    ipPools:
      - name: default-ipv4
        cidr: 10.0.0.0/16
        encapsulation: None
        blockSize: 24
        natOutgoing: Enabled
        nodeSelector: all()
defaultFelixConfiguration:
  enabled: true
  bpfEnabled: true
  bpfExternalServiceMode: DSR
  bpfKubeProxyIptablesCleanupEnabled: true
kubernetesServiceEndpoint:
  host: api.my-cluster.domain.tld
  port: 6443
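
Assuming the values above are saved as calico-tigera-operator-values.yaml, installing the chart would look something like:

helm repo add projectcalico https://docs.tigera.io/calico/charts
helm install calico projectcalico/tigera-operator \
  --namespace tigera-operator --create-namespace \
  --values calico-tigera-operator-values.yaml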

@DeprecatedLuke

I am not sure why, but when using hetzner-k3s the internal network works just fine; however, a manually bootstrapped cluster has an issue where the cloud controller does not recognize the internal IP address, so the node never gets its taint removed or its labels added.

I spent a few hours trying to figure out why, without being able to find any difference between the two configurations. My only guess is that it is some internal ordering issue where the metadata/private-network endpoints are not parsed in order.

So to recap: allocate at least a /16 vLAN range and do not use the hcloud controller (you will not be able to use the load balancer or have labels resolved automatically).

@simonostendorf
Contributor

simonostendorf commented Jun 4, 2024

> I am not sure why, but when using hetzner-k3s the internal network works just fine; however, a manually bootstrapped cluster has an issue where the cloud controller does not recognize the internal IP address, so the node never gets its taint removed or its labels added.

What Kubernetes version do you use? Kubernetes 1.29 introduced a change where the node IP is left empty if the cloud provider is set to external and --node-ip is not set manually. Maybe that is the case here.

From CHANGELOG-1.29: "kubelet, when using --cloud-provider=external, will now initialize the node addresses with the value of --node-ip, if it exists, or wait for the cloud provider to assign the addresses." (https://github.com/kubernetes/kubernetes/pull/121028, @aojea)
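
If that change is the cause, pinning the address explicitly should work around it. For k3s this is the --node-ip flag shown later in this thread; on Debian/Ubuntu kubeadm installs, the kubelet reads extra flags from /etc/default/kubelet. A sketch (IP is a placeholder):

# /etc/default/kubelet (placeholder IP)
KUBELET_EXTRA_ARGS=--node-ip=10.0.0.2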

@medicol69
Author

> I am currently running it just fine with Calico and even have Ceph working over the vLAN with pretty good performance. You cannot advertise the node IP on the internal interface, so define a HostEndpoint instead so that metrics and etcd are protected. Load balancers also require you to use the public network in this case.

I was thinking of private networking on Hetzner. If anyone is doing that in production, please share your config and your experiences.

@simonostendorf
Contributor

> I was thinking of private networking on Hetzner. If anyone is doing that in production, please share your config and your experiences.

I am currently testing this. You can see my Calico values above. The HCCM configuration is the standard one with networking enabled.
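
For reference, a minimal sketch of such HCCM Helm values; the networking.enabled key is from the chart's values.yaml at the time of writing, while the network itself is supplied via the hcloud secret (see the secret step later in this thread):

networking:
  enabled: true   # enables private-network support and the routes-controller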

@DeprecatedLuke

DeprecatedLuke commented Jun 4, 2024

> I am not sure why, but when using hetzner-k3s the internal network works just fine; however, a manually bootstrapped cluster has an issue where the cloud controller does not recognize the internal IP address, so the node never gets its taint removed or its labels added.

> What Kubernetes version do you use? Kubernetes 1.29 introduced a change where the node IP is left empty if the cloud provider is set to external and --node-ip is not set manually. Maybe that is the case here.
>
> From CHANGELOG-1.29: "kubelet, when using --cloud-provider=external, will now initialize the node addresses with the value of --node-ip, if it exists, or wait for the cloud provider to assign the addresses." (https://github.com/kubernetes/kubernetes/pull/121028, @aojea)

I tried both 1.29 and 1.30, here's my init script:

k3sup install --host $SERVER_HOST --ip $PUBLIC_IP --user root --ssh-key=~/.ssh/id_ed25519 --cluster --local-path ~/.kube/config --merge --context $CLUSTER --no-extras --k3s-channel latest --k3s-extra-args "\
--disable local-storage \
--disable metrics-server \
--disable-cloud-controller \
--kubelet-arg='provider-id=hcloud://$PROVIDER_ID' \
--kubelet-arg='cloud-provider=external' \
--flannel-backend=none \
--disable-network-policy \
--write-kubeconfig-mode=644 \
--cluster-domain=$CLUSTER_DOMAIN \
--cluster-cidr=$CLUSTER_CIDR \
--service-cidr=$CLUSTER_SERVICE_CIDR \
--cluster-dns=$CLUSTER_DNS \
--node-name=$SERVER_HOSTNAME \
--node-ip=$PRIVATE_IP \
--node-external-ip=$PUBLIC_IP \
--tls-san=$CLUSTER_LB \
--tls-san=$PRIVATE_IP \
--tls-san=$PUBLIC_IP \
--tls-san=$CLUSTER_DOMAIN \
--node-taint=CriticalAddonsOnly=true:NoExecute \
--etcd-expose-metrics='true' \
--kube-controller-manager-arg='bind-address=0.0.0.0' \
--kube-proxy-arg='metrics-bind-address=0.0.0.0' \
--kube-scheduler-arg='bind-address=0.0.0.0' \
" --print-command

EDIT: added --node-ip=$PRIVATE_IP; the configuration without it is what I am currently using to get around the issue.

> I am currently running it just fine with Calico and even have Ceph working over the vLAN with pretty good performance. You cannot advertise the node IP on the internal interface, so define a HostEndpoint instead so that metrics and etcd are protected. Load balancers also require you to use the public network in this case.

> I was thinking of private networking on Hetzner. If anyone is doing that in production, please share your config and your experiences.

Yes, it does work, including networking and routes, out of the box when using the hetzner-k3s tool. But I had issues getting HCCM to recognize the nodes when defining an internal IP as the node address while bootstrapping the cluster manually. Using the public IP works fine, however (and routes are still created for internal communication). Robot does not support networking from HCCM.

@simonostendorf
Contributor

> Yes, it does work, including networking and routes, out of the box when using the hetzner-k3s tool. But I had issues getting HCCM to recognize the nodes when defining an internal IP as the node address while bootstrapping the cluster manually. Using the public IP works fine, however (and routes are still created for internal communication). Robot does not support networking from HCCM.

I am using kubeadm, only on hcloud nodes (currently no dedicated/Robot nodes; maybe I will add them later), and this works fine.
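
With kubeadm the node IP can be pinned the same way at init/join time; a sketch using the v1beta3 config API (IP is a placeholder):

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external   # required for hcloud-cloud-controller-manager
    node-ip: 10.0.0.2          # placeholder private IP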

@DeprecatedLuke

DeprecatedLuke commented Jun 4, 2024

Alright, here's the full guide to replicate the issue:
init_master.sh

#!/bin/bash

CLUSTER=$1
CLUSTER_DOMAIN=$2
SERVER_HOST=$3
CLUSTER_PRIVATE_NET=$4
CLUSTER_CIDR=$5
CLUSTER_SERVICE_CIDR=$6
CLUSTER_DNS=$7
CLUSTER_LB=$8

PUBLIC_IP=$(ssh $SERVER_HOST "curl checkip.amazonaws.com")
PRIVATE_IP=$(ssh $SERVER_HOST "ip route get $CLUSTER_PRIVATE_NET | awk '{print \$7}'")
PROVIDER_ID=$(ssh $SERVER_HOST "curl http://169.254.169.254/hetzner/v1/metadata/instance-id")

echo "Public IP: $PUBLIC_IP Private IP: $PRIVATE_IP"

kubectl config delete-cluster $CLUSTER
kubectl config delete-user $CLUSTER

SERVER_HOSTNAME=$(echo $SERVER_HOST | cut -d'.' -f1)

ssh -y $SERVER_HOST "curl https://packages.hetzner.com/hcloud/deb/hc-utils_0.0.4-1_all.deb -o /tmp/hc-utils_0.0.4-1_all.deb -s && apt -y install /tmp/hc-utils_0.0.4-1_all.deb"

k3sup install --host $SERVER_HOST --ip $PUBLIC_IP --user root --ssh-key=~/.ssh/id_ed25519 --cluster --local-path ~/.kube/config --merge --context $CLUSTER --no-extras --k3s-channel latest --k3s-extra-args "\
--disable local-storage \
--disable metrics-server \
--disable-cloud-controller \
--kubelet-arg='provider-id=hcloud://$PROVIDER_ID' \
--kubelet-arg='cloud-provider=external' \
--flannel-backend=none \
--disable-network-policy \
--write-kubeconfig-mode=644 \
--cluster-domain=$CLUSTER_DOMAIN \
--cluster-cidr=$CLUSTER_CIDR \
--service-cidr=$CLUSTER_SERVICE_CIDR \
--cluster-dns=$CLUSTER_DNS \
--node-name=$SERVER_HOSTNAME \
--node-ip=$PRIVATE_IP \
--node-external-ip=$PUBLIC_IP \
--tls-san=$CLUSTER_LB \
--tls-san=$PRIVATE_IP \
--tls-san=$PUBLIC_IP \
--tls-san=$CLUSTER_DOMAIN \
--node-taint=CriticalAddonsOnly=true:NoExecute \
--etcd-expose-metrics='true' \
--kube-controller-manager-arg='bind-address=0.0.0.0' \
--kube-proxy-arg='metrics-bind-address=0.0.0.0' \
--kube-scheduler-arg='bind-address=0.0.0.0' \
" --print-command

kubectl config set-cluster $CLUSTER --server=https://$CLUSTER_LB:6443
k3sup ready --context $CLUSTER # will fail since no CNI is installed yet

bash init_master.sh test-cluster cluster.local IP_ADDRESS 10.224.0.0 10.222.0.0/16 10.223.0.0/16 10.223.0.10 IP_ADDRESS

kubectl config use-context test-cluster

Install Calico:
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm repo update projectcalico
helm install cni projectcalico/tigera-operator -n tigera-operator --create-namespace

Create the HCCM secret with the network name/ID and the hcloud token.
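
Per the HCCM docs, this is a secret named hcloud in kube-system (token and network values are placeholders):

kubectl -n kube-system create secret generic hcloud \
  --from-literal=token=<hcloud-api-token> \
  --from-literal=network=<network-name-or-id>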

Install HCCM:
helm repo add hcloud https://charts.hetzner.cloud
helm repo update hcloud
helm install hccm hcloud/hcloud-cloud-controller-manager -n kube-system --values values.yaml

nodeSelector:
  node-role.kubernetes.io/control-plane: "true"

Observe the following error:

error syncing '*node*': failed to get node modifiers from cloud provider: provided node ip for node "*node*" is not valid: failed to get node address from cloud provider that matches ip: 10.224.0.2, requeuing
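
To compare what actually got registered against the server's private IP, inspect the node object (node name is a placeholder):

kubectl get node <node> -o jsonpath='{.status.addresses}'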

Edit: the actual hostname doesn't matter since the provider ID is specified; usually the hostname would be a domain matching the name of the node. The Calico step is optional.
