-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: initial assignment of alias IP in hcloud (Hetzner) #8493
base: main
Are you sure you want to change the base?
Conversation
Removes the throwing of the error that occurs if an alias IP is set but there is no network ID yet. This caused the initialization to end up in an error loop. Now the interface is initialized correctly and the alias IP is set correctly on the second run. Signed-off-by: Marcel Richter <mail@mrclrchtr.de>
Prep for Talos PR 8493 integration with commented-out config for enhanced networking. Refs: siderolabs/talos#8493
Prep for Talos PR 8493 integration with commented-out config for enhanced networking. Refs: siderolabs/talos#8493
Prep for Talos PR 8493 integration with commented-out config for enhanced networking. Refs: siderolabs/talos#8493
handler.logger.Error("error assigning Hetzner Cloud floating IP to server: floating IP is not found", zap.String("vip", handler.vip), | ||
zap.Int64("device_id", handler.deviceID), zap.Int64("network_id", handler.networkID)) | ||
|
||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is not an error anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, That's the only way the bootstrap worked.
I'm not sure if there is a better way.
Edit:
And yes, it's a misconfiguration on a not available Floating IP, but I don't think that every misconfiguration leads to an error that prevents to complete the bootstrap, right? An error is still logged in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does it prevent the bootstrap?
If the Acquire fails, it should be retried (if not, it's a different bug), so once whatever thing on Hetzner side is resolved, it should acquire successfully. If we return nil
without actually attaching an IP, this leads to a misconfiguration - Talos thinks it's ok, while it's not actually working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I tried it again and I was wrong, sorry. Bootstrapping is successful.
But I still get the logspam on every control-plane-node, regardless of whether I assign an alias IP when creating the server or not:
user: warning: [2024-04-01T12:20:12.956235002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
\"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-01T12:20:14.026183002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
\"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-01T12:20:15.348476002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
\"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
So what I can say is that when the Alias VIP is "first" initialized in Talos, the networkID
is 0 (i.e. not set). This means it skips the Alias IP
block and goes into the Floating IP
block, where of course it cannot find the floating IP because the one passed is an alias IP.
Once it is initialized, the networkID
is >0 (i.e. set) in subsequent runs and everything is working fine.
I don't know exactly where the AliasIP
and NetworkID
is set in the vip *VIP
. It comes from the operator loop
, but I don't know where it is started. Maybe that is where the problem is. Can you tell me where vip *VIP
is set? Then I'll try to analyze it further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the network ID should be set automatically from Hetzner metadata:
talos/internal/app/machined/pkg/controllers/network/operator/vip/hcloud.go
Lines 176 to 190 in e7d8041
spec.NetworkID = 0 | |
for _, privnet := range server.PrivateNet { | |
network, _, err := client.Network.GetByID(ctx, privnet.Network.ID) | |
if err != nil { | |
return fmt.Errorf("error getting network info: %w", err) | |
} | |
if network.IPRange.Contains(vip.AsSlice()) { | |
spec.NetworkID = privnet.Network.ID | |
break | |
} | |
} | |
the operator spec is not re-created unless there's some change that would force it to be re-generated
talosctl get operatorspecs -o yaml
to see what's there right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I tried it again:
the AliasIP is attached by me manually on server creation:
These are the operatorspecs:
❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
namespace: network
type: OperatorSpecs.net.talos.dev
id: dhcp4/eth0
version: 2
owner: network.OperatorMergeController
phase: running
created: 2024-04-03T21:17:23Z
updated: 2024-04-03T21:17:26Z
spec:
operator: dhcp4
linkName: eth0
requireUp: false
dhcp4:
routeMetric: 1024
layer: platform
---
node: 10.0.1.101
metadata:
namespace: network
type: OperatorSpecs.net.talos.dev
id: dhcp4/eth1
version: 1
owner: network.OperatorMergeController
phase: running
created: 2024-04-03T21:17:28Z
updated: 2024-04-03T21:17:28Z
spec:
operator: dhcp4
linkName: eth1
requireUp: true
dhcp4:
routeMetric: 1024
layer: configuration
---
node: 10.0.1.101
metadata:
namespace: network
type: OperatorSpecs.net.talos.dev
id: vip/eth1
version: 1
owner: network.OperatorMergeController
phase: running
created: 2024-04-03T21:17:28Z
updated: 2024-04-03T21:17:28Z
spec:
operator: vip
linkName: eth1
requireUp: true
vip:
ip: 10.0.1.100
gratuitousARP: false
hcloud:
deviceID: 639543
networkID: 0
apiToken: xxxx
layer: configuration
I'll have another look at the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just tested something else:
I have commented out the vip part in the machine config. The error messages then stopped. Then I commented it in again and the result is as follows:
user: warning: [2024-04-03T21:26:51.35453309Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
"link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:52.52451809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
"link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:53.64788609Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
"link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:54.70894809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
"link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:56.44824509Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
"link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:57.282386188Z]: [talos] apply config request: mode auto(no_reboot)
user: warning: [2024-04-03T21:26:57.288244188Z]: [talos] node IP skipped, please use .machine.kubelet.nodeIP to provide explicit subnet for the node IP {"component": "controller-runtime", "controller": "k8s.NodeIPController", "address": "10.0.1.101"}
user: warning: [2024-04-03T21:26:57.292704188Z]: [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 3972, container kubelet)
user: warning: [2024-04-03T21:26:57.294289188Z]: [talos] removed address 10.0.1.100/32 from "eth1" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
user: warning: [2024-04-03T21:26:57.319458188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.100:2380", "https://10.0.1.101:2380"], "member_id":
15664038745852706440}
user: warning: [2024-04-03T21:26:57.327769188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.101:2380"], "member_id": 15664038745852706440}
user: warning: [2024-04-03T21:26:57.409243188Z]: [talos] service[kubelet](Finished): Service finished successfully
user: warning: [2024-04-03T21:26:57.410459188Z]: [talos] service[kubelet](Starting): Starting service
user: warning: [2024-04-03T21:26:57.411070188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
user: warning: [2024-04-03T21:26:57.413742188Z]: [talos] service[kubelet](Failed): Condition failed: context canceled
user: warning: [2024-04-03T21:26:57.414781188Z]: [talos] service[kubelet](Starting): Starting service
user: warning: [2024-04-03T21:26:57.415342188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
user: warning: [2024-04-03T21:26:57.416269188Z]: [talos] service[kubelet](Preparing): Running pre state
user: warning: [2024-04-03T21:26:57.427521188Z]: [talos] service[kubelet](Preparing): Creating service runner
user: warning: [2024-04-03T21:26:57.523997188Z]: [talos] service[kubelet](Running): Started task kubelet (PID 5374) for container kubelet
user: warning: [2024-04-03T21:26:59.443244188Z]: [talos] service[kubelet](Running): Health check successful
user: warning: [2024-04-03T21:27:07.308966188Z]: [talos] controller failed {"component": "controller-runtime", "controller": "kubeaccess.EndpointController", "error": "error getting endpoints: Get \"https://127.0.0.1:7445/api/v1/namespaces/default/endpoints/talos\":
net/http: TLS handshake timeout"}
user: warning: [2024-04-03T21:27:20.195789188Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcontrol-
plane-1&resourceVersion=3950&timeout=8m48s&timeoutSeconds=528&watch=true\": http2: client connection lost"}
user: warning: [2024-04-03T21:27:42.453139188Z]: [talos] apply config request: mode auto(no_reboot)
user: warning: [2024-04-03T21:27:43.711860188Z]: [talos] cleared previous Hetzner Cloud IP alias {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "status": "success"}
user: warning: [2024-04-03T21:27:44.448118188Z]: [talos] assigned Hetzner Cloud alias IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "network_id": 4075410,
"status": "success"}
user: warning: [2024-04-03T21:27:44.450763188Z]: [talos] enabled shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:27:44.453683188Z]: [talos] assigned address {"component": "controller-runtime", "controller": "network.AddressSpecController", "address": "10.0.1.100/32", "link": "eth1"}
Operatorspecs:
❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
namespace: network
type: OperatorSpecs.net.talos.dev
id: dhcp4/eth0
version: 2
owner: network.OperatorMergeController
phase: running
created: 2024-04-03T21:17:23Z
updated: 2024-04-03T21:17:26Z
spec:
operator: dhcp4
linkName: eth0
requireUp: false
dhcp4:
routeMetric: 1024
layer: platform
---
node: 10.0.1.101
metadata:
namespace: network
type: OperatorSpecs.net.talos.dev
id: dhcp4/eth1
version: 1
owner: network.OperatorMergeController
phase: running
created: 2024-04-03T21:17:28Z
updated: 2024-04-03T21:17:28Z
spec:
operator: dhcp4
linkName: eth1
requireUp: true
dhcp4:
routeMetric: 1024
layer: configuration
---
node: 10.0.1.101
metadata:
namespace: network
type: OperatorSpecs.net.talos.dev
id: vip/eth1
version: 1
owner: network.OperatorMergeController
phase: running
created: 2024-04-03T21:27:42Z
updated: 2024-04-03T21:27:42Z
spec:
operator: vip
linkName: eth1
requireUp: true
vip:
ip: 10.0.1.100
gratuitousARP: false
hcloud:
deviceID: 639543
networkID: 4075410
apiToken: xxx
layer: configuration
Working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not hcloud experts, so we won't be able to resolve it without your (or someone else) help.
The issue is (I guess) failure to find NetworkID, but I have no idea why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as soon as I find some time, I'll take a closer look at it again. It could have been that you have a direct idea of what the problem could be. Many thanks for the response!
Is there anything that can be done or that we can talk about to make this pr mergeable? Are there any concerns or similar? |
I don't have much insight into how Hetzner works to comment more, maybe @sergelogvinov has better ideas? |
Anyways will wait for next week before merging, will trust you on that one. |
I noticed that while looking at siderolabs#8493, but I don't know if this problem actually happened in real life. If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2 ARP announce), we should not set the leader flag, as it would make the controller announce the IP, while it shouldn't do that. If this call fails, there's no matching call to de-announce on failure. The bug would show up as two nodes having same VIP assigned on the host. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
I noticed that while looking at siderolabs#8493, but I don't know if this problem actually happened in real life. If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2 ARP announce), we should not set the leader flag, as it would make the controller announce the IP, while it shouldn't do that. If this call fails, there's no matching call to de-announce on failure. The bug would show up as two nodes having same VIP assigned on the host. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
I took a few more hours to take a closer look at the problem. I definitely have a lead: The error is that by not setting the NetworkId, the operator thinks it is a As soon as I then restart a control plane (and it doesn't matter which one, whether set with AliasIP or not), the method So I suspect that an attempt is made to set the VIP or alias IP too early, causing the operator to enter an error loop and thus never call the Unfortunately, I don't have enough knowledge about the operator and how Talos works in general to be able to continue here. I am therefore dependent on help. The code in this PR apparently fixes the problem because it breaks the error loop. However, this is probably not a good fix. Hence the question: Should I open an issue to discuss further and record new insights? |
if you have a diff with the logs you added, and the logs, I'd be interested to look into that. Probably adding lots of logs on almost every line would be helpful. |
Pull Request
What? (description)
As described in #3599, assigning an alias IP in Hetzner Cloud at bootstrap of the cluster does not work.
This PR removes the throwing of the error that occurs if an alias IP is set but there is no network ID yet.
This caused the initialization to end up in an error loop.
Now the interface is initialized correctly and the alias IP is set correctly on the second run.
In addition, some logs are added that make it much easier to find out what is happening.
Why? (reasoning)
This is probably due to throwing the error and thus not initializing the interface completely.
Acceptance
Please use the following checklist:
make conformance
) -> GPG Identity failedmake fmt
)make lint
)make docs
)make unit-tests
)