Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: initial assignment of alias IP in hcloud (Hetzner) #8493

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mrclrchtr
Copy link

Pull Request

What? (description)

As described in #3599, assigning an alias IP in Hetzner Cloud at bootstrap of the cluster does not work.

This PR removes the throwing of the error that occurs if an alias IP is set but there is no network ID yet.
This caused the initialization to end up in an error loop.
Now the interface is initialized correctly and the alias IP is set correctly on the second run.

In addition, some logs are added that make it much easier to find out what is happening.

Why? (reasoning)

This is probably due to throwing the error and thus not initializing the interface completely.

Acceptance

Please use the following checklist:

  • you linked an issue (if applicable)
  • you included tests (if applicable)
  • you ran conformance (make conformance) -> GPG Identity failed
  • you formatted your code (make fmt)
  • you linted your code (make lint)
  • you generated documentation (make docs)
  • you ran unit-tests (make unit-tests)

@mrclrchtr mrclrchtr changed the title fix: fix initial assignment of alias IP in hcloud fix: initial assignment of alias IP in hcloud Mar 25, 2024
@mrclrchtr mrclrchtr changed the title fix: initial assignment of alias IP in hcloud fix: initial assignment of alias IP in hcloud (Hetzner) Mar 25, 2024
Removes the throwing of the error that occurs if an alias IP is set but there is no network ID yet. This caused the initialization to end up in an error loop. Now the interface is initialized correctly and the alias IP is set correctly on the second run.

Signed-off-by: Marcel Richter <mail@mrclrchtr.de>
mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Mar 25, 2024
Prep for Talos PR 8493 integration with commented-out config for enhanced networking.

Refs: siderolabs/talos#8493
mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Mar 25, 2024
Prep for Talos PR 8493 integration with commented-out config for enhanced networking.

Refs: siderolabs/talos#8493
mrclrchtr added a commit to hcloud-talos/terraform-hcloud-talos that referenced this pull request Mar 25, 2024
Prep for Talos PR 8493 integration with commented-out config for enhanced networking.

Refs: siderolabs/talos#8493
handler.logger.Error("error assigning Hetzner Cloud floating IP to server: floating IP is not found", zap.String("vip", handler.vip),
zap.Int64("device_id", handler.deviceID), zap.Int64("network_id", handler.networkID))

return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is not an error anymore?

Copy link
Author

@mrclrchtr mrclrchtr Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, That's the only way the bootstrap worked.

I'm not sure if there is a better way.

Edit:
And yes, it's a misconfiguration on a not available Floating IP, but I don't think that every misconfiguration leads to an error that prevents to complete the bootstrap, right? An error is still logged in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does it prevent the bootstrap?

If the Acquire fails, it should be retried (if not, it's a different bug), so once whatever thing on Hetzner side is resolved, it should acquire successfully. If we return nil without actually attaching an IP, this leads to a misconfiguration - Talos thinks it's ok, while it's not actually working.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I tried it again and I was wrong, sorry. Bootstrapping is successful.

But I still get the logspam on every control-plane-node, regardless of whether I assign an alias IP when creating the server or not:

user: warning: [2024-04-01T12:20:12.956235002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
 \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-01T12:20:14.026183002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
 \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-01T12:20:15.348476002Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning
 \"10.0.1.100\" to server 45293174: floating IP is not found", "link": "eth1", "ip": "10.0.1.100"}

So what I can say is that when the Alias VIP is "first" initialized in Talos, the networkID is 0 (i.e. not set). This means it skips the Alias ​​IP block and goes into the Floating IP block, where of course it cannot find the floating IP because the one passed is an alias IP.

Once it is initialized, the networkID is >0 (i.e. set) in subsequent runs and everything is working fine.

I don't know exactly where the AliasIP and NetworkID is set in the vip *VIP. It comes from the operator loop, but I don't know where it is started. Maybe that is where the problem is. Can you tell me where vip *VIP is set? Then I'll try to analyze it further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the network ID should be set automatically from Hetzner metadata:

spec.NetworkID = 0
for _, privnet := range server.PrivateNet {
network, _, err := client.Network.GetByID(ctx, privnet.Network.ID)
if err != nil {
return fmt.Errorf("error getting network info: %w", err)
}
if network.IPRange.Contains(vip.AsSlice()) {
spec.NetworkID = privnet.Network.ID
break
}
}

the operator spec is not re-created unless there's some change that would force it to be re-generated

talosctl get operatorspecs -o yaml to see what's there right now

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I tried it again:

the AliasIP is attached by me manually on server creation:
image

These are the operatorspecs:

❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth0
    version: 2
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:23Z
    updated: 2024-04-03T21:17:26Z
spec:
    operator: dhcp4
    linkName: eth0
    requireUp: false
    dhcp4:
        routeMetric: 1024
    layer: platform
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: dhcp4
    linkName: eth1
    requireUp: true
    dhcp4:
        routeMetric: 1024
    layer: configuration
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: vip/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: vip
    linkName: eth1
    requireUp: true
    vip:
        ip: 10.0.1.100
        gratuitousARP: false
        hcloud:
            deviceID: 639543
            networkID: 0
            apiToken: xxxx
    layer: configuration

I'll have another look at the code.

Copy link
Author

@mrclrchtr mrclrchtr Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just tested something else:
I have commented out the vip part in the machine config. The error messages then stopped. Then I commented it in again and the result is as follows:

user: warning: [2024-04-03T21:26:51.35453309Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:52.52451809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:53.64788609Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:54.70894809Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:26:56.44824509Z]: [talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "error assigning \"10.0.1.100\" to server 45398818: floating IP is not found",
 "link": "eth1", "ip": "10.0.1.100"}
user: warning: [2024-04-03T21:26:57.282386188Z]: [talos] apply config request: mode auto(no_reboot)
 user: warning: [2024-04-03T21:26:57.288244188Z]: [talos] node IP skipped, please use .machine.kubelet.nodeIP to provide explicit subnet for the node IP {"component": "controller-runtime", "controller": "k8s.NodeIPController", "address": "10.0.1.101"}
 user: warning: [2024-04-03T21:26:57.292704188Z]: [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 3972, container kubelet)
 user: warning: [2024-04-03T21:26:57.294289188Z]: [talos] removed address 10.0.1.100/32 from "eth1" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
 user: warning: [2024-04-03T21:26:57.319458188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.100:2380", "https://10.0.1.101:2380"], "member_id":
 15664038745852706440}
 user: warning: [2024-04-03T21:26:57.327769188Z]: [talos] updated etcd peer URLs {"component": "controller-runtime", "controller": "etcd.AdvertisedPeerController", "new_peer_urls": ["https://10.0.1.101:2380"], "member_id": 15664038745852706440}
 user: warning: [2024-04-03T21:26:57.409243188Z]: [talos] service[kubelet](Finished): Service finished successfully
 user: warning: [2024-04-03T21:26:57.410459188Z]: [talos] service[kubelet](Starting): Starting service
 user: warning: [2024-04-03T21:26:57.411070188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
 user: warning: [2024-04-03T21:26:57.413742188Z]: [talos] service[kubelet](Failed): Condition failed: context canceled
 user: warning: [2024-04-03T21:26:57.414781188Z]: [talos] service[kubelet](Starting): Starting service
 user: warning: [2024-04-03T21:26:57.415342188Z]: [talos] service[kubelet](Waiting): Waiting for service "cri" to be "up", time sync, network
 user: warning: [2024-04-03T21:26:57.416269188Z]: [talos] service[kubelet](Preparing): Running pre state
 user: warning: [2024-04-03T21:26:57.427521188Z]: [talos] service[kubelet](Preparing): Creating service runner
 user: warning: [2024-04-03T21:26:57.523997188Z]: [talos] service[kubelet](Running): Started task kubelet (PID 5374) for container kubelet
 user: warning: [2024-04-03T21:26:59.443244188Z]: [talos] service[kubelet](Running): Health check successful
 user: warning: [2024-04-03T21:27:07.308966188Z]: [talos] controller failed {"component": "controller-runtime", "controller": "kubeaccess.EndpointController", "error": "error getting endpoints: Get \"https://127.0.0.1:7445/api/v1/namespaces/default/endpoints/talos\":
 net/http: TLS handshake timeout"}
 user: warning: [2024-04-03T21:27:20.195789188Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dcontrol-
 plane-1&resourceVersion=3950&timeout=8m48s&timeoutSeconds=528&watch=true\": http2: client connection lost"}
 user: warning: [2024-04-03T21:27:42.453139188Z]: [talos] apply config request: mode auto(no_reboot)
 user: warning: [2024-04-03T21:27:43.711860188Z]: [talos] cleared previous Hetzner Cloud IP alias {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "status": "success"}
 user: warning: [2024-04-03T21:27:44.448118188Z]: [talos] assigned Hetzner Cloud alias IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "vip": "10.0.1.100", "device_id": 639543, "network_id": 4075410,
 "status": "success"}
 user: warning: [2024-04-03T21:27:44.450763188Z]: [talos] enabled shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "eth1", "ip": "10.0.1.100"}
 user: warning: [2024-04-03T21:27:44.453683188Z]: [talos] assigned address {"component": "controller-runtime", "controller": "network.AddressSpecController", "address": "10.0.1.100/32", "link": "eth1"}

Operatorspecs:

❯ talosctl get operatorspecs -o yaml -n 10.0.1.101
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth0
    version: 2
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:23Z
    updated: 2024-04-03T21:17:26Z
spec:
    operator: dhcp4
    linkName: eth0
    requireUp: false
    dhcp4:
        routeMetric: 1024
    layer: platform
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:17:28Z
    updated: 2024-04-03T21:17:28Z
spec:
    operator: dhcp4
    linkName: eth1
    requireUp: true
    dhcp4:
        routeMetric: 1024
    layer: configuration
---
node: 10.0.1.101
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: vip/eth1
    version: 1
    owner: network.OperatorMergeController
    phase: running
    created: 2024-04-03T21:27:42Z
    updated: 2024-04-03T21:27:42Z
spec:
    operator: vip
    linkName: eth1
    requireUp: true
    vip:
        ip: 10.0.1.100
        gratuitousARP: false
        hcloud:
            deviceID: 639543
            networkID: 4075410
            apiToken: xxx
    layer: configuration

Working as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not hcloud experts, so we won't be able to resolve it without your (or someone else) help.

The issue is (I guess) failure to find NetworkID, but I have no idea why.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as soon as I find some time, I'll take a closer look at it again. It could have been that you have a direct idea of what the problem could be. Many thanks for the response!

@mrclrchtr
Copy link
Author

Is there anything that can be done or that we can talk about to make this pr mergeable? Are there any concerns or similar?

@frezbo
Copy link
Member

frezbo commented Mar 30, 2024

Is there anything that can be done or that we can talk about to make this pr mergeable? Are there any concerns or similar?

I don't have much insight into how Hetzner works to comment more, maybe @sergelogvinov has better ideas?

@frezbo
Copy link
Member

frezbo commented Mar 30, 2024

Is there anything that can be done or that we can talk about to make this pr mergeable? Are there any concerns or similar?

I don't have much insight into how Hetzner works to comment more, maybe @sergelogvinov has better ideas?

Anyways will wait for next week before merging, will trust you on that one.

@smira smira mentioned this pull request Apr 3, 2024
smira added a commit to smira/talos that referenced this pull request Apr 3, 2024
I noticed that while looking at siderolabs#8493, but I don't know if this problem
actually happened in real life.

If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2
ARP announce), we should not set the leader flag, as it would make the
controller announce the IP, while it shouldn't do that.

If this call fails, there's no matching call to de-announce on failure.

The bug would show up as two nodes having same VIP assigned on the host.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/talos that referenced this pull request Apr 3, 2024
I noticed that while looking at siderolabs#8493, but I don't know if this problem
actually happened in real life.

If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2
ARP announce), we should not set the leader flag, as it would make the
controller announce the IP, while it shouldn't do that.

If this call fails, there's no matching call to de-announce on failure.

The bug would show up as two nodes having same VIP assigned on the host.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
@mrclrchtr
Copy link
Author

I took a few more hours to take a closer look at the problem.

I definitely have a lead:
I added some logs in the GetNetworkAndDeviceIDs method. The strange thing is that this method does not seem to be called during the first bootstrap. I could not see the logs. So the NetworkID is not set and the error loops start in all CP's

The error is that by not setting the NetworkId, the operator thinks it is a Public VIP and not Alias IP. Of course, it cannot find this in the API call.

As soon as I then restart a control plane (and it doesn't matter which one, whether set with AliasIP or not), the method GetNetworkAndDeviceIDs is called, the NetworkID is set correctly and all error log loops in all control planes stop.

So I suspect that an attempt is made to set the VIP or alias IP too early, causing the operator to enter an error loop and thus never call the GetNetworkAndDeviceIDs method.

Unfortunately, I don't have enough knowledge about the operator and how Talos works in general to be able to continue here. I am therefore dependent on help.

The code in this PR apparently fixes the problem because it breaks the error loop. However, this is probably not a good fix. Hence the question: Should I open an issue to discuss further and record new insights?

@smira
Copy link
Member

smira commented May 30, 2024

I added some logs in the GetNetworkAndDeviceIDs method. The strange thing is that this method does not seem to be called during the first bootstrap. I could not see the logs. So the NetworkID is not set and the error loops start in all CP's

if you have a diff with the logs you added, and the logs, I'd be interested to look into that.

Probably adding lots of logs on almost every line would be helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants