Use private network where possible in fully meshed network #364

Open
boedy opened this issue Aug 18, 2023 · 4 comments

Comments

@boedy

boedy commented Aug 18, 2023

Hello Kilo Community 👋,

I've recently delved into Kilo after being captivated by the CNCF talk. Our current setup is already a multi-cloud environment with K3S, WireGuard, and Netmaker. All nodes are interconnected in a full mesh that is configured outside of Kubernetes, and we're using Calico as our CNI.

Our infrastructure is distributed across Hetzner (Nuremberg and Falkenstein DCs), Digital Ocean, and several root servers in another data center. A significant portion of our workloads is hosted on Hetzner. With our existing full-mesh setup, all internode communication is routed through the public network. However, given the volume of traffic directed towards Hetzner nodes, I'm keen on optimizing our setup to leverage Hetzner's private network for internode communication, which would help us sidestep egress traffic costs.

Here's where I see Kilo potentially fitting in: While I appreciate the robustness of a full-mesh network, I'm looking for a solution where nodes, when capable, prioritize communication via the existing private network. Essentially, while maintaining the full mesh, I want nodes within the same availability zone (or with access to a shared private network) to default to the private network for communication.

From the Topology documentation, it seems Kilo can connect different cloud providers with a single link managed by two leaders. My concern here is about potential bottlenecks, especially during high-traffic periods, where the leader's NIC bandwidth would be a limiting factor for us. Hetzner also provides every node with 20 TB of free monthly egress traffic. With a single link, we would not be able to use the free allowance of the other nodes.

I'm thinking each peer's endpoint should be the node's private IP address when both nodes share the same subnet.

Would something like this be possible?

@squat
Owner

squat commented Aug 18, 2023

Hi @boedy, this configuration is not possible today but is absolutely something that could technically be implemented in Kilo. It would be a slight twist on the code path for the full mesh, where Kilo checks if other nodes are in the same topological location to determine which link to use. Currently, Kilo's implementation of the full mesh is just a simple take on the location mesh, where every node is treated like it's in its own location. All of this is to say that this is very feasible work IMO!
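The relationship between the two mesh modes described above can be sketched in Go. This is only an illustration under stated assumptions: `Granularity`, `Node`, `segmentKey`, and `groupNodes` are hypothetical names invented for this example and are not taken from Kilo's source.

```go
package main

// Granularity is a hypothetical stand-in for a mesh granularity setting.
type Granularity string

const (
	LocationGranularity Granularity = "location"
	FullGranularity     Granularity = "full"
)

// Node is a simplified, hypothetical node record.
type Node struct {
	Name     string
	Location string
}

// segmentKey returns the key used to group a node into a mesh segment.
// In the "full" case every node gets a unique key, so each node behaves
// as if it were alone in its own location.
func segmentKey(g Granularity, n Node) string {
	if g == FullGranularity {
		return "node:" + n.Name
	}
	return "location:" + n.Location
}

// groupNodes buckets nodes by their segment key.
func groupNodes(g Granularity, nodes []Node) map[string][]Node {
	groups := make(map[string][]Node)
	for _, n := range nodes {
		k := segmentKey(g, n)
		groups[k] = append(groups[k], n)
	}
	return groups
}
```

With three nodes, two of them in the same location, `groupNodes` yields two groups for the location granularity and three single-node groups for the full granularity. The requested feature would keep the three groups but still consult `Location` when picking a link between two nodes.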

Do you feel up to contributing? I'd be able to point you at the code paths that need attention. Otherwise, stay tuned and we can track my progress on the topic in this issue.

@boedy
Author

boedy commented Aug 18, 2023

Thanks for the prompt reply @squat! I'm definitely willing to give it a shot, but haven't really contributed to a k8s Golang project before. If it's just a few code paths that need modifying, I might be able to pull it off, though.

I could use some guidance in setting up a development environment. What has been your workflow whilst working on the project?

@boedy
Author

boedy commented Aug 18, 2023

I'm guessing the change could possibly affect this line:
https://github.com/squat/kilo/blob/37f4ea52dc54563b3ec8ff69b4d322022ef9ba12/pkg/mesh/topology.go#L186C4-L186C4

A basic check could detect when a node is trying to communicate with another node in the same topological location. In that case it should use the private IP as the WireGuard endpoint instead of the leader's public endpoint. This should only be relevant when the full granularity is used.
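A minimal Go sketch of the check described above, assuming a simplified peer record. The `Peer` type and the `newPeer` and `chooseEndpoint` functions are hypothetical and exist only for this illustration; they are not Kilo's types, and the linked line in topology.go may look quite different.

```go
package main

import "net/netip"

// Peer is a simplified, hypothetical view of a mesh node.
type Peer struct {
	Name           string
	Location       string
	PrivateIP      netip.Addr
	PublicEndpoint netip.AddrPort
}

// newPeer parses the textual addresses into a Peer.
func newPeer(name, location, privateIP, publicEndpoint string) (Peer, error) {
	ip, err := netip.ParseAddr(privateIP)
	if err != nil {
		return Peer{}, err
	}
	ep, err := netip.ParseAddrPort(publicEndpoint)
	if err != nil {
		return Peer{}, err
	}
	return Peer{Name: name, Location: location, PrivateIP: ip, PublicEndpoint: ep}, nil
}

// chooseEndpoint returns the WireGuard endpoint that local should use to
// reach remote. Only in full-mesh mode, and only when both peers share a
// non-empty location, does it prefer the remote's private IP (keeping the
// port of the public endpoint). Otherwise it falls back to the public one.
func chooseEndpoint(local, remote Peer, fullMesh bool) netip.AddrPort {
	sameLocation := local.Location != "" && local.Location == remote.Location
	if fullMesh && sameLocation && remote.PrivateIP.IsValid() {
		return netip.AddrPortFrom(remote.PrivateIP, remote.PublicEndpoint.Port())
	}
	return remote.PublicEndpoint
}
```

Comparing locations is only one possible criterion; a subnet comparison on the private IPs could be used in its place when nodes in one location do not all share a private network.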

@boedy
Author

boedy commented Aug 19, 2023

Whoo, I actually got it working! 🥳 Here's the comparison of my changes with the main branch.

As I mentioned in my previous message, I was looking for a way to determine if two IP addresses belong to the same subnet. During my tests, I initially hoped to determine the CIDR directly from the private IP address of the nodes. However, I discovered that the prefix length isn't always set precisely. For instance, on Hetzner, where my private network uses the subnet 10.10.0.0/16, the CIDR couldn't be derived directly on a node: the output of ip a was inet 10.10.0.4/32 brd 10.10.0.4 scope global dynamic enp7s0.

To overcome this challenge, I leveraged the kilo.squat.ai/force-internal-ip annotation, which allowed me to determine the appropriate subnet mask. This was also necessary for my DigitalOcean nodes, since the wrong network interface was identified for the private IP.
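The subnet problem can be reproduced with a short Go sketch using the standard `net/netip` package. The `sameSubnet` helper is hypothetical and is not part of the linked changes; it assumes each node's private address is available in CIDR form, for example from an annotation value such as 10.10.0.4/16.

```go
package main

import "net/netip"

// sameSubnet reports whether two CIDR-formatted addresses, such as
// "10.10.0.4/16", describe hosts inside the same network. Both prefixes
// are reduced to their network form and compared, so differing prefix
// lengths never match.
func sameSubnet(a, b string) (bool, error) {
	pa, err := netip.ParsePrefix(a)
	if err != nil {
		return false, err
	}
	pb, err := netip.ParsePrefix(b)
	if err != nil {
		return false, err
	}
	return pa.Masked() == pb.Masked(), nil
}
```

With the /32 values that the interface reports, `sameSubnet("10.10.0.4/32", "10.10.0.5/32")` is false because each address forms its own network. With the real mask supplied, `sameSubnet("10.10.0.4/16", "10.10.0.5/16")` is true, since both reduce to 10.10.0.0/16.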

Moving forward, I have a few questions regarding how we should handle this feature:

  1. Activation Mechanism: How should we enable this functionality? Via specific annotations, or maybe a configuration setting?
  2. Default Behavior: Should this feature be active by default or be an opt-in?
  3. Feedback and Suggestions: The changes for now are minimal. Are there any modifications or refinements you'd recommend based on the implementation?
