Per-interface sysctls #47639

robmry · 2024-03-27T12:27:11Z

Description

Background

In 26.0, we removed use of the OCI prestart hook that was used by the daemon to get a callback from the runtime, during container task creation, to SetKey.

With the hook in place, the initial set of network interfaces were created in the container's namespace before task creation. Once it was removed, interfaces were only moved into the network namespace after task creation (before the task was started).

The prestart hook runs before --sysctl options have been applied. We reverted that change, because it meant --sysctl for an interface-specific setting (like --sysctl net.ipv6.conf.eth0.accept_ra=2), prevented the container from starting because the interface did not exist when the runtime tried to apply the sysctl setting.

The immediate driver for the change was to be able to inspect the container to determine whether it had IPv6 enabled, after sysctls had been applied. Then, use that to set up /etc/hosts with or without IPv6 hosts - and, in follow-up PRs, to decide whether to allocate IPv6 addresses etc. But, eliminating the SetKey "reexec" was desirable in any case.

(Discussed in the networking maintainers sync call on 26th March - @corhere, @akerouanton, @cpuguy83.)

Objectives

Provide a more reliable way to set interface-specific sysctls, without having to rely on interface naming:
- eth0 is the name of the first interface that gets set up, but the order isn't predictable on restart for a container with multiple interfaces.
(Make it possible to set a sysctl on an interface created after the container, by a network connect.)
Enable removal of the SetKey prestart hook.

Proposal

Add a sysctl option to the "advanced syntax" of --network. For example:
- --network name=mynet,sysctl=ipv6.conf.forwarding=1,sysctl=ipv6.conf.accept_ra=2
- In the CLI, check the option's value has the form "k=v", pass it to the engine API's create endpoint as a []string in EndpointSettings.
  - Existing --sysctl options are passed as a map, but ordering might be important and should be preserved.
- Expand the setting's value, assuming net. and the interface's generated name...
  - For example:ipv6.conf.forwarding -> net.ipv6.conf.eth0.forwarding.
  - The idea is to:
    - Avoid potential issues with arbitrary non-endpoint-specific sysctls being set per-endpoint (for example, enabling or disabling IPv6 for the whole container using net.ipv6.conf.all.disable_ipv6 would be difficult to deal with).
    - Avoid the issue of interface naming. As a separate change we could offer a per-endpoint ifname option, but we'd also need to deal with conflicting names, and unspecified names.
  - But, it's not obvious and won't be anyone's first guess at the option's usage - if a full sysctl name is provided, we'll need the API to return an error message that clearly explains the problem... "you asked for 'net.ipv6.conf.eth0.forwarding', did you mean 'ipv6.conf.forwarding'?"
    - Is this enough reason to use the full sysctl name? Could generate an error if it's not net.something, and if the interface name is all or default. But, the real interface name is more tricky - could either ignore the given name as a placeholder and use the generated name (but that might be surprising), or require the interface name to be IFNAME or similar (but then, back to the problem that the option's usage isn't clear)?
- Apply these sysctls after moving the interface into the container's namespace, before bringing it up.
Once we re-remove the SetKey prestart hook, try to avoid breaking existing commands...
- In the API handler, migrate net.*.*.eth0 settings from --sysctl to the new field in EndpointSettings, for the first --network.
- Deprecate the use of per-endpoint settings in --sysctl, add a warning to the create response, disable the migration in some future API version.
Add a --sysctl option to docker network connect.
- Same shortened format as above, ipv6.conf.accept_ra=2.

Matching changes will be needed in 'compose'. (I don't think 'build' has provision for sysctl settings.)

Other notes ...

Interfaces are moved into the network namespace by libnetwork, not by its network drivers. We may want to give network drivers a veto on specific sysctl settings that would break their operation. But, there's no immediate requirement.
Existing network drivers only create a single interface, if a future driver creates more than one we'll need to extend the syntax of the new sysctl options.

The text was updated successfully, but these errors were encountered:

thaJeztah · 2024-03-27T13:27:57Z

Avoid potential issues with arbitrary non-endpoint-specific sysctls being set per-endpoint (for example, enabling or disabling IPv6 for the whole container using net.ipv6.conf.all.disable_ipv6 would be difficult to deal with).

This is the part where things may become fuzzy. ISTR we either have a hard-coded list of sysctls that were known to be namespaced (net.xxx being one of them), but this was already a blurry line (as some of them depended on kernel version), so not 100% sure if we indeed codified that constrained, or only documented it. I don't have a good answer here though; ideally we wouldn't have to make assumptions (things may change!), but at the same time there could be value in providing useful feedback to the user (this may not result in what you expect it to result in!).

robmry · 2024-03-27T15:35:14Z

This is the part where things may become fuzzy. ISTR we either have a hard-coded list of sysctls that were known to be namespaced (net.xxx being one of them),

Ah - yes ... https://docs.docker.com/reference/cli/docker/container/run/#sysctl - so, I guess we're safe in terms of namespacing if we limit the new options to net..

Beyond that, and checking the setting is interface-specific - at some later point we may decide to error on settings that definitely break things for a particular driver ... but that's about making the footgun hard to aim, we probably can't remove it completely because we won't know the user's intent.

If we find we need to allow access to more than just net.*.*.<ifname>, we can do that in future - as @corhere pointed out, that's much easier than trying to add restrictions once we've allowed more access than we should.

So, I'd probably leave it at that - rather than try to build in setting-specific warnings which, as you say, will be a moving target?

akerouanton · 2024-03-28T10:42:42Z

Existing network drivers only create a single interface, if a future driver creates more than one we'll need to extend the syntax of the new sysctl options.

I think the best UX for such hypothetical driver would be a mechanism that allows to name each iface independently. Then, sysctls would have to be specified with the name component set.

If we want to make sure we can support this in the future, then I think it's best to: 1. keep the name component in the sysctl (no discoverability issue) ; 2. require users to name their interfaces when they want to use net. sysctls. Since that ifname parameter would be new, we can easily forbid all and default to be used. We might need to forbid ethX too. (And I'm even wondering whether this new ifname could be used to make ordering more predictable and container-local.)

One downside with this, we're moving the "multi-iface" problem from the sysctl parameter to the ifname parameter. For such hypothetical driver, I think ifname would be labeled and those labels would be opaque to everything but the driver (eg. --network=name=testnet,driver=hypothetical,ifname=ns=...,ifname=ew=...).

thaJeztah · 2024-03-28T12:05:17Z

So, I'd probably leave it at that - rather than try to build in setting-specific warnings which, as you say, will be a moving target?

Yes, it's the "moving target" that may be problematic. If these are things we can say for sure (guaranteed) to be non-functional, we could error, but if not, maybe a warning (at best?). And even there, there's a bit of a risk of it becoming a sliding-slope if it means we need to continue codifying too much knowledge about Linux and sysctls (based on assumptions we can currently make).

That said; providing a useful warning to the user also has a place so "win some", "loose some" I guess.

robmry · 2024-03-28T15:17:37Z

I think the best UX for such hypothetical driver would be a mechanism that allows to name each iface independently. Then, sysctls would have to be specified with the name component set.

Being able to name interfaces in the container might be a good feature - but it'd be nice to keep it separate from this change. Tying the two together for the sake of hypothetical driver might be wasted effort.

But, it's worth knowing we're not backing ourselves into a corner ... so, we'd need a way to refer to the hypothetical driver's interface roles. If the driver has interfaces with roles "ns" and "ew" (which could be instantiated as "eth0"/"eth1", "eth1"/"eth0", "ns0"/"ew0", or anything else if we add an 'ifname' option) - something like this would do the trick, and we'd be able extend the syntax to accommodate it later, if needed? ...
--network name=mynet,sysctl=ns:ipv6.conf.accept_ra=2,sysctl=ew:ipv6.conf.accept_ra=0

robmry · 2024-03-28T15:23:10Z

If these are things we can say for sure (guaranteed) to be non-functional, we could error, but if not, maybe a warning (at best?). And even there, there's a bit of a risk of it becoming a sliding-slope if it means we need to continue codifying too much knowledge about Linux and sysctls (based on assumptions we can currently make).

Indeed - as a starting point, I suggest we just steer clear. If we find people are getting themselves into trouble with any particular options, or there's a driver that will always break if a setting is modified but people modify it anyway ... we can add an error/warning at that point.

The sysctls will be applied by libnetwork when it moves the interface into the network namespace but, if we find we need a way for drivers to object to settings - we can pass the list of sysctls to the driver (as a generic param / label or whatever it's called) ... for the driver to inspect if it's interested, and raise an error if there's a problem.

robmry · 2024-03-29T17:31:35Z

Maybe netifsysctl would be a better name than sysctl for the new --network option? (Or similar, maybe netifctl to cut down on the typing?!)

But - something that describes its function a bit more accurately, makes it less surprising that the sysctl name is mangled (and to give us a bit of room for manoeuvre if we ever think of a reason to allow more than 'net....*', unlikely as that seems).

robmry added status/0-triage kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny labels Mar 27, 2024

robmry self-assigned this Mar 27, 2024

thaJeztah added the area/networking label Mar 27, 2024

robmry linked a pull request Apr 5, 2024 that will close this issue

Per-interface sysctls #47686

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Per-interface sysctls #47639

Per-interface sysctls #47639

robmry commented Mar 27, 2024 •

edited

thaJeztah commented Mar 27, 2024

robmry commented Mar 27, 2024

akerouanton commented Mar 28, 2024 •

edited

thaJeztah commented Mar 28, 2024

robmry commented Mar 28, 2024

robmry commented Mar 28, 2024

robmry commented Mar 29, 2024

Per-interface sysctls #47639

Per-interface sysctls #47639

Comments

robmry commented Mar 27, 2024 • edited

Description

thaJeztah commented Mar 27, 2024

robmry commented Mar 27, 2024

akerouanton commented Mar 28, 2024 • edited

thaJeztah commented Mar 28, 2024

robmry commented Mar 28, 2024

robmry commented Mar 28, 2024

robmry commented Mar 29, 2024

robmry commented Mar 27, 2024 •

edited

akerouanton commented Mar 28, 2024 •

edited