
libnet/ipam: Lazily sub-divide pools into subnets #43033

Closed
wants to merge 1 commit

Conversation

akerouanton
Member

@akerouanton akerouanton commented Nov 19, 2021

I bundled these two commits together as the second one is easier to test with the changes done in the first one. I can still break this PR into two parts and rewrite the unit test of the second commit to test it against the current master branch. Let me know if you prefer that (e.g. if you want to cherry-pick this change into older releases, or because it's a big diff to review).

- 1st commit

This commit resolves #40275 by implementing a custom iterator
named NetworkSplitter. It splits a set of NetworkToSplit into smaller
subnets on-demand by calling its Get method.

Prior to this change, the list of NetworkToSplit was split into smaller
subnets when ConfigLocalScopeDefaultNetworks or
ConfigGlobalScopeDefaultNetworks were called, or when the ipamutils
package was loaded. When one of these Config functions was called with an
IPv6 network to split into small subnets, all the available memory was
consumed. For instance, fd00::/8 split into /96 would take ~5*10^27 bytes.
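
(For scale: fd00::/8 split into /96 gives 2^(96-8) = 2^88 ≈ 3.1 × 10^26 subnets; assuming very roughly 16 bytes per stored subnet, that is on the order of 5 × 10^27 bytes.)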

Although this change trades memory consumption for computation cost, the
NetworkSplitter is used by the libnetwork/ipam package in such a way that
it only has to compute the address of the next subnet. When the
NetworkSplitter reaches the end of its NetworkToSplit, it's reset by
libnetwork/ipam only if some subnets were released beforehand. In that
case, the ipam package might iterate over all the subnets before finding
an available one. This is the worst case, but it shouldn't really be
impactful: either many subnets exist (more than one host can handle), or
not so many subnets exist and a full iteration over the NetworkSplitter
is fast.

- 2nd commit

Prior to this change, IPv6 subnets generation was broken because of a
bitwise shift overflowing uint64.

To solve this issue, addIntToIP has been changed to take an addend and
compute two ordinals, applied respectively to the lower 64 bits and the
upper 64 bits of IPv6 addresses.

Nothing changes for IPv4 (i.e. upperOrdinal is always 0).
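
A minimal sketch of the carry handling described above, using only the standard library (addToIP is a hypothetical name for illustration, not the PR's actual addIntToIP):

package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// addToIP returns addr advanced by x, treating the 16-byte address as a 128-bit
// integer split into an upper and a lower uint64.
func addToIP(addr netip.Addr, x uint64) netip.Addr {
	b := addr.As16()
	hi := binary.BigEndian.Uint64(b[:8])
	lo := binary.BigEndian.Uint64(b[8:])
	sum := lo + x
	if sum < lo { // the lower 64 bits overflowed: carry into the upper 64 bits
		hi++
	}
	binary.BigEndian.PutUint64(b[:8], hi)
	binary.BigEndian.PutUint64(b[8:], sum)
	return netip.AddrFrom16(b)
}

func main() {
	// Crossing the 64-bit boundary no longer truncates the result.
	fmt.Println(addToIP(netip.MustParseAddr("fd00::ffff:ffff:ffff:ffff"), 1)) // fd00:0:0:1::
}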

- How to verify it

I added unit tests to confirm these changes are okay. Moreover, I tried to run dockerd with the following config:

{
	"log-level": "info",
	"ipv6": true,
	"experimental": true,
	"ip6tables": true,
	"fixed-cidr-v6": "2a01:e34:xxxx:yyyy::/64",
	"default-address-pools": [
		{ "base": "172.17.0.0/16", "size": 16 },
		{ "base": "172.18.0.0/16", "size": 16 },
		{ "base": "2a01:e34:xxxx:yyyy::/64", "size": 120 }
	]
}

When starting dockerd v20.10.10 with this config, the daemon crashes with this panic message: runtime error: makeslice: cap out of range. When starting a patched version of dockerd, it doesn't crash.

- Description for the changelog

@akerouanton akerouanton changed the title to Generate split subnets on-demand and properly shift bytes of IPv6 subnets Nov 19, 2021
@Xyaren

Xyaren commented Nov 25, 2021

Hey, awesome that somebody is giving ipv6 some love 👍
Any chance this is fixing #41438 as well?

@akerouanton
Member Author

@Xyaren I still need a bit of time to test it properly but yeah, it should fix #41438 as (I believe) this is a consequence of the uint64 overflow 😉 Thanks for mentioning this issue 🙂

@akerouanton akerouanton marked this pull request as ready for review November 29, 2021 00:33
@akerouanton
Member Author

@thaJeztah PTAL 🙂 (I don't know who else I should ping. Maybe arkodg?)

@Xyaren

Xyaren commented Apr 29, 2022

So... about 5 months later...
Is there anyone able to look at this, please?

@thaJeztah thaJeztah added this to the 22.06.0 milestone May 31, 2022
@akerouanton akerouanton force-pushed the fix-40275-42801 branch 2 times, most recently from a453813 to 2812533 Compare June 3, 2022 18:13
@thaJeztah
Member

@akerouanton looks like some failures in a test (may need some extra options in it);

=== Failed
=== FAIL: github.com/docker/docker/daemon/config TestDaemonConfigurationMergeDefaultAddressPools/empty_config_file (0.00s)
    config_test.go:155: assertion failed: cannot handle unexported field at {[]*ipamutils.NetworkToSplit}[0].ipVersion:
        	"github.com/docker/docker/libnetwork/ipamutils".NetworkToSplit
        consider using a custom Comparer; if you control the implementation of type, you can also consider using an Exporter, AllowUnexported, or cmpopts.IgnoreUnexported
    --- FAIL: TestDaemonConfigurationMergeDefaultAddressPools/empty_config_file (0.00s)
=== FAIL: github.com/docker/docker/daemon/config TestDaemonConfigurationMergeDefaultAddressPools/config_file (0.00s)
    config_test.go:165: assertion failed: cannot handle unexported field at {[]*ipamutils.NetworkToSplit}[0].ipVersion:
        	"github.com/docker/docker/libnetwork/ipamutils".NetworkToSplit
        consider using a custom Comparer; if you control the implementation of type, you can also consider using an Exporter, AllowUnexported, or cmpopts.IgnoreUnexported
    --- FAIL: TestDaemonConfigurationMergeDefaultAddressPools/config_file (0.00s)

@akerouanton
Member Author

@thaJeztah Ah, looks like a bad rebase. Should be fixed now.

@thaJeztah thaJeztah modified the milestones: 22.06.0, v-next Aug 18, 2022
@max-wittig

max-wittig commented Sep 7, 2022

@thaJeztah Any chance to have another look at this? Would be really nice to have that fixed!

@max-wittig

@thaJeztah @cpuguy83 Friendly ping 😄. We've added IPv6 networking support to the gitlab-runner, but we have no way to use it, as IPv6 networking in Docker is still broken when used in this way.

Is there any way I can support the efforts here?

@max-wittig

@akerouanton Would you mind implementing these small requested changes? It would really help us to use Docker with IPv6 🥺

@akerouanton akerouanton force-pushed the fix-40275-42801 branch 2 times, most recently from 9dce7c6 to 1fe1d92 Compare December 23, 2022 13:24
@thaJeztah
Member

@akerouanton Looks like the linter caught a typo;

libnetwork/ipamutils/utils.go:160:19: `primarly` is a misspelling of `primarily` (misspell)
// This method is primarly used to duplicate NetworkSplitter in tests.
                  ^

And there's one test failure (haven't looked whether it's a legit failure, or a test that needs updating);

=== FAIL: github.com/docker/docker/libnetwork/ipamutils TestInitPoolWithIPv6 (0.00s)
    utils_test.go:154: assertion failed: <nil> (string) != fd00:0:ffff:ffff:ffff:ffff::/96 (string)
    utils_test.go:158: assertion failed: <nil> (string) != fd00:0:ffff:ffff:ffff:ffff::/96 (string)

@akerouanton
Member Author

@thaJeztah The failing test looks legit but I didn't have time to debug it. I'll fix that later and ping you when it's done 😉

@akerouanton akerouanton force-pushed the fix-40275-42801 branch 2 times, most recently from 37c7895 to a2f5fed Compare December 27, 2022 12:46
@akerouanton
Member Author

@thaJeztah I'm not sure why the integration-cli job is failing. I can't reproduce this error on my computer. Could you take a look, please?

@thaJeztah
Member

Hm... not sure; would have to check why it fails. I can give it a kick (in case there's some flakiness in that test, or (e.g.) some other daemon was still running).

Copied the error below (in case we need to revisit)

=== Failed
=== FAIL: amd64.integration-cli TestDockerNetworkSuite/TestDockerNetworkIPAMMultipleNetworks (15.30s)
    docker_cli_network_unix_test.go:588: assertion failed: 
        Command:  /usr/local/cli/docker network create test3
        ExitCode: 1
        Error:    exit status 1
        Stdout:   
        Stderr:   Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network
        
        
        Failures:
        ExitCode was 1 expected 0
        Expected no error
    docker_cli_network_unix_test.go:46: [d17053f569bde] daemon is not started
    --- FAIL: TestDockerNetworkSuite/TestDockerNetworkIPAMMultipleNetworks (15.30s)

=== FAIL: amd64.integration-cli TestDockerNetworkSuite (769.49s)

Contributor

@corhere corhere left a comment

I am in the process of cleaning up the IPAM packages (#44827 + WIP) to get rid of global mutable state and dead code. (Fun fact: ipam.NewAllocator is only ever called with nil datastores! All the datastore-related code can be thrown out.) Let's put this PR on hold until it can be rebased on top of the IPAM cleanup & refactor work, as that will open up a lot more flexibility to implement on-demand subnet splitting more cleanly.

I have spotted a much less invasive short-term mitigation for the IPv6 memory exhaustion issue which would be feasible to backport to 23.0: set a cap on the number of networks ipamutils.splitNetworks will return. 2^24 networks per address space should be well beyond what could be allocated without running into other limitations, and small enough to not exhaust all memory.
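
For illustration, a rough sketch of what such a cap could look like (the names, the nextBlock helper, and the placement are assumptions, not the actual ipamutils.splitNetworks code):

package subnets

import "net/netip"

// maxSubnetsPerPool is the proposed cap (2^24) on eagerly-enumerated subnets per pool.
const maxSubnetsPerPool = 1 << 24

// splitCapped enumerates at most maxSubnetsPerPool subnets of the given size from base.
func splitCapped(base netip.Prefix, size int) []netip.Prefix {
	var out []netip.Prefix
	addr := base.Masked().Addr()
	for len(out) < maxSubnetsPerPool && base.Contains(addr) {
		out = append(out, netip.PrefixFrom(addr, size))
		next, ok := nextBlock(addr, size)
		if !ok {
			break
		}
		addr = next
	}
	return out
}

// nextBlock returns the first address of the /size block following the one containing
// addr, by adding one at the last prefix bit and propagating the carry byte by byte.
// ok is false if the addition wraps around the top of the address space.
func nextBlock(addr netip.Addr, size int) (netip.Addr, bool) {
	b := addr.As16()
	// 0-indexed position of the last prefix bit in the 16-byte form;
	// IPv4 addresses occupy the last 4 bytes of the mapped form.
	pos := size - 1 + 128 - addr.BitLen()
	inc := byte(1) << (7 - pos%8)
	for i := pos / 8; ; i-- {
		old := b[i]
		b[i] = old + inc
		if b[i] > old { // no carry out of this byte: done
			break
		}
		if i == 0 {
			return netip.Addr{}, false
		}
		inc = 1 // propagate the carry into the next more-significant byte
	}
	return netip.AddrFrom16(b).Unmap(), true
}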

Comment on lines 31 to 34
// resetCollections indicates whether the predefined NetworkSplitter can be reset when the end of these
// collections is reached. This happens when the whole collection has been consumed but some of its subnets were
// allocated and then released.
resetCollections map[string]map[ipamutils.IPVersion]bool
Contributor

The need for resetCollections makes me think that NetworkSplitter as currently designed is not the right abstraction for allocating subnets. Interestingly, Allocator uses a pretty memory-efficient data structure (bitseq.Handle) for keeping track of address allocations within a network; keeping track of allocations within a contiguous range of networks is such a similar problem that the same strategy could work. And as luck would have it, I've been working on refactoring and documenting that data type! #44864

@TobTobXX

Is this PR still active? It's still causing massive headaches in #42801 and #41438. And it seems like it only requires rebasing?

@akerouanton akerouanton changed the title Generate split subnets on-demand and properly shift bytes of IPv6 subnets libnet/ipam: Lazily sub-divide pools into subnets Apr 11, 2023
predefined: pdf,
subnets: map[netip.Prefix]*PoolData{},
predefined: map[ipamutils.IPVersion]*ipamutils.Subnetter{
ipamutils.IPv4: predefined.V4(),
Member Author

I'm not super happy with that: an address space with mixed address families feels weird. On the other hand ReleasePool currently relies on the pool's string identifier to determine what address space is involved.

@corhere @thaJeztah I can make a follow-up PR to make addrSpace handle a single address family if you think it'd be better.

Contributor

I agree, an address space with mixed address families does feel weird. I like the idea of having four addrSpace structs for each of (local, global) x (IPv4, IPv6). Especially since it could be coded such that no IPv4/IPv6 enumeration is needed.

)

// IPVersion represents the type of IP address.
type IPVersion int
Member Author

@corhere @thaJeztah Is it okay if I introduce this new IPVersion type? Otherwise I can remove it and work on unifying all IPVersion types used across libnetwork.

Contributor

I think the code can be structured such that no IPVersion constants are needed.

A new Subnetter structure is added to lazily sub-divide an address pool
into subnets. This fixes moby#40275.

Prior to this change, the list of NetworkToSplit was eagerly split into
smaller subnets when the ipamutils package was loaded, when
ConfigGlobalScopeDefaultNetworks was called, or when the function
SetDefaultIPAddressPool from the default IPAM driver was called. In the
latter case, if the list of NetworkToSplit contained an IPv6 prefix,
eagerly enumerating all subnets could eat all the available memory. For
instance, fd00::/8 split into /96 would take ~5*10^27 bytes.

Although this change trades memory consumption for computation cost, the
Subnetter is used by the libnetwork/ipam package in such a way that it
only has to compute the address of the next subnet. When the Subnetter
reaches the end of its NetworkToSplit, it's reset by libnetwork/ipam only
if some subnets were released beforehand. In that case, the ipam package
might iterate over all the subnets before finding an available one.

Also, the Subnetter leverages the newly introduced ipbits package, which
handles IPv6 addresses correctly. Before this commit, a bitwise shift
was overflowing uint64 and thus only a single subnet could be enumerated
from an IPv6 prefix. This fixes moby#42801.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Contributor

@corhere corhere left a comment

What's with all the "follow-up PR" talk? Is there some time pressure to get large pools fixed ASAP?

// Subnetter is an iterator lazily enumerating subnets from its NetworkToSplit. IPv6
// subnets can't be enumerated eagerly, otherwise they might end up taking way too much
// memory (eg. fd00::/8 split into /96). See https://github.com/moby/moby/issues/40275.
// Subnetter is not safe for concurrent use.

var ErrNoMoreSubnet = errors.New("no more subnet available")

// NextSubnet returns a new subnet from its NetworkToSplit, or an error if all subnets have already been enumerated.
func (s *Subnetter) NextSubnet() (netip.Prefix, error) {
Contributor

Suggested change
func (s *Subnetter) NextSubnet() (netip.Prefix, error) {
func (s *Subnetter) NextSubnet() (subnet netip.Prefix, ok bool) {

Comment on lines +43 to +45
// reset indicates whether the predefined subnetters can be reset when all subnets
// have been enumerated but some have been released.
reset map[ipamutils.IPVersion]bool
Contributor

I still don't like this whole concept of holding the mutable iterator state with the collection being iterated over, and having this "can be reset" flag to paper over the issues stemming from that design. For one, there is no clean way to iterate over a permutation of the collection without repeating values. A concrete toy example might better illustrate what I'm trying to say.

Consider the following iterator, where ^ signifies the current position.

[ 1 2 3 4 5 ]
    ^

With an API similar to Subnetter, attempting to iterate over the whole collection starting from the current position would yield the sequence 2 3 4 5 1 2 3 4 5. If one could assume the values in the collection are all unique, the iterating code could remember that 2 was the first value iterated over and stop once it iterates over 2 again. But that would leave the current position incremented one past where iteration "stopped," a fencepost error.

I think a better design would be to separate the iterator from the collection. Make the collection immutable and the iterator ephemeral, yielding exactly one permutation of the collection. Here's a sketch of what that could look like:

// Cursor returns a cursor positioned at the start of the pool.
func (SubnetPool) Cursor() SubnetCursor

// Advance returns a cursor moved forward by one position.
func (SubnetCursor) Advance() SubnetCursor
// Iter returns a new iterator over a permutation of the pool starting from the cursor position.
func (SubnetCursor) Iter() *SubnetIter

// Next advances the iterator cursor by one and returns the subnet at that position.
// If advancing the cursor would position it at a subnet
// which has already been returned by a prior call to Next,
// the cursor is not advanced and ok=false is returned.
func (*SubnetIter) Next() (v netip.Prefix, ok bool)
// Cursor returns a cursor for the current position.
func (SubnetIter) Cursor() SubnetCursor

The idea is to use the cursor in the same way as predefinedStartIndex: remember the last place iteration terminated so the next round of iteration knows where to start and stop.
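
For what it's worth, a rough usage sketch of that API from the allocator's side might look like the following (isFree, savedCursor, errNoAvail, and the enclosing function are assumptions, not existing code):

it := savedCursor.Iter()                  // resume where the previous scan stopped
for subnet, ok := it.Next(); ok; subnet, ok = it.Next() {
	if isFree(subnet) {               // availability check, assumed
		savedCursor = it.Cursor() // the next scan starts (and stops) here
		return subnet, nil
	}
}
return netip.Prefix{}, errNoAvail         // hypothetical sentinel: a full permutation was scanned, nothing free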


@akerouanton
Member Author

So, after some more tests, it turns out this PR makes it clear that the way we check for overlapping subnets is quite naive (it works well for IPv4, but not for IPv6), and/or that the current Subnetter is the wrong abstraction.

For instance, if a user specifies a default address pool with the base CIDR 10.0.0.0/8 and a subnet size of 30 in their daemon.json, and sets the bip config param to 10.0.0.1/8, then when the user wants to create a new custom network the daemon has to iterate over all 2^(30-8) subnets (~4 million) before returning an error. This takes ~0.5s (on my computer) and the issue is not really noticeable. Unfortunately, with IPv6 the worst case is so much bigger that it'd become unusable: with a /64 address pool split into /120 subnets, the current implementation has to iterate over 2^56 possible subnets (~7 * 10^16) before returning an error.

$ cat /etc/docker/daemon.json
{
	"log-level": "info",
	"bip": "10.0.0.1/8",
	"fixed-cidr": "10.0.0.0/8",
	"default-address-pools": [
		{ "base": "10.0.0.0/8", "size": 30 }
	]
}
$ docker network create testnet1

Now, another case where that same bug happens. Two default address pools are defined: the first one yielding a single subnet and matching the bip/fixed-cidr params, and another, arbitrary, address pool (e.g. 10.0.0.0/8 with size 24). Now if the user creates a network with a subnet matching the 2nd address pool's CIDR, and then tries to create a 3rd network with no subnet specified, all subnets from the 2nd address pool are enumerated before (*addrSpace).allocatePredefinedPool() returns an error.

$ cat /etc/docker/daemon.json
{
	"bip": "172.19.0.1/24",
	"fixed-cidr": "172.19.0.0/24",
	"default-address-pools": [
		{ "base": "172.19.0.0/24", "size": 24 },
		{ "base": "10.0.0.0/8", "size": 24 }
	]
}
$ docker network create --subnet=10.0.0.0/8 testnet1
$ docker network create testnet2

Since addrSpace is locked during the "sometimes almost endless" loop (in (*addrSpace).allocatePredefinedPool()), the network controller can neither request nor release addresses and subnets in the meantime.

I can only come up with these two options to fix this issue:

  • Disallow bip/fixed-cidr to overlap with one of the default address pools, and disallow custom networks with a user-defined subnet to collide with one of the default address pools. We might break things if we do so, so probably not a good idea.
  • Make the Subnetter aware of what subnets are currently in use.
    • Unfortunately, just moving its internal offset to ignore custom subnets won't be enough since the whole Subnetter might be reset (and it's currently always reset at startup because of the way the default bridge is restored). Once reset, we're back to square one.
    • The other option is to merge Subnetter into addrSpace, to make it able to yield new subnets based on the configured default address pools and user-supplied custom subnets. This poses the question of what data structure to use for that purpose. I can't think of anything other than LC/LPC/radix tries, but I'm wondering if a simpler data structure could help solve this issue.

@polarathene
Contributor

polarathene commented Apr 17, 2023

Earlier response

For instance, if a user specifies a default address pool with the base CIDR 10.0.0.0/8 and a subnet size of 30 in their daemon.json, and sets the bip config param to 10.0.0.1/8, then when the user wants to create a new custom network the daemon has to iterate over all 2^(30-8) subnets (~4 million) before returning an error.

Realistically that many subnets are not going to be used. The current defaults AFAIK have a much smaller number of subnets available in the pool? (IIRC, this is poorly documented though)

The issue here though is:

  • Address pool with /8 base to be subdivided into /30 subnets.
  • Any other network setting an explicit subnet that overlaps (the /8).

Only one of those configurations is valid to support at a time, and could be detected when starting the daemon?

  • If the /8 is already reserved for the docker0 bridge, then the address pool base of /8 or smaller would be invalid before considering size.
  • If address pools come first, you've already reserved the entire /8 subnet, again size can be ignored since it's not relevant, assigning a /8 for docker0 should fail.

Same logic applies for IPv6.


Now if the user creates a network with a subnet matching the 2nd address pool's CIDR, and then tries to create a 3rd network with no subnet specified, all subnets from the 2nd address pool are enumerated before (*addrSpace).allocatePredefinedPool() returns an error.

This is a bit more interesting. But same concern: it's not that pragmatic to iterate through unallocated pools like that when, realistically, the pools already convey reservation of the base that is split into subnets of size.

This check is only important when creating a new network, correct? Networks that are created persist their config with the assigned subnet, and, pragmatically, how large does the number of assigned subnets get? It isn't likely to be that expensive to track with a hashmap, is it?
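
As a rough sketch of that idea with Go's standard library (names here are assumptions, and each check is a linear scan over the allocated subnets rather than a trie):

package alloctrack

import "net/netip"

// allocated tracks the subnets currently in use.
type allocated struct {
	subnets []netip.Prefix
}

// reserve records p unless it overlaps a subnet that is already in use.
func (a *allocated) reserve(p netip.Prefix) bool {
	for _, s := range a.subnets {
		if s.Overlaps(p) {
			return false // p collides with an existing allocation
		}
	}
	a.subnets = append(a.subnets, p)
	return true
}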

EDIT: I might have misunderstood, is this about:

  • docker network create --subnet 10.0.0.0/8 (not 10.0.0.0/24?)
  • docker network create (implicitly tries to assign a 10.0.0.0/24?)

Doesn't really change my feedback though either way.


Disallow bip/fixed-cidr to overlap with one of the default address pools, and disallow custom networks with a user defined subnet to collide with one of the default address pools.
We might break things if we do so, so probably not a good idea.

Why is it not viable to track what subnets are actually in use? The pool is just defining blocks of address space that can be assigned a subnet of size when one is not explicitly requested.

The other option is to merge Subnetter into addrSpace, to make it able to yield new subnets based on default address pools configured and user-supplied custom subnets.
This poses the question of what data structure to use for that purpose.
I can't think of anything else than LC/LPC/radix tries, but I'm wondering if a simpler data structure could help solve this issue.

Seems we're on the same page 😀 👍

Kubernetes uses bitmaps apparently. Roaring compressed bitmaps were adopted as an improvement. There is some recent related work on k8s, so I think they're still preferring that.

There's also a nice article on investigating a few data structures for working with IPs (Apnic), although I'm not sure how well any handle range queries to detect overlapping subnets 😅


I put something together with some Rust code; the data structure isn't simpler, but I deferred that to existing libraries.

impl SubnetTree for SubnetManager<IPv6> {
    fn add(self: &mut Self, cidr_block: &str) {
      let subnet = Self::subnet_from_range(cidr_block);

      if let Some(overlapped) = self.tree.search(subnet).next() {
        println!("{} overlaps IP range with subnet {}", cidr_block, overlapped.data);
      } else {
        self.tree.insert(subnet, cidr_block.to_string());
        println!("Added subnet {}", cidr_block);
      }
    }
  }

Main logic is mostly that, and can be used like this:

fn main() {
  let mut subnets = SubnetManager::<IPv6>::new();
  let existing = ["fd00:feed:face:cafe::/64", "fd00:feed:face:beef::/64"];

  // These add without issue:
  for cidr_block in existing {
    subnets.add(cidr_block);
  }

  // These will fail due to overlap:
  subnets.add("fd00:feed:face:beef::/48");
  subnets.add("fd00:feed:face::/48");
}

Works with IPv4 as well, although I just used two different trees for that. Each tree's dimensions are set by the number of octets (4 for IPv4, 16 for IPv6), which AFAIK simplifies the range queries (I've not done any measurements).


UPDATE Nov 2023: Looks like a PR similar to the overlap check in my above example was implemented here: #46755

@akerouanton
Member Author

I'm going to close this PR in favor of #47768. As commented here, this implementation is a dead end.
