New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libnet/ipam: Lazily sub-divide pools into subnets #43033
Conversation
Hey, awesome that somebody is giving ipv6 some love 👍 |
@thaJeztah PTAL 🙂 (I don't know who else I should ping. Maybe arkodg?) |
So.. about 5 month later... |
a453813
to
2812533
Compare
@akerouanton looks like some failures in a test (may need some extra options in it);
|
2812533
to
1f47715
Compare
@thaJeztah Ah, looks like a bad rebase. Should be fixed now. |
@thaJeztah Any chance to have another look at this? Would be really nice to have that fixed! |
@thaJeztah @cpuguy83 Friendly ping 😄 . We've added IPv6 networking support to the gitlab-runner, but we have no way to use it as IPv6 networking in Docker is still broken, when used in this way. Is there any way I can support the efforts here? |
@akerouanton Would you mind to implement these small requested changes? It would really help us to use Docker with IPv6 🥺 |
9dce7c6
to
1fe1d92
Compare
@akerouanton Looks like the linter caught a typo;
And there's one test-failure (haven't looked if its a legit failure, or a test that needs updating);
|
@thaJeztah The failing test looks legit but I didn't had time to debug it. I'll fix that later and ping you when it's done 😉 |
37c7895
to
a2f5fed
Compare
@thaJeztah I'm not sure why the integration-cli job is failing. I don't reproduce this error on my computer. Could you take a look please? |
Hm... not sure; would have to check why it fails. I can give it a kick (in case there's some flakiness in that test, or (e.g.) some other daemon was still running). Copied the error below (in case we need to revisit)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am in the process of cleaning up the IPAM packages (#44827 + WIP) to get rid of global mutable state and dead code. (Fun fact: ipam.NewAllocator
is only ever called with nil
datastores! All the datastore-related code can be thrown out.) Let's put this PR on hold until it can be rebased on top of the IPAM cleanup & refactor work, as that will open up a lot more flexibility to implement on-demand subnet splitting more cleanly.
I have spotted a much less invasive short-term mitigation for the IPv6 memory exhaustion issue which would be feasible to backport to 23.0: set a cap on the number of networks ipamutils.splitNetworks
will return. 2^24 networks per address space should be well beyond what could be allocated without running into other limitations, and small enough to not exhaust all memory.
libnetwork/ipam/allocator.go
Outdated
// resetCollections indicates whether the predefined NetworkSplitter can be reset when the end of these | ||
// collections is reached. This happens when the whole collection has been consumed but some of its subnets were | ||
// allocated and then released. | ||
resetCollections map[string]map[ipamutils.IPVersion]bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The need for resetCollections
makes me think that NetworkSplitter
as currently designed is not the right abstraction for allocating subnets. Interestingly, Allocator
uses a pretty memory-efficient data structure (bitseq.Handle
) for keeping track of address allocations within a network; keeping track of allocations within a contiguous range of networks is such a similar problem that the same strategy could work. And as luck would have it, I've been working on refactoring and documenting that data type! #44864
7165d51
to
9f5c4c8
Compare
predefined: pdf, | ||
subnets: map[netip.Prefix]*PoolData{}, | ||
predefined: map[ipamutils.IPVersion]*ipamutils.Subnetter{ | ||
ipamutils.IPv4: predefined.V4(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy with that: an address space with mixed address families feels weird. On the other hand ReleasePool
currently relies on the pool's string identifier to determine what address space is involved.
@corhere @thaJeztah I can make a follow-up PR to make addrSpace
handle a single address family if you think it'd be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, an address space with mixed address families does feel weird. I like the idea of having four addrSpace
structs for each of (local, global) x (IPv4, IPv6). Especially since it could be coded such that no IPv4/IPv6 enumeration is needed.
) | ||
|
||
// IPVersion represents the type of IP address. | ||
type IPVersion int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@corhere @thaJeztah Is it okay if I introduce this new IPVersion type? Otherwise I can remove it and work on unifying all IPVersion types used across libnetwork.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code can be structured such that no IPVersion constants are needed.
A new Subnetter structure is added to lazily sub-divide an address pool into subnets. This fixes moby#40275. Prior to this change, the list of NetworkToSplit was eagerly split into smaller subnets when ipamutils package was loaded, when ConfigGlobalScopeDefaultNetworks was called or when the function SetDefaultIPAddressPool from the default IPAM driver was called. In the latter case, if the list of NetworkToSplit contained an IPv6 prefix, eagerly enumerating all subnets could eat all the available memory. For instance, fd00::/8 split into /96 would take ~5*10^27 bytes. Although this change trades memory consumption for computation cost, the Subnetter is used by libnetwork/ipam package in such a way that it only have to compute the address of the next subnet. When the Subnetter reach the end of NetworkToSplit, it's resetted by libnetwork/ipam only if there were some subnets released beforehand. In such case, ipam package might iterate over all the subnets before finding one available. Also, the Subnetter leverages the newly introduced ipbits package, which handles IPv6 addresses correctly. Before this commit, a bitwise shift was overflowing uint64 and thus only a single subnet could be enumerated from an IPv6 prefix. This fixes moby#42801. Signed-off-by: Albin Kerouanton <albinker@gmail.com>
9f5c4c8
to
b40f248
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's with all the "follow-up PR" talk? Is there some time pressure to get large pools fixed ASAP?
// Subnetter is an iterator lazily enumerating subnets from its NetworkToSplit. IPv6 | ||
// subnets can't be enumerated eagerly, otherwise they might end up taking way too much | ||
// memory (eg. fd00::/8 split into /96). See https://github.com/moby/moby/issues/40275. | ||
// Subnetter is not safe for concurrent use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Subnetter is not safe for concurrent use. |
predefined: pdf, | ||
subnets: map[netip.Prefix]*PoolData{}, | ||
predefined: map[ipamutils.IPVersion]*ipamutils.Subnetter{ | ||
ipamutils.IPv4: predefined.V4(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, an address space with mixed address families does feel weird. I like the idea of having four addrSpace
structs for each of (local, global) x (IPv4, IPv6). Especially since it could be coded such that no IPv4/IPv6 enumeration is needed.
var ErrNoMoreSubnet = errors.New("no more subnet available") | ||
|
||
// NextSubnet returns a new subnet from its NetworkToSplit, or an error if all subnets have already been enumerated. | ||
func (s *Subnetter) NextSubnet() (netip.Prefix, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (s *Subnetter) NextSubnet() (netip.Prefix, error) { | |
func (s *Subnetter) NextSubnet() (subnet netip.Prefix, ok bool) { |
// reset indicates whether the predefined subnetters can be reset when all subnets | ||
// have been enumerated but some have been released. | ||
reset map[ipamutils.IPVersion]bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't like this whole concept of holding the mutable iterator state with the collection being iterated over, and having this "can be reset" flag to paper over the issues stemming from that design. For one, there is no clean way to iterate over a permutation of the collection without repeating values. A concrete toy example might better illustrate what I'm trying to say.
Consider the following iterator, where ^
signifies the current position.
[ 1 2 3 4 5 ]
^
With an API similar to Subnetter
, attempting to iterate over the whole collection starting from the current position would yield the sequence 2 3 4 5 1 2 3 4 5
. If one could assume the values in the collection are all unique, the iterating code could remember that 2
was the first value iterated over and stop once it iterates over 2
again. But that would leave the current position incremented one past where iteration "stopped," a fencepost error.
I think a better design would be to separate the iterator from the collection. Make the collection immutable and the iterator ephemeral, yielding exactly one permutation of the collection. Here's a sketch of what that could look like:
// Cursor returns a cursor positioned at the start of the pool.
func (SubnetPool) Cursor() SubnetCursor
// Next returns a cursor moved forward by one position.
func (SubnetCursor) Advance() SubnetCursor
// Iter returns a new iterator over a permutation of the pool starting from the cursor position.
func (SubnetCursor) Iter() *SubnetIter
// Next advances the iterator cursor by one and returns the subnet at that position.
// If advancing the cursor would position it at a subnet
// which has already been returned by a prior call to Next,
// the cursor is not advanced and ok=false is returned.
func (*SubnetIter) Next() (v netip.Prefix, ok bool)
// Cursor returns a cursor for the current position.
func (SubnetIter) Cursor() SubnetCursor
The idea is to use the cursor in the same way as predefinedStartIndex
: remember the last place iteration terminated so the next round of iteration knows where to start and stop.
) | ||
|
||
// IPVersion represents the type of IP address. | ||
type IPVersion int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code can be structured such that no IPVersion constants are needed.
So, after some more tests, it turns out this PR makes clear that either the way we check for overlapping subnets is quite naive (it works well for IPv4, but not for IPv6) and/or the current For instance, if a user specify a default address pool with the base CIDR $ cat /etc/docker/daemon.json
{
"log-level": "info",
"bip": "10.0.0.1/8",
"fixed-cidr": "10.0.0.0/8",
"default-address-pools": [
{ "base": "10.0.0.0/8", "size": 30 }
]
}
$ docker network create testnet1 Now, another case where that same bug happens. Two default address pools are defined: the first one yielding a single subnet and matching the bip/fixed-cidr params, and another, random, address pool (eg. $ cat /etc/docker/daemon.json
{
"bip": "172.19.0.1/24",
"fixed-cidr": "172.19.0.0/24",
"default-address-pools": [
{ "base": "172.19.0.0/24", "size": 24 },
{ "base": "10.0.0.0/8", "size": 24 }
]
}
$ docker network create --subnet=10.0.0.0/8 testnet1
$ docker network create testnet2 Since I can only come up with these two options to fix this issue:
|
Earlier response
Realistically that many subnets is not going to be used. Current defaults AFAIK has a much smaller amount of subnets available in the pool? (IIRC, this is poorly documented though) The issue here though is:
Only one of those configurations is valid to support at a time, and could be detected when starting the daemon?
Same logic applies for IPv6.
This is a bit more interesting. But same concern, it's not that pragmatic to iterate through unallocated pools like that when realistically the pools already convey reservation of the This check is only important when creating a new network correct? Networks that are created persist their config with the assigned subnet, and pragmatically, how large does the number of assigned subnets get? It isn't likely to be that expensive to track with a hashmap is it? EDIT: I might have misunderstood, is this about:
Doesn't really change my feedback though either way.
Why is it not viable to track what subnets are actually in use? The pool is just defining blocks of address space that can be assigned a subnet of
Seems we're on the same page 😀 👍 Kubernetes uses bitmaps apparently. Roaring compressed bitmaps were adopted as an improvement. There is some recent related work on k8s, so I think they're still preferring that. There's also a nice article on investigating a few data structures for working with IPs (Apnic), although I'm not sure how well any handle range queries to detect overlapping subnets 😅 I put something together with some rust code, data structure isn't simpler but I deferred that to existing libraries. impl SubnetTree for SubnetManager<IPv6> {
fn add(self: &mut Self, cidr_block: &str) {
let subnet = Self::subnet_from_range(cidr_block);
if let Some(overlapped) = self.tree.search(subnet).next() {
println!("{} overlaps IP range with subnet {}", cidr_block, overlapped.data);
} else {
self.tree.insert(subnet, cidr_block.to_string());
println!("Added subnet {}", cidr_block);
}
}
} Main logic is mostly that, and can be used like this: fn main() {
let mut subnets = SubnetManager::<IPv6>::new();
let existing = ["fd00:feed:face:cafe::/64", "fd00:feed:face:beef::/64"];
// These add without issue:
for cidr_block in existing {
subnets.add(cidr_block);
}
// These will fail due to overlap:
subnets.add("fd00:feed:face:beef::/48");
subnets.add("fd00:feed:face::/48");
} Works with UPDATE Nov 2023: Looks like a PR similar to the overlap check in my above example was implemented here: #46755 |
I bundled these two commits together as the second one is easier to test with changes done in the first one. I can still break this PR into two parts and rewrite the unit test of the second commit to test it with the current master branch. Let me know if you prefer that (eg. if you want to cherry-pick this change into older releases or because it's a big diff to review).
- 1st commit
This commit resolves #40275 by implementing a custom iterator
named NetworkSplitter. It splits a set of NetworkToSplit into smaller
subnets on-demand by calling its Get method.
Prior to this change, the list of NetworkToSplit was split into smaller
subnets when ConfigLocalScopeDefaultNetworks or
ConfigGlobalScopeDefaultNetworks were called or when ipamutils package
was loaded. When one of the Config function was called with an IPv6 net
to split into small subnets all the available memory was consumed. For
instance, fd00::/8 split into /96 would take ~5*10^27 bytes.
Although this change trades memory consumption for computation cost, the
NetworkSplitter is used by libnetwork/ipam package in such a way that it
only have to compute the address of the next subnet. When
NetworkSplitter reach the end of NetworkToSplit, it's resetted by
libnetwork/ipam only if there were some subnets released beforehand. In
such case, ipam package might iterate over all the subnets before
finding one available subnet. This is the worst-case but it shall not be
really impactful as either many subnets exists (more than one host can
handle) or not so many subnets exists and full iteration over
NetworkSplitter is fast.
- 2nd commit
Prior to this change, IPv6 subnets generation was broken because of a
bitwise shift overflowing uint64.
To solve this issue, addIntToIP has been changed to take an addend and
computes two ordinals applied to respectively the lower 64 bits and the
upper 64 bits of IPv6 addresses.
Nothing change for IPv4 (ie. upperOrdinal is always 0).
- How to verify it
I added unit tests to confirm these changes are okay. Moreover, I tried to run dockerd with following config:
When starting dockerd v20.10.10 with this config, the daemon crashes with this panic message message:
runtime error: makeslice: cap out of range
. When starting a patched version of dockerd, it doesn't crash.- Description for the changelog