libnet/ipams/default: introduce a linear allocator #47768
base: master
Conversation
Force-pushed from aae4040 to 9c6196f, then from 9c6196f to 59e3d2a.
LGTM!
```diff
 	}
 
-	if p.Addr().Is4() {
-		v4 = append(v4, p)
+	if n.Base.Addr().Is4() {
```
This hasn't changed, but `Is4` is false for an IPv4-mapped IPv6 address. It might be worth checking for that and storing the unmapped prefix in the IPv4 list, or just bailing out?

(We don't deal with mapped addresses in command line options either, but it might be less obvious here: the address pool just won't do anything useful. I'm not quite sure why someone would want to write IPv4 addresses as IPv6, but there's an issue somewhere asking us to allow it.)
@corhere Please make sure this won't break Swarm in any way (e.g. when a new leader is elected, etc.).
This normalization process does two things:

- Unmap IPv4-mapped IPv6 addrs. This ensures such address pools are part of the IPv4 address space.
- Mask the host ID. This was done by newAddrSpace, but `splitByIPFamily` is already validating / normalizing the address pools.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Nothing was validating whether address pools' `base` prefix was larger than the target subnet `size` they're associated with. Since such invalid address pools would yield no subnets, the error could go unnoticed.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Force-pushed from 59e3d2a to 304e175.
Review WIP; I haven't even finished with address_space.go. I'll be back tomorrow.
```go
var last *ipamutils.NetworkToSplit
var discarded int
for i, imax := 0, len(predefined); i < imax; i++ {
	p := predefined[i-discarded]
	if last != nil && last.Overlaps(p.Base) {
		predefined = slices.Delete(predefined, i-discarded, i-discarded+1)
		discarded++
		continue
	}
	last = p
}
```
Since the `slices` package is already being used, may as well take full advantage.
```diff
-var last *ipamutils.NetworkToSplit
-var discarded int
-for i, imax := 0, len(predefined); i < imax; i++ {
-	p := predefined[i-discarded]
-	if last != nil && last.Overlaps(p.Base) {
-		predefined = slices.Delete(predefined, i-discarded, i-discarded+1)
-		discarded++
-		continue
-	}
-	last = p
-}
+predefined = slices.CompactFunc(predefined, func(last, p *ipamutils.NetworkToSplit) bool {
+	return last.Overlaps(p.Base)
+})
```
```go
for i, allocated := range aSpace.allocated {
	if nw.Addr().Compare(allocated.Addr()) < 0 {
```
`aSpace.allocated` is a sorted slice, which means binary searching is possible. Turn that O(n) search into O(log n) time complexity!

```go
func (aSpace *addrSpace) allocatePool(nw netip.Prefix) error {
	n, _ := slices.BinarySearchFunc(aSpace.allocated, nw, func(allocated, nw netip.Prefix) int {
		return allocated.Addr().Compare(nw.Addr())
	})
	aSpace.allocated = slices.Insert(aSpace.allocated, n, nw)
	aSpace.subnets[nw] = newPoolData(nw)
	return nil
}
```
Also, are duplicate allocations allowed? It would be trivial to detect this situation and return an error instead of inserting the duplicate entry into the slice.
```go
var partialOverlap bool
var prevAlloc netip.Prefix

slices.SortFunc(reserved, netiputil.Compare)
```
This code mutates the backing array of the `reserved` slice, which is owned by the driver. I feel that violates the principle of least astonishment. How about making it part of the IPAM API contract that the `reserved` slice needs to already be sorted?
The new allocator should work fine with Swarm. The CNM network allocator replays allocations as static assignments if there is an existing allocation in the Swarm state.
moby/libnetwork/cnmallocator/networkallocator.go
Lines 866 to 873 in 4554d87
```go
// If there is non-nil IPAM state always prefer those subnet
// configs over Spec configs.
if n.IPAM != nil {
	ipamConfigs = n.IPAM.Configs
} else if n.Spec.IPAM != nil {
	ipamConfigs = make([]*api.IPAMConfig, len(n.Spec.IPAM.Configs))
	copy(ipamConfigs, n.Spec.IPAM.Configs)
}
```
```go
a []netip.Prefix
b []netip.Prefix
```
```diff
-	a []netip.Prefix
-	b []netip.Prefix
+	a, b []netip.Prefix
```
```go
// doubleCursor is used to iterate on both 'a' and 'b' at the same time while
// maintaining the total order that would arise if both were merged and then
// sorted. Both 'a' and 'b' have to be sorted beforehand.
type doubleCursor struct {
```
If it iterates over collections, why not just call it an iterator? Iterators are being added to the language in Go 1.23. And the operation it is performing is more commonly known as "merge." (C++, Rust 1, Rust 2, Go proposal)
```diff
-// doubleCursor is used to iterate on both 'a' and 'b' at the same time while
-// maintaining the total order that would arise if both were merged and then
-// sorted. Both 'a' and 'b' have to be sorted beforehand.
-type doubleCursor struct {
+// mergeIter is used to iterate on both 'a' and 'b' at the same time while
+// maintaining the total order that would arise if both were merged and then
+// sorted. Both 'a' and 'b' have to be sorted beforehand.
+type mergeIter struct {
```
```go
func (dc *doubleCursor) Inc() {
	if dc.lastA {
		dc.ia++
	} else {
		dc.ib++
	}
}
```
Calling `dc.Inc(); dc.Inc()` would misbehave as `lastA` does not get updated. I worry that (even with documentation) someone would try something like that in the future and shoot themselves in the foot. It doesn't look like it would be too hard to assign to `lastA` inside `Inc()` and make `Get()` a pure function.
```go
for i, allocated := range aSpace.allocated {
	if allocated.Addr().Compare(nw.Addr()) == 0 && allocated.Bits() == nw.Bits() {
```
Binary search?
```diff
 localAddressPools := ipamutils.GetLocalScopeDefaultNetworks()
 if len(lAddrPools) > 0 {
-	var err error
-	localAddressPools, err = ipamutils.SplitNetworks(lAddrPools)
-	if err != nil {
-		return err
-	}
+	localAddressPools = lAddrPools
 }
 
 globalAddressPools := ipamutils.GetGlobalScopeDefaultNetworks()
 if len(gAddrPools) > 0 {
-	var err error
-	globalAddressPools, err = ipamutils.SplitNetworks(gAddrPools)
-	if err != nil {
-		return err
-	}
+	globalAddressPools = gAddrPools
 }
 
 a, err := NewAllocator(localAddressPools, globalAddressPools)
```
This can be written both more compactly, and also (slightly) more efficiently by not doing unnecessary work when the pools are provided in the arguments.
```go
if len(lAddrPools) == 0 {
	lAddrPools = ipamutils.GetLocalScopeDefaultNetworks()
}
if len(gAddrPools) == 0 {
	gAddrPools = ipamutils.GetGlobalScopeDefaultNetworks()
}
a, err := NewAllocator(lAddrPools, gAddrPools)
```
```go
// prefix is found to overlap but users care enough about it being associated
// to a Docker network they can still rely on static allocation.
//
// For IPv6, the 2nd heuristics isn't applied as there's no such thing as
```
```diff
-// For IPv6, the 2nd heuristics isn't applied as there's no such thing as
+// For IPv6, the 2nd heuristic isn't applied as there's no such thing as
```
```go
// We don't really care if os.ReadFile fails here. It either doesn't exist,
// or we can't read it for some reason.
if rc, err := os.ReadFile(resolvconf.Path()); err == nil {
	reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)
```
```diff
-	reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)
+	reserved = slices.DeleteFunc(resolvconf.GetNameserversAsPrefix(rc), func(p netip.Prefix) bool { return p.Addr().Is6() != v6 })
```
```go
	reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)
}

return append(reserved, queryOnLinkRoutes(v6)...)
```
```diff
-return append(reserved, queryOnLinkRoutes(v6)...)
+if !v6 {
+	reserved = append(reserved, queryOnLinkRoutes()...)
+}
```
```go
// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Since there's no such thing as on-link routes for IPv6, this function
// always returns an empty list.
func queryOnLinkRoutes(v6 bool) []netip.Prefix {
```
```diff
-// queryOnLinkRoutes returns a list of on-link routes available on the host.
-// Since there's no such thing as on-link routes for IPv6, this function
-// always returns an empty list.
-func queryOnLinkRoutes(v6 bool) []netip.Prefix {
+// queryOnLinkRoutes returns a list of on-link routes available on the host.
+// Only IPv4 prefixes are returned as there's no such thing as on-link
+// routes for IPv6.
+func queryOnLinkRoutes() []netip.Prefix {
```
```diff
 	}
 
-	if p.Addr().Is4() {
-		v4 = append(v4, p)
+	n.Base = netip.PrefixFrom(n.Base.Addr().Unmap(), n.Base.Bits()).Masked()
```
```diff
-n.Base = netip.PrefixFrom(n.Base.Addr().Unmap(), n.Base.Bits()).Masked()
+n.Base, _ = n.Base.Addr().Unmap().Prefix(n.Base.Bits())
```
- What I did
The previous allocator was subnetting address pools eagerly when the daemon started, and would then just iterate over that list whenever RequestPool was called. This was leading to high memory usage whenever IPv6 pools were configured with a target subnet size too different from the pool's prefix size.
For instance: pool = fd00::/8, target size = /64 -- 2 ^ (64-8) subnets would be generated upfront. This would take approx. 9 * 10^18 bits -- way too much for any human computer in 2024.
Another noteworthy issue: the previous implementation was allocating a subnet, and then in another layer was checking whether the allocation conflicted with some 'reserved networks'. If so, the allocation would be retried, etc. To make it worse, 'reserved networks' would be recomputed on every iteration. This is totally inefficient, as there could be 'reserved networks' that fully overlap a given address pool (or many!).
To fix this issue, a new field `Exclude` is added to `RequestPool`. It's up to each driver to take it into account. Since we don't know whether this retry loop is useful for some remote IPAM driver, it's reimplemented bug-for-bug directly in the remote driver.

The new allocator uses a linear-search algorithm. It takes advantage of all lists (predefined pools, allocated subnets and reserved networks) being sorted, and logically combines 'allocated' and 'reserved' through a 'double cursor' to iterate on both lists at the same time while preserving the total order. At the same time, it iterates over 'predefined' pools and looks for the first empty space that would be a good fit.
Currently, the size of the allocated subnet is still dictated by each 'predefined' pool. We should consider hardcoding that size instead, and let users specify what subnet size they want. This wasn't possible before as the subnets were generated upfront. The new allocator should be able to deal with this easily.
The method used for static allocation has been updated to make sure the ascending order of 'allocated' is preserved. It's bug-for-bug compatible with the previous implementation.
One consequence of this new algorithm is that we don't keep track of where the last allocation happened, we just allocate the first free subnet we find.
Before:

- Allocate: 10.0.1.0/24, 10.0.2.0/24; Deallocate: 10.0.1.0/24; Allocate: 10.0.3.0/24

Now, the 3rd allocation would yield 10.0.1.0/24 once again.
As it doesn't change the semantics of the allocator, there's no reason to worry about that.
Finally, about 'reserved networks'. The heuristics we use are now properly documented. It was discovered that we don't check routes for IPv6 allocations -- this can't be changed because there's no such thing as on-link routes for IPv6.
(Kudos to Rob Murray for coming up with the linear-search idea.)
- How to verify it
CI -- a bunch of tests have been added, some have been rewritten.
Or manually by creating, deleting and re-creating networks.
- Description for the changelog
- Introduce a new subnet allocator that can deal with IPv6 address pools of any size
- A picture of a cute animal (not mandatory but encouraged)