
libnet/ipams/default: introduce a linear allocator #47768

Open · wants to merge 3 commits into master from libnet-ipam-linear-allocator

Conversation


@akerouanton akerouanton commented Apr 26, 2024

- What I did

The previous allocator eagerly subnetted address pools when the daemon started, and then just iterated over that list whenever RequestPool was called. This led to high memory usage whenever IPv6 pools were configured with a target subnet size too different from the pool's prefix size.

For instance: pool = fd00::/8, target size = /64 -- 2^(64-8) subnets would be generated upfront. At 128 bits per prefix, that's approx. 9 * 10^18 bits (about an exabyte) -- way too much for any computer in 2024.
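A quick sanity check on those numbers (standalone arithmetic, not allocator code): 2^56 generated /64 prefixes at one 128-bit address apiece is 2^63 bits, i.e. exactly 1 EiB.

```go
package main

import "fmt"

func main() {
	// Number of /64 subnets that fit in a /8 pool: 2^(64-8).
	count := uint64(1) << (64 - 8)
	// Lower bound on storage: one 128-bit address per generated prefix.
	bits := count * 128
	fmt.Printf("%d subnets, %d bits (~%d EiB)\n", count, bits, bits/8/(1<<60))
	// → 72057594037927936 subnets, 9223372036854775808 bits (~1 EiB)
}
```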

Another noteworthy issue: the previous implementation allocated a subnet, and then another layer checked whether the allocation conflicted with some 'reserved networks'. If so, the allocation was retried, and so on. To make things worse, 'reserved networks' were recomputed on every iteration. This is wildly inefficient, as some 'reserved networks' could fully overlap a given address pool (or many!).

To fix this issue, a new field Exclude is added to RequestPool. It's up to each driver to take it into account. Since we don't know whether this retry loop is useful for some remote IPAM driver, it's reimplemented bug-for-bug directly in the remote driver.

The new allocator uses a linear-search algorithm. It takes advantage of all lists (predefined pools, allocated subnets and reserved networks) being sorted and logically combines 'allocated' and 'reserved' through a 'double cursor' to iterate on both lists at the same time while preserving the total order. At the same time, it iterates over 'predefined' pools and looks for the first empty space that would be a good fit.

Currently, the size of the allocated subnet is still dictated by each 'predefined' pool. We should consider hardcoding that size instead, and let users specify what subnet size they want. This wasn't possible before, as the subnets were generated upfront; the new allocator should be able to deal with it easily.

The method used for static allocation has been updated to make sure the ascending order of 'allocated' is preserved. It's bug-for-bug compatible with the previous implementation.

One consequence of this new algorithm is that we don't keep track of where the last allocation happened; we just allocate the first free subnet we find.

Before:

  • Allocate 10.0.1.0/24, then 10.0.2.0/24; deallocate 10.0.1.0/24; the next allocation yields 10.0.3.0/24.

Now, that 3rd allocation yields 10.0.1.0/24 once again.

As this doesn't change the semantics of the allocator, there's no reason to worry about it.

Finally, about 'reserved networks': the heuristics we use are now properly documented. It was discovered that we don't check routes for IPv6 allocations -- this can't be changed, as there's no such thing as on-link routes for IPv6.

(Kudos to Rob Murray for coming up with the linear-search idea.)

- How to verify it

CI -- a bunch of tests have been added, some have been rewritten.

Or manually by creating, deleting and re-creating networks.

- Description for the changelog

- Introduce a new subnet allocator that can deal with IPv6 address pools of any size

- A picture of a cute animal (not mandatory but encouraged)

@akerouanton akerouanton added status/2-code-review kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/networking impact/changelog area/networking/ipam labels Apr 26, 2024
@akerouanton akerouanton added this to the 27.0.0 milestone Apr 26, 2024
@akerouanton akerouanton self-assigned this Apr 26, 2024
@akerouanton akerouanton requested a review from robmry April 26, 2024 20:49
@akerouanton akerouanton marked this pull request as ready for review April 29, 2024 11:49

@robmry robmry left a comment


LGTM!

libnetwork/ipams/defaultipam/address_space_test.go
	if n.Base.Addr().Is4() {

This hasn't changed - but Is4 is false for an IPv4-mapped IPv6 address. It might be worth checking for that and storing the unmapped prefix in the IPv4 list, or just bailing out?

(We don't deal with mapped addresses in command line options either, but it might be less obvious here - the address pool just won't do anything useful. I'm not quite sure why someone would want to write IPv4 addresses as IPv6, but there's an issue somewhere asking us to allow it.)

libnetwork/ipbits/ipbits.go
@akerouanton

@corhere Please make sure this won't break Swarm in any way (eg. when a new leader is elected, etc...).

This normalization process does two things:

- Unmap IPv4-mapped IPv6 addrs. This ensures such address pools are
part of the IPv4 address space.
- Mask the host ID. This was previously done by newAddrSpace, but `splitByIPFamily`
is already validating / normalizing the address pools.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Nothing was validating whether an address pool's `base` prefix is
larger than the target subnet `size` it's associated with. Since such
invalid address pools would yield no subnets, the error could go
unnoticed.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton akerouanton force-pushed the libnet-ipam-linear-allocator branch from 59e3d2a to 304e175 Compare May 7, 2024 06:57
@akerouanton akerouanton requested a review from robmry May 7, 2024 06:57

@corhere corhere left a comment


Review WIP; I haven't even finished with address_space.go. I'll be back tomorrow.

Comment on lines +41 to +51
var last *ipamutils.NetworkToSplit
var discarded int
for i, imax := 0, len(predefined); i < imax; i++ {
	p := predefined[i-discarded]
	if last != nil && last.Overlaps(p.Base) {
		predefined = slices.Delete(predefined, i-discarded, i-discarded+1)
		discarded++
		continue
	}
	last = p
}

Since the slices package is already being used, may as well take full advantage.

Suggested change
var last *ipamutils.NetworkToSplit
var discarded int
for i, imax := 0, len(predefined); i < imax; i++ {
	p := predefined[i-discarded]
	if last != nil && last.Overlaps(p.Base) {
		predefined = slices.Delete(predefined, i-discarded, i-discarded+1)
		discarded++
		continue
	}
	last = p
}
predefined = slices.CompactFunc(predefined, func(last, p *ipamutils.NetworkToSplit) bool {
	return last.Overlaps(p.Base)
})

Comment on lines +120 to +121
for i, allocated := range aSpace.allocated {
if nw.Addr().Compare(allocated.Addr()) < 0 {

aSpace.allocated is a sorted slice, which means binary searching is possible. Turn that O(n) search into O(log n) time complexity!

func (aSpace *addrSpace) allocatePool(nw netip.Prefix) error {
	n, _ := slices.BinarySearchFunc(aSpace.allocated, nw, func(allocated, nw netip.Prefix) int { return allocated.Addr().Compare(nw.Addr()) })
	aSpace.allocated = slices.Insert(aSpace.allocated, n, nw)
	aSpace.subnets[nw] = newPoolData(nw)
	return nil
}

Also, are duplicate allocations allowed? It would be trivial to detect this situation and return an error instead of inserting the duplicate entry into the slice.

var partialOverlap bool
var prevAlloc netip.Prefix

slices.SortFunc(reserved, netiputil.Compare)

This code mutates the backing array of the reserved slice, which is owned by the driver. I feel that violates the principle of least astonishment. How about making it part of the IPAM API contract that the reserved slice needs to already be sorted?


@corhere corhere left a comment


The new allocator should work fine with Swarm. The CNM network allocator replays allocations as static assignments if there is an existing allocation in the Swarm state.

// If there is non-nil IPAM state always prefer those subnet
// configs over Spec configs.
if n.IPAM != nil {
ipamConfigs = n.IPAM.Configs
} else if n.Spec.IPAM != nil {
ipamConfigs = make([]*api.IPAMConfig, len(n.Spec.IPAM.Configs))
copy(ipamConfigs, n.Spec.IPAM.Configs)
}

Comment on lines +80 to +81
a []netip.Prefix
b []netip.Prefix

Suggested change
a []netip.Prefix
b []netip.Prefix
a, b []netip.Prefix

Comment on lines +76 to +79
// doubleCursor is used to iterate on both 'a' and 'b' at the same time while
// maintaining the total order that would arise if both were merged and then
// sorted. Both 'a' and 'b' have to be sorted beforehand.
type doubleCursor struct {

If it iterates over collections, why not just call it an iterator? Iterators are being added to the language in Go 1.23. And the operation it is performing is more commonly known as "merge." (C++, Rust 1, Rust 2, Go proposal)

Suggested change
// doubleCursor is used to iterate on both 'a' and 'b' at the same time while
// maintaining the total order that would arise if both were merged and then
// sorted. Both 'a' and 'b' have to be sorted beforehand.
type doubleCursor struct {
// mergeIter is used to iterate on both 'a' and 'b' at the same time while
// maintaining the total order that would arise if both were merged and then
// sorted. Both 'a' and 'b' have to be sorted beforehand.
type mergeIter struct {

Comment on lines +114 to +120
func (dc *doubleCursor) Inc() {
	if dc.lastA {
		dc.ia++
	} else {
		dc.ib++
	}
}

Calling dc.Inc(); dc.Inc() would misbehave as lastA does not get updated. I worry that (even with documentation) someone would try something like that in the future and shoot themselves in the foot. It doesn't look like it would be too hard to assign to lastA inside Inc() and make Get() a pure function.

Comment on lines +313 to +314
for i, allocated := range aSpace.allocated {
if allocated.Addr().Compare(nw.Addr()) == 0 && allocated.Bits() == nw.Bits() {

Binary search?

Comment on lines 30 to 40
localAddressPools := ipamutils.GetLocalScopeDefaultNetworks()
if len(lAddrPools) > 0 {
var err error
localAddressPools, err = ipamutils.SplitNetworks(lAddrPools)
if err != nil {
return err
}
localAddressPools = lAddrPools
}

globalAddressPools := ipamutils.GetGlobalScopeDefaultNetworks()
if len(gAddrPools) > 0 {
var err error
globalAddressPools, err = ipamutils.SplitNetworks(gAddrPools)
if err != nil {
return err
}
globalAddressPools = gAddrPools
}

a, err := NewAllocator(localAddressPools, globalAddressPools)

This can be written both more compactly, and also (slightly) more efficiently by not doing unnecessary work when the pools are provided in the arguments.

if len(lAddrPools) == 0 {
        lAddrPools = ipamutils.GetLocalScopeDefaultNetworks()
}
if len(gAddrPools) == 0 {
        gAddrPools = ipamutils.GetGlobalScopeDefaultNetworks()
}
a, err := NewAllocator(lAddrPools, gAddrPools)

// prefix is found to overlap but users care enough about it being associated
// to a Docker network they can still rely on static allocation.
//
// For IPv6, the 2nd heuristics isn't applied as there's no such thing as

Suggested change
// For IPv6, the 2nd heuristics isn't applied as there's no such thing as
// For IPv6, the 2nd heuristic isn't applied as there's no such thing as

// We don't really care if os.ReadFile fails here. It either doesn't exist,
// or we can't read it for some reason.
if rc, err := os.ReadFile(resolvconf.Path()); err == nil {
reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)

Suggested change
reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)
reserved = slices.DeleteFunc(resolvconf.GetNameserversAsPrefix(rc), func(p netip.Prefix) bool { return p.Addr().Is6() != v6 })

reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)
}

return append(reserved, queryOnLinkRoutes(v6)...)

Suggested change
return append(reserved, queryOnLinkRoutes(v6)...)
if !v6 {
	reserved = append(reserved, queryOnLinkRoutes()...)
}

Comment on lines +83 to +86
// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Since there's no such thing as on-link routes for IPv6, this function
// always returns an empty list.
func queryOnLinkRoutes(v6 bool) []netip.Prefix {

Suggested change
// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Since there's no such thing as on-link routes for IPv6, this function
// always returns an empty list.
func queryOnLinkRoutes(v6 bool) []netip.Prefix {
// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Only IPv4 prefixes are returned as there's no such thing as on-link
// routes for IPv6.
func queryOnLinkRoutes() []netip.Prefix {

	n.Base = netip.PrefixFrom(n.Base.Addr().Unmap(), n.Base.Bits()).Masked()

Suggested change
n.Base = netip.PrefixFrom(n.Base.Addr().Unmap(), n.Base.Bits()).Masked()
n.Base, _ = n.Base.Addr().Unmap().Prefix(n.Base.Bits())
