
libnet/ipams/default: introduce a linear allocator #47768

Open · wants to merge 3 commits into master from libnet-ipam-linear-allocator

Conversation


@akerouanton akerouanton commented Apr 26, 2024

- What I did

The previous allocator eagerly subnetted address pools when the daemon started, and then just iterated over that list whenever RequestPool was called. This led to high memory usage whenever IPv6 pools were configured with a target subnet size too different from the pool's prefix size.

For instance: pool = fd00::/8, target size = /64 -- 2^(64-8) subnets would be generated upfront. At 128 bits per prefix, that's approx. 9 * 10^18 bits (about an exabyte) -- way too much for any computer in 2024.
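A quick sanity check on those numbers (standalone arithmetic, not allocator code): 2^56 generated /64 prefixes at one 128-bit address apiece is 2^63 bits, i.e. exactly 1 EiB.

```go
package main

import "fmt"

func main() {
	// Number of /64 subnets that fit in a /8 pool: 2^(64-8).
	count := uint64(1) << (64 - 8)
	// Lower bound on storage: one 128-bit address per generated prefix.
	bits := count * 128
	fmt.Printf("%d subnets, %d bits (~%d EiB)\n", count, bits, bits/8/(1<<60))
	// → 72057594037927936 subnets, 9223372036854775808 bits (~1 EiB)
}
```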

Another noteworthy issue: the previous implementation allocated a subnet, and then another layer checked whether the allocation conflicted with some 'reserved networks'. If so, the allocation was retried, and so on. To make things worse, 'reserved networks' were recomputed on every iteration. This is wildly inefficient, as some 'reserved networks' could fully overlap a given address pool (or many!).

To fix this issue, a new field Exclude is added to RequestPool. It's up to each driver to take it into account. Since we don't know whether this retry loop is useful for some remote IPAM driver, it's reimplemented bug-for-bug directly in the remote driver.

The new allocator uses a linear-search algorithm. It takes advantage of all lists (predefined pools, allocated subnets and reserved networks) being sorted and logically combines 'allocated' and 'reserved' through a 'double cursor' to iterate on both lists at the same time while preserving the total order. At the same time, it iterates over 'predefined' pools and looks for the first empty space that would be a good fit.

Currently, the size of the allocated subnet is still dictated by each 'predefined' pool. We should consider hardcoding that size instead, and let users specify what subnet size they want. This wasn't possible before, as the subnets were generated upfront; the new allocator should be able to deal with it easily.

The method used for static allocation has been updated to make sure the ascending order of 'allocated' is preserved. It's bug-for-bug compatible with the previous implementation.

One consequence of this new algorithm is that we don't keep track of where the last allocation happened; we just allocate the first free subnet we find.

Before:

  • Allocate 10.0.1.0/24, then 10.0.2.0/24; deallocate 10.0.1.0/24; the next allocation yields 10.0.3.0/24.

Now, that 3rd allocation yields 10.0.1.0/24 once again.

As this doesn't change the semantics of the allocator, there's no reason to worry about it.

Finally, about 'reserved networks': the heuristics we use are now properly documented. It was discovered that we don't check routes for IPv6 allocations -- this can't be changed, as there's no such thing as on-link routes for IPv6.

(Kudos to Rob Murray for coming up with the linear-search idea.)

- How to verify it

CI -- a bunch of tests have been added, some have been rewritten.

Or manually by creating, deleting and re-creating networks.

- Description for the changelog

- Introduce a new subnet allocator that can deal with IPv6 address pools of any size

- A picture of a cute animal (not mandatory but encouraged)

@akerouanton akerouanton added status/2-code-review kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/networking impact/changelog area/networking/ipam labels Apr 26, 2024
@akerouanton akerouanton added this to the 27.0.0 milestone Apr 26, 2024
@akerouanton akerouanton self-assigned this Apr 26, 2024
@akerouanton akerouanton requested a review from robmry April 26, 2024 20:49
@akerouanton akerouanton marked this pull request as ready for review April 29, 2024 11:49

@robmry robmry left a comment


LGTM!

libnetwork/ipams/defaultipam/address_space_test.go
	if n.Base.Addr().Is4() {

This hasn't changed - but Is4 is false for an IPv4-mapped IPv6 address. It might be worth checking for that and storing the unmapped prefix in the IPv4 list, or just bailing out?

(We don't deal with mapped addresses in command line options either, but it might be less obvious here - the address pool just won't do anything useful. I'm not quite sure why someone would want to write IPv4 addresses as IPv6, but there's an issue somewhere asking us to allow it.)

libnetwork/ipbits/ipbits.go
@akerouanton

@corhere Please make sure this won't break Swarm in any way (eg. when a new leader is elected, etc...).

This normalization process does two things:

- Unmap IPv4-mapped IPv6 addrs. This ensures such address pools are
part of the IPv4 address space.
- Mask the host ID. This was previously done by newAddrSpace, but `splitByIPFamily`
is already validating / normalizing the address pools.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Nothing was validating whether an address pool's `base` prefix is
larger than the target subnet `size` it's associated with. Since such
invalid address pools would yield no subnets, the error could go
unnoticed.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton akerouanton force-pushed the libnet-ipam-linear-allocator branch from 59e3d2a to 304e175 Compare May 7, 2024 06:57
@akerouanton akerouanton requested a review from robmry May 7, 2024 06:57

@corhere corhere left a comment


Review WIP; I haven't even finished with address_space.go. I'll be back tomorrow.

Comment on lines +41 to +51
var last *ipamutils.NetworkToSplit
var discarded int
for i, imax := 0, len(predefined); i < imax; i++ {
	p := predefined[i-discarded]
	if last != nil && last.Overlaps(p.Base) {
		predefined = slices.Delete(predefined, i-discarded, i-discarded+1)
		discarded++
		continue
	}
	last = p
}

Since the slices package is already being used, may as well take full advantage.

Suggested change
var last *ipamutils.NetworkToSplit
var discarded int
for i, imax := 0, len(predefined); i < imax; i++ {
	p := predefined[i-discarded]
	if last != nil && last.Overlaps(p.Base) {
		predefined = slices.Delete(predefined, i-discarded, i-discarded+1)
		discarded++
		continue
	}
	last = p
}
predefined = slices.CompactFunc(predefined, func(last, p *ipamutils.NetworkToSplit) bool {
	return last.Overlaps(p.Base)
})

Comment on lines +120 to +121
for i, allocated := range aSpace.allocated {
if nw.Addr().Compare(allocated.Addr()) < 0 {

aSpace.allocated is a sorted slice, which means binary searching is possible. Turn that O(n) search into O(log n) time complexity!

func (aSpace *addrSpace) allocatePool(nw netip.Prefix) error {
	n, _ := slices.BinarySearchFunc(aSpace.allocated, nw, func(allocated, nw netip.Prefix) int { return allocated.Addr().Compare(nw.Addr()) })
	aSpace.allocated = slices.Insert(aSpace.allocated, n, nw)
	aSpace.subnets[nw] = newPoolData(nw)
	return nil
}

Also, are duplicate allocations allowed? It would be trivial to detect this situation and return an error instead of inserting the duplicate entry into the slice.

var partialOverlap bool
var prevAlloc netip.Prefix

slices.SortFunc(reserved, netiputil.Compare)

This code mutates the backing array of the reserved slice, which is owned by the driver. I feel that violates the principle of least astonishment. How about making it part of the IPAM API contract that the reserved slice needs to already be sorted?


@corhere corhere left a comment


The new allocator should work fine with Swarm. The CNM network allocator replays allocations as static assignments if there is an existing allocation in the Swarm state.

// If there is non-nil IPAM state always prefer those subnet
// configs over Spec configs.
if n.IPAM != nil {
ipamConfigs = n.IPAM.Configs
} else if n.Spec.IPAM != nil {
ipamConfigs = make([]*api.IPAMConfig, len(n.Spec.IPAM.Configs))
copy(ipamConfigs, n.Spec.IPAM.Configs)
}

Comment on lines +80 to +81
a []netip.Prefix
b []netip.Prefix

Suggested change
a []netip.Prefix
b []netip.Prefix
a, b []netip.Prefix

Comment on lines +76 to +79
// doubleCursor is used to iterate on both 'a' and 'b' at the same time while
// maintaining the total order that would arise if both were merged and then
// sorted. Both 'a' and 'b' have to be sorted beforehand.
type doubleCursor struct {

If it iterates over collections, why not just call it an iterator? Iterators are being added to the language in Go 1.23. And the operation it is performing is more commonly known as "merge." (C++, Rust 1, Rust 2, Go proposal)

Suggested change
// doubleCursor is used to iterate on both 'a' and 'b' at the same time while
// maintaining the total order that would arise if both were merged and then
// sorted. Both 'a' and 'b' have to be sorted beforehand.
type doubleCursor struct {
// mergeIter is used to iterate on both 'a' and 'b' at the same time while
// maintaining the total order that would arise if both were merged and then
// sorted. Both 'a' and 'b' have to be sorted beforehand.
type mergeIter struct {

Comment on lines +114 to +120
func (dc *doubleCursor) Inc() {
	if dc.lastA {
		dc.ia++
	} else {
		dc.ib++
	}
}

Calling dc.Inc(); dc.Inc() would misbehave as lastA does not get updated. I worry that (even with documentation) someone would try something like that in the future and shoot themselves in the foot. It doesn't look like it would be too hard to assign to lastA inside Inc() and make Get() a pure function.

Comment on lines +313 to +314
for i, allocated := range aSpace.allocated {
if allocated.Addr().Compare(nw.Addr()) == 0 && allocated.Bits() == nw.Bits() {

Binary search?

Comment on lines 30 to 40
localAddressPools := ipamutils.GetLocalScopeDefaultNetworks()
if len(lAddrPools) > 0 {
var err error
localAddressPools, err = ipamutils.SplitNetworks(lAddrPools)
if err != nil {
return err
}
localAddressPools = lAddrPools
}

globalAddressPools := ipamutils.GetGlobalScopeDefaultNetworks()
if len(gAddrPools) > 0 {
var err error
globalAddressPools, err = ipamutils.SplitNetworks(gAddrPools)
if err != nil {
return err
}
globalAddressPools = gAddrPools
}

a, err := NewAllocator(localAddressPools, globalAddressPools)

This can be written both more compactly, and also (slightly) more efficiently by not doing unnecessary work when the pools are provided in the arguments.

if len(lAddrPools) == 0 {
        lAddrPools = ipamutils.GetLocalScopeDefaultNetworks()
}
if len(gAddrPools) == 0 {
        gAddrPools = ipamutils.GetGlobalScopeDefaultNetworks()
}
a, err := NewAllocator(lAddrPools, gAddrPools)

// prefix is found to overlap but users care enough about it being associated
// to a Docker network they can still rely on static allocation.
//
// For IPv6, the 2nd heuristics isn't applied as there's no such thing as

Suggested change
// For IPv6, the 2nd heuristics isn't applied as there's no such thing as
// For IPv6, the 2nd heuristic isn't applied as there's no such thing as

// We don't really care if os.ReadFile fails here. It either doesn't exist,
// or we can't read it for some reason.
if rc, err := os.ReadFile(resolvconf.Path()); err == nil {
reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)

Suggested change
reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)
reserved = slices.DeleteFunc(resolvconf.GetNameserversAsPrefix(rc), func(p netip.Prefix) bool { return p.Addr().Is6() != v6 })

reserved = filterByFamily(resolvconf.GetNameserversAsPrefix(rc), v6)
}

return append(reserved, queryOnLinkRoutes(v6)...)

Suggested change
return append(reserved, queryOnLinkRoutes(v6)...)
if !v6 {
	reserved = append(reserved, queryOnLinkRoutes()...)
}

Comment on lines +83 to +86
// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Since there's no such thing as on-link routes for IPv6, this function
// always returns an empty list.
func queryOnLinkRoutes(v6 bool) []netip.Prefix {

Suggested change
// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Since there's no such thing as on-link routes for IPv6, this function
// always returns an empty list.
func queryOnLinkRoutes(v6 bool) []netip.Prefix {
// queryOnLinkRoutes returns a list of on-link routes available on the host.
// Only IPv4 prefixes are returned as there's no such thing as on-link
// routes for IPv6.
func queryOnLinkRoutes() []netip.Prefix {

	n.Base = netip.PrefixFrom(n.Base.Addr().Unmap(), n.Base.Bits()).Masked()

Suggested change
n.Base = netip.PrefixFrom(n.Base.Addr().Unmap(), n.Base.Bits()).Masked()
n.Base, _ = n.Base.Addr().Unmap().Prefix(n.Base.Bits())
