Set temporary single CPU affinity before cgroup cpuset transition. #3923

cclerget · 2023-06-30T13:52:25Z

This handles a corner case when joining a container whose processes all run exclusively on isolated CPU cores, by forcing the kernel to schedule the runc process on the first CPU core within the cgroups cpuset.

The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time applications rely on this deterministic behavior and use the first CPU core to run a slow thread while the other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevent the runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread.

Fixes #3922 (Issue joining cgroups cpuset with kernel scheduler task "random" distribution)
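
For readers unfamiliar with the cpuset list format, here is a minimal Go sketch of picking the "first CPU core" out of a cpuset string. The helper name `firstCPU` is hypothetical and is not the code from this PR; it only assumes the usual kernel list syntax of comma-separated CPUs and ranges (for example `2-5,8`).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// firstCPU returns the lowest CPU number in a kernel-style CPU list
// such as "2-5,8". Hypothetical helper for illustration only; it does
// not validate the upper bound of a range.
func firstCPU(list string) (int, error) {
	first := -1
	for _, part := range strings.Split(strings.TrimSpace(list), ",") {
		// For a range "a-b" only the lower bound matters here.
		lo, _, _ := strings.Cut(part, "-")
		n, err := strconv.Atoi(strings.TrimSpace(lo))
		if err != nil {
			return 0, fmt.Errorf("invalid CPU list %q: %w", list, err)
		}
		if first == -1 || n < first {
			first = n
		}
	}
	return first, nil
}

func main() {
	cpu, err := firstCPU("2-5,8")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("first CPU:", cpu)
}
```

Taking the minimum rather than the first token keeps the sketch correct even if the list is not sorted.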

libcontainer/container_linux.go

libcontainer/nsenter/nsexec.c

libcontainer/container_linux_test.go

kolyshkin · 2023-06-30T20:50:35Z

Thanks for working on this. I changed it to draft until all the issues with the code and test cases are fixed, and left some minor comments.

cclerget · 2023-07-03T14:01:47Z

Thanks @kolyshkin , it should be ready for review now

lifubang · 2023-07-07T09:35:34Z

@cclerget Please rebase

cclerget · 2023-07-11T08:43:38Z

@lifubang Done

kolyshkin · 2024-04-02T21:34:45Z

libcontainer/cgroups/fs/fs.go

+	}
+
+	// Iterates until it goes to the cgroup root path.
+	for path := filepath.Clean(cpusetPath); path != defaultCgroupRoot; path = filepath.Dir(path) {


AFAIK filepath.Clean is not needed here because m.Path returns a cleaned path.

kolyshkin · 2024-04-02T21:38:25Z

libcontainer/cgroups/fs/fs.go

+		return ""
+	}
+
+	// Iterates until it goes to the cgroup root path.


Maybe it makes sense to add something like "needed for containers in which cpuset controller is not enabled -- in this case a parent cgroup is used" -- if my understanding is correct.
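
To make that fallback concrete, here is a rough Go sketch of walking from a cgroup path up toward the root until some directory passes a check. `nearestAncestor` and the `usable` callback are hypothetical names for illustration and are not the PR's code; the paths in `main` are made up as well.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// nearestAncestor walks from start up to (but not including) root and
// returns the first directory for which usable reports true, or "" if
// no such directory is found.
func nearestAncestor(start, root string, usable func(dir string) bool) string {
	root = filepath.Clean(root)
	path := filepath.Clean(start)
	for path != root {
		if usable(path) {
			return path
		}
		parent := filepath.Dir(path)
		if parent == path {
			// Reached the top of the tree without meeting root:
			// stop instead of looping forever.
			return ""
		}
		path = parent
	}
	return ""
}

func main() {
	// Pretend only the parent cgroup has the cpuset controller enabled.
	enabled := map[string]bool{"/sys/fs/cgroup/cpuset/parent": true}
	got := nearestAncestor(
		"/sys/fs/cgroup/cpuset/parent/child",
		"/sys/fs/cgroup/cpuset",
		func(dir string) bool { return enabled[dir] },
	)
	fmt.Println(got)
}
```

The `parent == path` guard is a design choice of this sketch: a plain `path != root` loop never terminates if the starting path is not located under the root.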

kolyshkin · 2024-04-02T21:44:51Z

libcontainer/process_linux.go

+			// Close the pipe to not be blocked in the parent.
+			p.comm.closeChild()


We have a defer statement at the very beginning of this function -- isn't it enough?

Not enough: the defer closes the parent side of the pipe, but we also need to close the child side, otherwise the process gets stuck.
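
As a general illustration of why the extra close matters, here is a small Go sketch that uses `os.Pipe` as a stand-in for runc's communication channel (it is not the `p.comm` implementation, and `readUntilEOF` is a hypothetical name). A reader only sees EOF once every open copy of the write end has been closed.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readUntilEOF writes msg into a pipe, closes the write end, and then
// reads until EOF. Without the Close call, io.ReadAll would block
// forever, because a reader never sees EOF while a write end is open.
func readUntilEOF(msg string) (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()

	if _, err := w.WriteString(msg); err != nil {
		w.Close()
		return "", err
	}
	// Close our copy of the write end so the reader can reach EOF.
	if err := w.Close(); err != nil {
		return "", err
	}

	data, err := io.ReadAll(r)
	return string(data), err
}

func main() {
	got, err := readUntilEOF("ready")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```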

kolyshkin · 2024-04-02T21:45:12Z

libcontainer/process_linux.go

+				// Close the pipe to not be blocked in the parent.
+				p.comm.closeChild()


kolyshkin · 2024-04-02T21:51:28Z

libcontainer/process_linux.go

+	// Use a goroutine to dedicate an OS thread.
+	go func() {
+		cpuSet := new(unix.CPUSet)
+		cpuSet.Zero()


AFAIK this is not needed, in Go everything is initialized to default values (0s in this case), and you've just instantiated a new CPUSet.
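
A quick way to see this, sketched with a local stand-in type so that only the standard library is needed: `cpuMask` below is hypothetical and merely imitates an array-backed CPU set like `unix.CPUSet`; its size and method names are assumptions of the sketch.

```go
package main

import "fmt"

// cpuMask is a local stand-in for an array-backed CPU set such as
// unix.CPUSet from golang.org/x/sys/unix. The size is arbitrary.
type cpuMask [16]uint64

// isEmpty reports whether no CPU bit is set.
func (m *cpuMask) isEmpty() bool {
	for _, word := range m {
		if word != 0 {
			return false
		}
	}
	return true
}

// set marks the given CPU index in the mask.
func (m *cpuMask) set(cpu int) {
	m[cpu/64] |= 1 << (uint(cpu) % 64)
}

func main() {
	// new returns a pointer to a zeroed value, so no explicit
	// zeroing step is required before use.
	m := new(cpuMask)
	fmt.Println("empty after new:", m.isEmpty())

	m.set(3)
	fmt.Println("empty after set:", m.isEmpty())
}
```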

kolyshkin · 2024-04-02T21:56:05Z

tests/integration/exec.bats

@@ -340,3 +340,168 @@ EOF
 	[ ${#lines[@]} -eq 1 ]
 	[[ ${lines[0]} = *"exec /run.sh: no such file or directory"* ]]
 }
+
+@test "runc exec with isolated cpus affinity temporary transition [cgroup cpuset]" {
+	requires root


need to add cgroups_cpuset to the requires list.

Same for other tests

tests/integration/exec.bats

kolyshkin · 2024-04-02T22:01:32Z

tests/integration/exec.bats

+	local all_cpus
+	all_cpus="$(cat /sys/devices/system/cpu/online)"
+
+	update_config ".linux.resources.cpu.cpus = \"$all_cpus\""
+
+	# set temporary isolated CPU affinity transition
+	update_config '.annotations += {"org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"}'
+
+	local mems
+	mems="$(cat /sys/devices/system/node/online 2>/dev/null || true)"
+	[[ -n $mems ]] && update_config ".linux.resources.cpu.mems = \"$mems\""


Perhaps you can separate this into a function (at least the mems and all_cpus part).

kolyshkin · 2024-04-02T22:08:26Z

tests/integration/exec.bats

+	# fix unbound variable in condition below
+	PLATFORM_ID=${PLATFORM_ID:-}


nit: can you use VERSION_ID instead, it looks easier?

cclerget · 2024-04-03T08:24:43Z

@kolyshkin addressed your comments in 9eb05cc, will squash the commits after approval

andreaskaris · 2024-04-03T08:54:39Z

docs/isolated-cpu-affinity-transition.md

+The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76
+in 5.7 has affected a deterministic scheduling behavior by distributing tasks
+across CPU cores within a cgroups cpuset. It means that some runc operations
+like `runc exec` might be impacted under some circumstances, by example when


Not a review, just a question (but maybe you'll decide to clarify this further). Without looking at the code, and only at this piece of documentation (and at the commit message):

will this only improve the behavior of threads launched with runc exec? (if so, then the documentation should not be "some runc operations like")

otherwise, what other commands / situations / runc operations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76). The documentation so far only mentions runc exec

related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place - in this scenario, no runc exec session is involved)

please also adjust the commit message, because "This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset." sounds ambiguous to me. Does "joining" mean connecting to the container with an exec session? (It could also mean "joining something together".)

> will this only improve the behavior of threads launched with runc exec? (if so, then the documentation should not be "some runc operations like")

Yes, it affects the runc exec operation only. I will fix that part, thanks!

> otherwise, what other commands or situations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76)

No other commands will benefit from this patch.
As for situations: for example, when Kubernetes is configured with a static policy (https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy) and --reserved-cpus contains the isolated CPUs, all pods running with resource limits/requests set to the same number of CPUs >= 1 will be granted X exclusive isolated CPUs. With such a setup and pod spec, a real-time application in the container running on those isolated CPUs could create threads with SCHED_FIFO policy on every isolated CPU except the first one, so that things like Kubernetes exec probes or kubectl exec benefit from this patch by using the first isolated CPU without interfering with the real-time application threads. See the original issue #3922 for more context.

> related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place

It won't change the behavior for new processes; only processes spawned through runc exec are impacted by this annotation.
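
As a rough sketch of how an exec path could gate on this opt-in: the annotation key and the `temporary` value below are the ones used in this PR's tests, while `wantsTemporaryAffinity` is a hypothetical helper and not the merged implementation, which may accept other values.

```go
package main

import "fmt"

// Annotation key and value as they appear in this PR's tests.
const (
	isolatedCPUAffinityTransition = "org.opencontainers.runc.exec.isolated-cpu-affinity-transition"
	temporaryTransition           = "temporary"
)

// wantsTemporaryAffinity is a hypothetical helper: it reports whether
// the container annotations opt in to the temporary transition.
// Reading from a nil map is safe in Go and yields the empty string.
func wantsTemporaryAffinity(annotations map[string]string) bool {
	return annotations[isolatedCPUAffinityTransition] == temporaryTransition
}

func main() {
	annotations := map[string]string{
		isolatedCPUAffinityTransition: temporaryTransition,
	}
	fmt.Println("with annotation:", wantsTemporaryAffinity(annotations))
	fmt.Println("without annotation:", wantsTemporaryAffinity(nil))
}
```

Checking the annotation first and returning early would keep execs without the annotation on the ordinary path.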

Awesome, thanks for the clarification and thanks for your work on this!

cclerget · 2024-04-05T06:35:13Z

@kolyshkin changes ok with you ?

kolyshkin

LGTM; please squash the commits

This handles a corner case when joining a container whose processes all run exclusively on isolated CPU cores, by forcing the kernel to schedule the runc process on the first CPU core within the cgroups cpuset.

The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time applications rely on this deterministic behavior and use the first CPU core to run a slow thread while the other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevent the runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread.

Introduces isolated CPU affinity transition OCI runtime annotation org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore the behavior during runc exec.

Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes.

Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>

cclerget · 2024-04-16T09:39:04Z

Done thanks !

MatthewHink · 2024-04-16T20:53:23Z

Thank goodness this is finally merged! Really appreciate your help everyone!

kolyshkin · 2024-05-18T00:20:24Z

The more I look into this the more I think I did a bad job reviewing this, and it needs to be redone in a different way:

There's too much logic here figuring out which CPUs to use. Runc is a low level tool and is not supposed to be that "smart".
What's worse, this logic is executed on every exec, making it slower.
Some of the logic in (*setnsProcess).start is executed even if no annotation is set, thus making ALL execs slow.
As pointed out in runtime-spec#1252 ("config: add annotation for exec isolated CPU affinity"), if we want to support this across different runtimes (e.g. crun and runc), this should not be an annotation, but rather a process parameter.

Let's fix this:

Revert this PR.
Move some of the functionality (determining which CPU to pin exec to) to upper level runtime (cri-o/containerd).
Open a PR in runtime-spec proposing the changes to runc/crun.
Open a PR to runc implementing that proposal.

NeilHanlon · 2024-05-21T02:37:39Z

While I recognize that this is a complex issue, I am unsure if reverting this is the best path forward, considering implementations of this are already deployed and in use.

This review has taken an exceedingly long time and we are now on the cusp of even more review.

It would be disappointing if we have to start this whole process over again.

kolyshkin · 2024-05-22T01:06:58Z

> while I recognize that this is a complex issue, I am unsure if reverting this is the best path forward, considering implementations of this are already deployed and in use.
>
> This review has taken an exceedingly long time and we are now on the cusp of even more review.
>
> It would be disappointing if we have to start this whole process over again.

I have to admit I did a sloppy job reviewing this; yet this is not in any of the released runc versions (and this is why it needs to be reverted now, before we officially release it).

Now, this has to be re-implemented in the right way, starting from runtime-spec (see opencontainers/runtime-spec#1253), then in runtimes (such as cri-o and containerd), then in low-level runtimes (such as runc and crun).

Any help (esp in cri-o and containerd) is appreciated.

cclerget force-pushed the issue-3922 branch from 4477139 to 2df8f93 on June 30, 2023 14:39

kolyshkin reviewed on Jun 30, 2023: libcontainer/container_linux.go (outdated, resolved)

kolyshkin reviewed on Jun 30, 2023: libcontainer/container_linux.go (outdated, resolved)

kolyshkin reviewed on Jun 30, 2023: libcontainer/nsenter/nsexec.c (outdated, resolved)

kolyshkin reviewed on Jun 30, 2023: libcontainer/container_linux_test.go (outdated, resolved)

kolyshkin marked this pull request as draft on June 30, 2023 20:49

cclerget force-pushed the issue-3922 branch 16 times, most recently from 85f2d35 to 3e05b1c on July 3, 2023 14:01

cclerget marked this pull request as ready for review on July 3, 2023 14:01

lifubang added the status/needs-rebase label on Jul 8, 2023

cclerget force-pushed the issue-3922 branch from 3e05b1c to 52fd5d8 on July 11, 2023 08:29

cclerget force-pushed the issue-3922 branch from 52fd5d8 to dc43652 on July 14, 2023 06:19

AkihiroSuda requested a review from kolyshkin on April 2, 2024 14:33

kolyshkin reviewed on Apr 2, 2024: tests/integration/exec.bats (resolved)

kolyshkin reviewed on Apr 2, 2024

kolyshkin added this to the 1.2.0 milestone on Apr 2, 2024

cclerget force-pushed the issue-3922 branch from 1c43584 to ea7b6c0 on April 3, 2024 08:23

andreaskaris reviewed on Apr 3, 2024

cclerget force-pushed the issue-3922 branch from ea7b6c0 to ea43f8e on April 3, 2024 09:42

cclerget force-pushed the issue-3922 branch from ea43f8e to 9eb05cc on April 15, 2024 08:09

kolyshkin approved these changes on Apr 16, 2024

cclerget force-pushed the issue-3922 branch from 9eb05cc to afc23e3 on April 16, 2024 06:59

kolyshkin merged commit 6a2813f into opencontainers:main on Apr 16, 2024 (38 checks passed)

kolyshkin added the impact/changelog and status/4-merge labels and removed the status/2-code-review label on Apr 16, 2024

kolyshkin mentioned this pull request on May 18, 2024: Revert "Set temporary single CPU affinity before cgroup cpuset transition" #4283 (draft)
