Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set temporary single CPU affinity before cgroup cpuset transition. #3923

Merged
merged 1 commit into from
Apr 16, 2024

Conversation

cclerget
Copy link
Contributor

This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset.

The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time application are relying on this deterministic behavior and use the first CPU core to run a slow thread while other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevents runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread.

@kolyshkin kolyshkin marked this pull request as draft June 30, 2023 20:49
@kolyshkin
Copy link
Contributor

Thanks for working on this. I changed it to draft until all the issues with the code and test cases are fixed, and left some minor comments.

@cclerget cclerget force-pushed the issue-3922 branch 16 times, most recently from 85f2d35 to 3e05b1c Compare July 3, 2023 14:01
@cclerget
Copy link
Contributor Author

cclerget commented Jul 3, 2023

Thanks @kolyshkin , it should be ready for review now

@cclerget cclerget marked this pull request as ready for review July 3, 2023 14:01
@lifubang
Copy link
Member

lifubang commented Jul 7, 2023

@cclerget Please rebase

@cclerget
Copy link
Contributor Author

@lifubang Done

}

// Iterates until it goes to the cgroup root path.
for path := filepath.Clean(cpusetPath); path != defaultCgroupRoot; path = filepath.Dir(path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK filepath.Clean is not needed here because m.Path returns cleaned path.

return ""
}

// Iterates until it goes to the cgroup root path.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe makes sense to add something like "needed for containers in which cpuset controller is not enabled -- in this case a parent cgroup is used" -- if my understanding is correct.

Comment on lines +161 to +162
// Close the pipe to not be blocked in the parent.
p.comm.closeChild()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a defer statement at the very beginning of this function -- isn't it enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not enough, the defer closes the parent side of the pipe, we also need to close the child side otherwise the process get stuck

Comment on lines +174 to +175
// Close the pipe to not be blocked in the parent.
p.comm.closeChild()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

// Use a goroutine to dedicate an OS thread.
go func() {
cpuSet := new(unix.CPUSet)
cpuSet.Zero()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK this is not needed, in Go everything is initialized to default values (0s in this case), and you've just instantiated a new CPUSet.

@@ -340,3 +340,168 @@ EOF
[ ${#lines[@]} -eq 1 ]
[[ ${lines[0]} = *"exec /run.sh: no such file or directory"* ]]
}

@test "runc exec with isolated cpus affinity temporary transition [cgroup cpuset]" {
requires root
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add cgroups_cpuset to the requires list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for other tests

Comment on lines 349 to 359
local all_cpus
all_cpus="$(cat /sys/devices/system/cpu/online)"

update_config ".linux.resources.cpu.cpus = \"$all_cpus\""

# set temporary isolated CPU affinity transition
update_config '.annotations += {"org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"}'

local mems
mems="$(cat /sys/devices/system/node/online 2>/dev/null || true)"
[[ -n $mems ]] && update_config ".linux.resources.cpu.mems = \"$mems\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you can separate this into a function (at least the mems and all_cpus part).

Comment on lines 413 to 414
# fix unbound variable in condition below
PLATFORM_ID=${PLATFORM_ID:-}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you use VERSION_ID instead, it looks easier?

@cclerget
Copy link
Contributor Author

cclerget commented Apr 3, 2024

@kolyshkin addressed your comments in 9eb05cc, will squash the commits after approval

The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76
in 5.7 has affected a deterministic scheduling behavior by distributing tasks
across CPU cores within a cgroups cpuset. It means that some runc operations
like `runc exec` might be impacted under some circumstances, by example when
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a review, just a question (but maybe you'll decide to clarify this further). Without looking at the code, and only at this piece of documentation (and at the commit message):

  • will this only improve the behavior of threads launched with runc exec? (if so, then the documentation should not be "some runc operations like")
  • otherwise, what other commands / situations / runc operations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76). The documentation so far only mentions runc exec
  • related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place - in this scenario, no runc exec session is involved)
  • adjust the commit message also, because "This handles a corner case when joining a container having all
    the processes running exclusively on isolated CPU cores to force
    the kernel to schedule runc process on the first CPU core within the
    cgroups cpuset." sounds ambiguous to me? "joining" as in connecting to the container with an exec session? (because joining could also mean "joining something together")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this only improve the behavior of threads launched with runc exec? (if so, then the documentation should not be "some runc operations like")

Yes it does affect runc exec operation only, I will fix that part, thanks !

otherwise, what other commands or situations will benefit from this patch? (i.e., in which cases do I want to annotate my pods with this annotation to revert to the behavior pre 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76)

No other commands will benefit from this patch.
For situations, by example when kubernetes is configured with a static policy (https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy) and --reserved-cpus contains the isolated CPUs, all pods running with resources limits/requests set to the same number of CPUs >= 1 will be granted X exclusive isolated CPUs, with such setup and pod spec, a container real-time application running on those isolated CPUs could create threads with SCHED_FIFO policy except on the first isolated CPU, such that things like kubernetes exec probes or kubectl exec will benefit from this patch to use the first isolated CPU without interfering with the real-time application threads. You can look at the original issue #3922 to have more context.

related to the above, will this change the behavior of any new process spawned in the container (an example would be a DPDK application using the vhost device to create kernel vhost threads, would those still float around freely, or be sent to the first CPU with the annotation in place

It won't change the behavior for the new processes, only processes spawned through runc exec are impacted by this annotation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for the clarification and thanks for your work on this!

@cclerget
Copy link
Contributor Author

cclerget commented Apr 5, 2024

@kolyshkin changes ok with you ?

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; please squash the commits

This handles a corner case when joining a container having all
the processes running exclusively on isolated CPU cores to force
the kernel to schedule runc process on the first CPU core within the
cgroups cpuset.

The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic
scheduling behavior by distributing tasks across CPU cores within the
cgroups cpuset. Some intensive real-time application are relying on this
deterministic behavior and use the first CPU core to run a slow thread
while other CPU cores are fully used by real-time threads with SCHED_FIFO
policy. Such applications prevents runc process from joining a container
when the runc process is randomly scheduled on a CPU core owned by a
real-time thread.

Introduces isolated CPU affinity transition OCI runtime annotation
org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore
the behavior during runc exec.

Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes.

Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>
@cclerget
Copy link
Contributor Author

Done thanks !

@kolyshkin kolyshkin merged commit 6a2813f into opencontainers:main Apr 16, 2024
38 checks passed
@MatthewHink
Copy link

Thank goodness this is finally merged! Really appreciate your help everyone!

@kolyshkin
Copy link
Contributor

The more I look into this the more I think I did a bad job reviewing this, and it needs to be redone in a different way:

  1. There's too much logic here figuring out which CPUs to use. Runc is a low level tool and is not supposed to be that "smart".
  2. What's worse, this logic is executed on every exec, making it slower.
  3. Some of the logic in (*setnsProcess).start is executed even if no annotation is set, thus making ALL execs slow.
  4. As pointed out in config: add annotation for exec isolated CPU affinity runtime-spec#1252, if we want to support this across different runtimes (e.g. crun and runc), this should not be an annotation, but rather a process parameter.

Let's fix this:

  1. Revert this PR.
  2. Move some of the functionality (determining which CPU to pin exec to) to upper level runtime (cri-o/containerd).
  3. Open a PR in runtime-spec proposing the changes to runc/crun.
  4. Open a PR to runc implementing that proposal.

@NeilHanlon
Copy link

while I recognize that this is a complex issue, I am unsure if reverting this is the best path forward, considering implementations of this are already deployed and in use.

This review has taken an exceedingly long time and we are now on the cusp of even more review.

It would be disappointing if we have to start this whole process over again.

@kolyshkin
Copy link
Contributor

while I recognize that this is a complex issue, I am unsure if reverting this is the best path forward, considering implementations of this are already deployed and in use.

This review has taken an exceedingly long time and we are now on the cusp of even more review.

It would be disappointing if we have to start this whole process over again.

I have to admit I did a sloppy job reviewing this; yet this is not in any of the released runc versions (and this is why it needs to be reverted now, before we officially release it).

Now, this has to be re-implemented in the right way, starting from runtime-spec (see opencontainers/runtime-spec#1253), then in runtimes (such as cri-o and containerd), when in low level runtimes (such as runc and crun).

Any help (esp in cri-o and containerd) is appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue joining cgroups cpuset with kernel scheduler task "random" distribution
10 participants