New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the error of runc doesn't work with go1.22 #4193
base: main
Are you sure you want to change the base?
Conversation
go 1.22.0 error msg:
|
This can be fixed by adding |
Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing). |
This comment was marked as outdated.
This comment was marked as outdated.
We also need to fix this for Go 1.22
I don't remember why I haven't switched to |
It seems that cgo may be broken with clone(2) in go1.22.0? |
Again, I can't repro locally. [kir@kir-tp1 cgoclone2]$ go version
go version go1.21.6 linux/amd64
[kir@kir-tp1 cgoclone2]$ go run main.go
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!
[kir@kir-tp1 cgoclone2]$ go1.22.0 version
go version go1.22.0 linux/amd64
[kir@kir-tp1 cgoclone2]$ go1.22.0 run main.go
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main! Maybe it's your kernel version @lifubang? Can you show |
@lifubang also if you can repro that (alas I can not), you can git bisect golang between 1.21.0 and 1.22.0. |
Note in CI it happens with Ubuntu 20.04 but not Ubuntu 22.04. Will try to repro in a VM. |
On Ubuntu 20.04, when running the binary compiled with go 1.22, I am seeing a SIGSEGV:
Can't yet figure out what's going on there; will continue tomorrow. |
@lifubang I did a bisect, here are the results: golang/go#65625 (comment) Will continue tomorrow. |
So, to summarize the investigation done there -- it's a glibc bug, in fact, two bugs:
These two bugs are apparently specific to glibc used by Ubuntu 20.04 (libc6 2.31-0ubuntu9.14) and maybe also Debian 10 (libc6 2.28-10+deb10u2), as I was able to reproduce on both. With Debian 10, it even prints error from free: For some reason I was unable to repro on older Fedora (F32, glibc-2.31-2.fc32, F33, glibc-2.32-10.fc33) and Debian 11 (libc6 2.31-7). The bad news is, every version of glibc has the bug 1 above, and https://go-review.googlesource.com/c/go/+/563379 may make it so go 1.22.x will fail runc init on every version of glibc. Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that |
Go 1.22 currently causes crashes on older Debian/Ubuntu systems. lxc/incus#497 golang/go#65625 opencontainers/runc#4193 Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
Go 1.22 currently causes crashes on older Debian/Ubuntu systems. lxc/incus#497 golang/go#65625 opencontainers/runc#4193 Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
👍 |
Rebasing this to re-run with Go 1.22.1 |
c54384f
to
5907889
Compare
Sorry @lifubang I've high-jacked your PR, needed to run it with Go 1.22.1 and added missing changes to |
OK, Go 1.22.1 makes no difference. I guess we have to disable Go 1.22 for now. |
3144923
to
aebd5b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NACK.
I don't see how switching to fork
will fix the actual issue. The problem is that pthread_self()
will return stale data after a clone(2)
-- can you point to the code in glibc which ensures that the data is valid? Does this patch fix the C version of the buggy program?
Yes, the original Go debugging attempt concluded the issue was CLONE_PARENT
but as I describe in #4233 the issue is more likely to be that we are breaking the rules of signal-safety(7)
by running non async-signal-unsafe
code (i.e. the Go runtime) in the child of a fork
or clone
. This results in stale thread-local data. So switching to fork
won't help.
IMHO, the correct fix is to add an execve
at the end of nsexec()
which re-execs runc
(say runc init-go
) and then change the Go side of runc init
to start with runc init-go
or whatever. This will ensure that we reset the memory state after a clone
and no longer violate signal-safety(7)
.
|
||
// Tell the kernel that runc wants to reap orphaned children of the | ||
// `runc init` process. | ||
if err := unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with the entire approach, but if you want to do this you need to have add a waitid(-1)
thread somewhere to make sure you actually reap the zombies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been waited the third init process before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runc/libcontainer/process_linux.go
Lines 406 to 411 in 09eacaa
process, err := os.FindProcess(childPid) | |
if err != nil { | |
return err | |
} | |
p.cmd.Process = process | |
p.process.ops = p |
runc/libcontainer/process_linux.go
Line 279 in 09eacaa
_, _ = p.wait() |
runc/libcontainer/process_linux.go
Line 328 in 09eacaa
if _, werr := p.wait(); err == nil { |
runc/libcontainer/process_linux.go
Line 801 in 09eacaa
if _, werr := p.wait(); err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p.wait
doesn't do a waitid(-1)
AFAIK, it only waits for the specific child process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, we need waitpid(-1)
here, we should not only wait runc init
, but also processes created by runc init
, for example, newuidmap
, newgidmap
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but also processes created by runc init, for example, newuidmap, newgidmap etc.
try_mapping_tool
already does a waitpid
. Now that I think about it, we should do the wait in nsexec.c
and not use PR_SET_CHILD_SUBREAPER
(see my below comment). Setting PR_SET_CHILD_SUBREAPER
implicitly is just asking for trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try_mapping_tool
already does awaitpid
.
Yes, but if stage-0 process suddenly dead, I think runc should reap all these process.
As we all know, PR_SET_CHILD_SUBREAPER means that runc has the ability to reap grand child processes, but if the child process has not exited, runc will not reap it's grand child process.
I think the only problem is that we should tell the libct/nsexec users to add unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) in their system like in runc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Right, stage-2 is not going to be a child of the original runc so we can't wait for it there... That's unfortunate...
The reason I don't like adding PR_SET_CHILD_SUBREAPER
is that it's a process-wide setting so other users of nsenter
(for example, as a server spawning containers) that spawn other processes will now have to deal with subreaper semantics (and depending on their design they might really not want that).
(And, given that this only fixes the surface-level issue with glibc it feels overkill to do PR_SET_CHILD_SUBREAPER
... I suspect we will eventually need to work around other issues and will end up doing an execve
so we might as well rip that band-aid off now...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further think, except the three runc init
process, what other processes should be reaped by runc process?
So there are two questions need to consider:
- Do we really need
waitid(-1)
? Although addwaitid(-1)
has no hurt; - Do we need to open a new go routine to do reap before container terminated?
This comment was marked as outdated.
This comment was marked as outdated.
09eacaa
to
6052007
Compare
https://public-inbox.org/libc-alpha/c432ff40-edf2-1ebe-f3a7-de76fbfdd252@redhat.com/ |
For those reading at home, this is bminor/glibc@c579f48. So, they removed the PID cache entirely and removed their explicit TID storage for After looking into it a bit more, the reason So, the issue is not with There is a way for us to keep using I guess
For reference, I think the following test is better for checking the root cause of the issue (is the cached TID wrong?). typedef struct pthread {
/* Based on output from gdb -- this might change on different machines. */
char __padding[720];
pid_t tid;
/* ... snip ... */
} pthread;
void __attribute__((constructor)) init(void)
{
nsexec();
pthread_t self = pthread_self();
pthread *THREAD_SELF = (pthread *)self;
printf("cached tid: %d ; actual tid: %d\n", THREAD_SELF->tid, gettid());
pthread_attr_t attr;
int ret = pthread_getattr_np(self, &attr);
if (ret != 0) {
printf("pthread_getattr_np: %s\n", strerror(ret));
/* Try to destroy attr anyway. Bad idea, because getattr fails, but this is what Go does. */
pthread_attr_destroy(&attr);
abort();
}
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like doing it with fork(2)
is a better idea for the moment. I still think we need to switch to execve
to make sure this is safe though...
As for the changes themselves, there are a couple of cleanups that we can do now -- and we shouldn't use PR_SET_CHILD_SUBREAPER
like this, instead we should just do a waitpid
in the stage-0 of nsexec.c
.
|
||
// Tell the kernel that runc wants to reap orphaned children of the | ||
// `runc init` process. | ||
if err := unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but also processes created by runc init, for example, newuidmap, newgidmap etc.
try_mapping_tool
already does a waitpid
. Now that I think about it, we should do the wait in nsexec.c
and not use PR_SET_CHILD_SUBREAPER
(see my below comment). Setting PR_SET_CHILD_SUBREAPER
implicitly is just asking for trouble.
libcontainer/process_linux.go
Outdated
err := p.cmd.Process.Kill() | ||
if _, werr := p.wait(); err == nil { | ||
|
||
if _, werr := unix.Wait4(-1, nil, 0, nil); werr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, I don't think we should to do this here -- we do need to add a wait
in stage-0 instead. At the moment we send stage1_pid
over the pipe, but instead we can add the wait after we finish the "stage-1 synchronisation loop" and remove the stage1_pid
JSON logic entirely.
Using PR_SET_CHILD_SUBREAPER
can lead to some other issues (other users of libcontainer that have other children will have unexpected reparenting behaviour, and if we do a wait here we might wait for a different child than we wanted).
(Also if we're waiting for more than one process you need to do wait in a loop -- and if you want to be sure you got all of them you need to do it until -ECHILD
. But there might be other children so we can't do this and it would lead to all sorts of other issues...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do need to add a
wait
in stage-0 instead.
There are two problems:
- If the
stage-0
suddenly dead, how to reapstage-1
andstage-2
? - For
runc exec -t
, thestage-0
process has exited after the container started, if we wait it instage-0
, runc process will have no way to know whether the container has exited or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if we're waiting for more than one process you need to do wait in a loop -- and if you want to be sure you got all of them you need to do it until
-ECHILD
.
Yes, thanks.
try_mapping_tool already does a waitpid.
Yes, but if stage-0
process suddenly dead, I think runc should reap all these process.
As we all know, PR_SET_CHILD_SUBREAPER
means that runc has the ability to reap grand child processes, but if the child process has not exited, runc will not reap it's grand child process.
I think the only problem is that we should tell the libct/nsexec
users to add unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0)
in their system like in runc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://grep.app/search?q=github.com/opencontainers/runc/libcontainer/nsenter&filter[lang][0]=Go
It seems that there are 5 public repos using libct/nsenter
.
c8f39a3
to
7eac8b8
Compare
👍
But maybe this will cause the issues like |
We still have full capabilities at the beginning of stage-2 (both with and without user namespaces) and haven't applied any LSM labels or anything like that. I wouldn't expect there to be any issues. |
This comment was marked as outdated.
This comment was marked as outdated.
Do you think moving the c code in |
I think maybe no, because after clone(2), it has already in go routine, it can’t longjmp to C. So, it seems that there is no other way to fix this issue? |
Signed-off-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
As the description in opencontainers#4233, there is a bug in glibc, pthread_self() will return wrong info after we do `clone(CLONE_PARENT)` in libct/nsenter, it will cause runc can't work in `go 1.22.*`. So we use fork(2) to replace clone(2) in libct/nsenter, but there is a double-fork in nsenter, so we need to use `PR_SET_CHILD_SUBREAPER` to let runc can reap grand child process in libct/nsenter. Signed-off-by: lifubang <lifubang@acmcoder.com>
This reverts commit ac31da6. Signed-off-by: lifubang <lifubang@acmcoder.com>
This reverts commit e377e16. Signed-off-by: lifubang <lifubang@acmcoder.com>
7eac8b8
to
df04ed4
Compare
This patch also works, while still allowing us to use diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index c771ac7e1165..319899bd9b71 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -15,6 +15,7 @@
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
+#include <pthread.h> /* _only_ used for pthread_self() in debug log */
#include <sys/ioctl.h>
#include <sys/prctl.h>
@@ -319,7 +320,41 @@ static int clone_parent(jmp_buf *env, int jmpval)
.jmpval = jmpval,
};
- return clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD, &ca);
+ /*
+ * Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),
+ * glibc no longer updates the TLS state containing the current process
+ * tid after clone(2). This results in stale TIDs being used when Go
+ * 1.22 and later call pthread_gettattr_np(pthread_self()), resulting
+ * in crashes on ancient glibcs and errors on newer glibcs.
+ *
+ * Luckily, because the same address is used for CLONE_PARENT_SETTID,
+ * we can poke around in glibc's internal cache by getting the address
+ * using PR_GET_TID_ADDRESS (only available in Linux >= 3.5, with
+ * CONFIG_CHECKPOINT_RESTORE=y) and then overwriting it with
+ * CLONE_CHILD_SETTID. CLONE_CHILD_CLEARTID is set to allow descendant
+ * PR_GET_TID_ADDRESS calls to work, as well as matching what glibc
+ * does in arch_fork().
+ *
+ * Yes, this is pretty horrific, but the core issue here is that we
+ * need to run Go code ("runc init") in the child after fork(), which
+ * is not allowed by glibc (see signal-safety(7)). We cannot exec to
+ * solve the problem because we are in a security critical situation
+ * here, and doing an exec would allow for container escapes (obvious
+ * issues include that the shared libraries loaded from a re-exec would
+ * come from the container, and doing an exec here would clear the bit
+ * that makes non-dumpable flags effective for userns containers with
+ * CAP_SYS_PTRACE).
+ */
+ pid_t *tid_addr = NULL;
+ if (prctl(PR_GET_TID_ADDRESS, &tid_addr) < 0)
+ /* what should we do here... */;
+ write_log(DEBUG, "nsenter clone: get_tid_address gave us %p (pthread_self=%p)", tid_addr, (void *) pthread_self());
+ if (!tid_addr || *tid_addr != gettid())
+ write_log(WARNING, "nsenter clone: glibc private tid address is wrong: *%p %d != gettid() %d", tid_addr, tid_addr ? *tid_addr : -1, gettid());
+
+ return clone(child_func, ca.stack_ptr,
+ CLONE_PARENT | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, &ca,
+ NULL /* parent_tid */ , NULL /* tls */ , tid_addr);
}
/* Returns the clone(2) flag for a namespace, given the name of a namespace. */ @kolyshkin wdyt? |
@lifubang I can also take a look next week at whether we can somehow remove stage-1 so that we don't need a grandchild (which would remove the need for |
This PR needs some refactor work, so convert it to draft state. Line 184 in df04ed4
|
As the description in #4233, there is a bug in glibc, pthread_self()
will return wrong info after we do
clone(CLONE_PARENT)
in libct/nsenter,it will cause runc can't work in
go 1.22.*
. So we use fork(2) to replaceclone(2) in libct/nsenter, but there is a double-fork in nsenter, so we
need to use
PR_SET_CHILD_SUBREAPER
to let runc can reap grand childprocess in libct/nsenter.
Fix #4233