Fix the error of runc doesn't work with go1.22 #4193

Draft
wants to merge 5 commits into
base: main

lifubang
Member

@lifubang lifubang commented Feb 7, 2024

As described in #4233, there is a bug in glibc: pthread_self() returns
wrong info after we do clone(CLONE_PARENT) in libct/nsenter, which
prevents runc from working with go 1.22.*. So we use fork(2) instead of
clone(2) in libct/nsenter. Because nsenter does a double fork, we
need PR_SET_CHILD_SUBREAPER so that runc can reap grandchild
processes in libct/nsenter.

Fix #4233

@lifubang
Member Author

lifubang commented Feb 7, 2024

go 1.22.0 error msg:

DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:202 libcontainer/dmz.IsCloned() F_GET_SEALS on /proc/self/exe failed: invalid argument 
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:177 libcontainer/dmz.CloneBinary() cloning runc-dmz binary (8736 bytes)         
DEBU[0000]libcontainer/container_linux.go:537 libcontainer.(*Container).newParentProcess() runc-dmz: using runc-dmz                     
DEBU[0000] nsexec[42599]: => nsexec container setup     
DEBU[0000] nsexec-0[42599]: ~> nsexec stage-0           
DEBU[0000] nsexec-0[42599]: spawn stage-1               
DEBU[0000] nsexec-0[42599]: -> stage-1 synchronisation loop 
DEBU[0000] nsexec-1[42600]: ~> nsexec stage-1           
DEBU[0000] nsexec-1[42600]: unshare remaining namespaces 
DEBU[0000] nsexec-1[42600]: spawn stage-2               
DEBU[0000] nsexec-1[42600]: request stage-0 to forward stage-2 pid (42601) 
DEBU[0000] nsexec-0[42599]: stage-1 requested pid to be forwarded 
DEBU[0000] nsexec-0[42599]: forward stage-1 (42600) and stage-2 (42601) pids to runc 
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2               
DEBU[0000] nsexec-1[42600]: signal completion to stage-0 
DEBU[0000] nsexec-1[42600]: <~ nsexec stage-1           
DEBU[0000]libcontainer/process_linux.go:457 libcontainer.(*initProcess).goCreateMountSources.func1() mount source thread: successfully running in container mntns 
DEBU[0000] nsexec-0[42599]: stage-1 complete            
DEBU[0000] nsexec-0[42599]: <- stage-1 synchronisation loop 
DEBU[0000] nsexec-0[42599]: -> stage-2 synchronisation loop 
DEBU[0000] nsexec-0[42599]: signalling stage-2 to run   
DEBU[0000] nsexec-2[1]: signal completion to stage-0    
DEBU[0000] nsexec-2[1]: <= nsexec container setup       
DEBU[0000] nsexec-2[1]: booting up go runtime ...       
DEBU[0000] nsexec-0[42599]: stage-2 complete            
DEBU[0000] nsexec-0[42599]: <- stage-2 synchronisation loop 
DEBU[0000] nsexec-0[42599]: <~ nsexec stage-0           
DEBU[0000]libcontainer/sync.go:127 libcontainer.doReadSync() reading sync                                 
DEBU[0000] sync pipe closed                             
DEBU[0000] mount source thread: closing thread: context canceled 
ERRO[0000] runc run failed: unable to start container process: error during container init: procReady not received

@kolyshkin
Contributor

Now that Go v1.22.0 has been released, there is no 1.20.x in https://go.dev/dl/?mode=json anymore.

This can be fixed by adding &include=all (i.e. use https://go.dev/dl/?mode=json&include=all). I'll open a PR.

@kolyshkin
Contributor

Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).

@kolyshkin
Contributor

We also need to fix this for Go 1.22

# (in test file tests/integration/spec.bats, line 37)
#   `GO111MODULE=auto go get github.com/xeipuuv/gojsonschema' failed
# runc spec (status=0):
#
# Cloning into 'runtime-spec'...
# HEAD is now at 4fec88f merge #1219 into main
# go: go.mod file not found in current directory or any parent directory.
# 	'go get' is no longer supported outside a module.
# 	To build and install a command, use 'go install' with a version,
# 	like 'go install example.com/cmd@latest'
# 	For more information, see https://golang.org/doc/go-get-install-deprecation
# 	or run 'go help get' or 'go help install'.

I don't remember why I haven't switched to go install, guess it's not as easy as it seems.

@lifubang
Member Author

lifubang commented Feb 9, 2024

> Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).

It seems that cgo may be broken with clone(2) in go1.22.0?
golang/go#65625
PTAL

@kolyshkin
Contributor

> > Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).
>
> It seems that cgo may be broken with clone(2) in go1.22.0? golang/go#65625 PTAL

Again, I can't repro locally.

[kir@kir-tp1 cgoclone2]$ go version
go version go1.21.6 linux/amd64
[kir@kir-tp1 cgoclone2]$ go run main.go 
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!
[kir@kir-tp1 cgoclone2]$ go1.22.0 version
go version go1.22.0 linux/amd64
[kir@kir-tp1 cgoclone2]$ go1.22.0 run main.go 
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!

Maybe it's your kernel version @lifubang? Can you show uname -a?

@kolyshkin
Contributor

@lifubang also if you can repro that (alas I can not), you can git bisect golang between 1.21.0 and 1.22.0.

@kolyshkin
Contributor

Note in CI it happens with Ubuntu 20.04 but not Ubuntu 22.04. Will try to repro in a VM.

@kolyshkin
Contributor

On Ubuntu 20.04, when running the binary compiled with go 1.22, I am seeing a SIGSEGV:

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xf8} ---

Can't yet figure out what's going on there; will continue tomorrow.

@kolyshkin
Contributor

@lifubang I did a bisect, here are the results: golang/go#65625 (comment)

Will continue tomorrow.

@kolyshkin
Contributor

> It seems that cgo may be broken with clone(2) in go1.22.0?
> golang/go#65625

So, to summarize the investigation done there -- it's a glibc bug, in fact, two bugs:

  1. pthread_self() returns wrong info after we do what we do in libct/nsenter
  2. pthread_getattr_np(pthread_self(), &attr) (which Go 1.22 calls internally) does a NULL pointer dereference, so the app gets SIGABRT.

These two bugs are apparently specific to glibc used by Ubuntu 20.04 (libc6 2.31-0ubuntu9.14) and maybe also Debian 10 (libc6 2.28-10+deb10u2), as I was able to reproduce on both. With Debian 10, it even prints error from free: free(): invalid pointer, maybe due to some extra Debian-specific patches, but still gets SIGABRT.

For some reason I was unable to repro on older Fedora (F32 with glibc-2.31-2.fc32, F33 with glibc-2.32-10.fc33) and Debian 11 (libc6 2.31-7).

The bad news is, every version of glibc has bug 1 above, and https://go-review.googlesource.com/c/go/+/563379 may make it so go 1.22.x will fail runc init on every version of glibc.

Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.

stgraber added a commit to zabbly/incus that referenced this pull request Feb 16, 2024
Go 1.22 currently causes crashes on older Debian/Ubuntu systems.

lxc/incus#497
golang/go#65625
opencontainers/runc#4193

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
@AkihiroSuda
Member

> Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.

👍

@kolyshkin
Contributor

Rebasing this to re-run with Go 1.22.1

@kolyshkin kolyshkin force-pushed the feat-go-1.21-1.22 branch 2 times, most recently from c54384f to 5907889 Compare March 28, 2024 01:16
@kolyshkin
Contributor

Sorry @lifubang, I've hijacked your PR: I needed to run it with Go 1.22.1 and added missing changes to go.sum to fix failing CI (https://github.com/opencontainers/runc/actions/runs/8460901105/job/23179867537)

@kolyshkin
Contributor

OK, Go 1.22.1 makes no difference. I guess we have to disable Go 1.22 for now.

Member

@cyphar cyphar left a comment

NACK.

I don't see how switching to fork will fix the actual issue. The problem is that pthread_self() will return stale data after a clone(2) -- can you point to the code in glibc which ensures that the data is valid? Does this patch fix the C version of the buggy program?

Yes, the original Go debugging attempt concluded the issue was CLONE_PARENT, but as I describe in #4233 the issue is more likely to be that we are breaking the rules of signal-safety(7) by running async-signal-unsafe code (i.e. the Go runtime) in the child of a fork or clone. This results in stale thread-local data. So switching to fork won't help.

IMHO, the correct fix is to add an execve at the end of nsexec() which re-execs runc (say runc init-go) and then change the Go side of runc init to start with runc init-go or whatever. This will ensure that we reset the memory state after a clone and no longer violate signal-safety(7).
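
To make the execve suggestion concrete, here is a minimal C sketch of the general fork-then-exec pattern. It is not runc code: `spawn_fresh` is a hypothetical helper invented for illustration, and an actual fix would re-exec runc itself (for example as the `runc init-go` suggested above) from the end of nsexec(). What it shows is that the child calls only async-signal-safe functions (execve, _exit) before its memory image is replaced, so no stale libc thread state is carried over.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of runc): fork, then immediately execve()
 * in the child so that it starts from a fresh memory image.
 * Returns the child's exit status, or -1 on error or abnormal termination.
 */
int spawn_fresh(const char *path, char *const argv[], char *const envp[])
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: only async-signal-safe calls from here on. */
		execve(path, argv, envp);
		_exit(127);	/* only reached if execve() failed */
	}

	/* Parent: wait for this specific child only. */
	int status;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass the path of the binary to re-execute plus its argument vector; everything the new program needs has to be handed over through argv, the environment, or inherited file descriptors, since no in-memory state survives the exec.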


// Tell the kernel that runc wants to reap orphaned children of the
// `runc init` process.
if err := unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0); err != nil {
Member

I disagree with the entire approach, but if you want to do this you need to add a waitid(-1) thread somewhere to make sure you actually reap the zombies.

Member Author

The third init process has already been waited for earlier.

Member Author

process, err := os.FindProcess(childPid)
if err != nil {
return err
}
p.cmd.Process = process
p.process.ops = p

_, _ = p.wait()

if _, werr := p.wait(); err == nil {

if _, werr := p.wait(); err == nil {

Member

p.wait doesn't do a waitid(-1) AFAIK, it only waits for the specific child process?

Member Author

You are right, we need waitpid(-1) here: we should wait not only for runc init, but also for processes created by runc init, for example newuidmap, newgidmap, etc.

Member

> but also processes created by runc init, for example, newuidmap, newgidmap etc.

try_mapping_tool already does a waitpid. Now that I think about it, we should do the wait in nsexec.c and not use PR_SET_CHILD_SUBREAPER (see my below comment). Setting PR_SET_CHILD_SUBREAPER implicitly is just asking for trouble.

Member Author

> try_mapping_tool already does a waitpid.

Yes, but if the stage-0 process suddenly dies, I think runc should reap all these processes.
As we all know, PR_SET_CHILD_SUBREAPER means that runc has the ability to reap grandchild processes, but if the child process has not exited, runc will not reap its grandchild processes.

I think the only problem is that we should tell the libct/nsexec users to add unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) to their programs, as runc does.
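
For readers following along, the subreaper semantics discussed here can be illustrated with a small, self-contained C sketch. It is not taken from runc: `reap_with_subreaper` is a hypothetical name, and the sketch assumes Linux 3.4 or later and a calling process that has no other children. It double-forks, lets the intermediate child exit first, and then counts how many descendants the original process can reap.

```c
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: become a child subreaper, double-fork, and reap.
 * Returns the number of descendants reaped, or -1 on error.
 */
int reap_with_subreaper(void)
{
	/* Process-wide setting: orphaned descendants get reparented to us. */
	if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) < 0)
		return -1;

	pid_t child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		pid_t grandchild = fork();
		if (grandchild == 0) {
			/* Grandchild: outlive the intermediate child briefly. */
			usleep(100 * 1000);
			_exit(0);
		}
		/* Intermediate child exits at once, orphaning the grandchild. */
		_exit(grandchild < 0 ? 1 : 0);
	}

	int reaped = 0;
	for (;;) {
		pid_t pid = waitpid(-1, NULL, 0);
		if (pid > 0) {
			reaped++;
			continue;
		}
		if (errno == EINTR)
			continue;
		break;	/* ECHILD: nothing left to reap */
	}
	return reaped;
}
```

In a fresh process this should return 2: the direct child and the orphaned grandchild. Without the prctl call the grandchild would be reparented to init (or the nearest ancestor subreaper) and only the direct child could be reaped. The flag also stays set for the whole process afterwards, which is the process-wide side effect that makes it awkward for library users.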

Member

Ah. Right, stage-2 is not going to be a child of the original runc so we can't wait for it there... That's unfortunate...

The reason I don't like adding PR_SET_CHILD_SUBREAPER is that it's a process-wide setting so other users of nsenter (for example, as a server spawning containers) that spawn other processes will now have to deal with subreaper semantics (and depending on their design they might really not want that).

(And, given that this only fixes the surface-level issue with glibc it feels overkill to do PR_SET_CHILD_SUBREAPER... I suspect we will eventually need to work around other issues and will end up doing an execve so we might as well rip that band-aid off now...)

Member Author

Thinking further: apart from the three runc init processes, what other processes should be reaped by the runc process?
So there are two questions to consider:

  1. Do we really need waitid(-1)? Adding waitid(-1) does no harm, though;
  2. Do we need to start a new goroutine to do the reaping before the container has terminated?

@lifubang
Member Author

lifubang commented Apr 4, 2024

It seems that there is no error for fork.

https://public-inbox.org/libc-alpha/c432ff40-edf2-1ebe-f3a7-de76fbfdd252@redhat.com/

@cyphar
Member

cyphar commented Apr 4, 2024

> It seems that there is no error for fork.
>
> https://public-inbox.org/libc-alpha/c432ff40-edf2-1ebe-f3a7-de76fbfdd252@redhat.com/

For those reading at home, this is bminor/glibc@c579f48. So, they removed the PID cache entirely and removed their explicit TID storage for clone.

After looking into it a bit more, the reason fork works is that glibc's fork implementation actually uses clone with CLONE_CHILD_SETTID so that the kernel will update the TLS value of the tid. See arch_fork and __tls_init_tp. (pthread-controlled threads are a little more complicated because they use CLONE_SETTLS.)

So, the issue is not with CLONE_PARENT, nor with clone(2) itself. The issue is that glibc only bothers to fill the tid field of pthread_self() if you are using fork(2). It seems strange they kept the TID cache -- if they had removed that too then clone(2) would still work...

There is a way for us to keep using clone(2) but it's quite nasty -- prctl(PR_GET_TID_ADDRESS) lets us get access to &THREAD_SELF->tid, which would let us keep using clone(2) with CLONE_CHILD_SETTID. Unfortunately that requires CONFIG_CHECKPOINT_RESTORE and is also quite ugly (and possibly fragile -- if glibc doesn't need CLONE_CHILD_SETTID anymore, this will break)...

I guess fork(2) is nicer than PR_GET_TID_ADDRESS. However it should be noted that glibc doesn't have to allow any pthread_* code to work after a fork(2) or clone(2). It's possible for a future glibc version to completely break some other part of runc init, because we are not following the rules in signal-safety(7). Just because fork(2) happens to work doesn't mean we solved the problem completely. The actual fix (as discussed in #4233) is to use execve to make sure we don't do anything to interfere with Go.

> It seems that there is no error for fork.

For reference, I think the following test is better for checking the root cause of the issue (is the cached TID wrong?).

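#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed to be defined elsewhere in the reproducer (the clone-based nsexec under test). */
extern void nsexec(void);

typedef struct pthread {
	/* Based on output from gdb -- this might change on different machines. */
	char __padding[720];

	pid_t tid;

	/* ... snip ... */
} pthread;

void __attribute__((constructor)) init(void)
{
	nsexec();
	pthread_t self = pthread_self();

	/* Peek into glibc's private struct pthread to read the cached TID. */
	pthread *THREAD_SELF = (pthread *)self;
	printf("cached tid: %d ; actual tid: %d\n", THREAD_SELF->tid, gettid());

	pthread_attr_t attr;
	int ret = pthread_getattr_np(self, &attr);
	if (ret != 0) {
		printf("pthread_getattr_np: %s\n", strerror(ret));
		/* Try to destroy attr anyway. Bad idea, because getattr fails, but this is what Go does. */
		pthread_attr_destroy(&attr);
		abort();
	}
}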

Member

@cyphar cyphar left a comment

It seems like doing it with fork(2) is a better idea for the moment. I still think we need to switch to execve to make sure this is safe though...

As for the changes themselves, there are a couple of cleanups that we can do now -- and we shouldn't use PR_SET_CHILD_SUBREAPER like this, instead we should just do a waitpid in the stage-0 of nsexec.c.

err := p.cmd.Process.Kill()
if _, werr := p.wait(); err == nil {

if _, werr := unix.Wait4(-1, nil, 0, nil); werr != nil {
Member

Now that I think about it, I don't think we should do this here -- we do need to add a wait in stage-0 instead. At the moment we send stage1_pid over the pipe, but instead we can add the wait after we finish the "stage-1 synchronisation loop" and remove the stage1_pid JSON logic entirely.

Using PR_SET_CHILD_SUBREAPER can lead to some other issues (other users of libcontainer that have other children will have unexpected reparenting behaviour, and if we do a wait here we might wait for a different child than we wanted).

(Also if we're waiting for more than one process you need to do wait in a loop -- and if you want to be sure you got all of them you need to do it until -ECHILD. But there might be other children so we can't do this and it would lead to all sorts of other issues...)
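
As a rough illustration of the "wait in a loop until ECHILD" point, the following C sketch reaps every child of the calling process. `reap_all_children` is a hypothetical helper, not something from runc or nsexec.c. Note that in userspace waitpid(2) reports this condition as a return value of -1 with errno set to ECHILD, rather than returning -ECHILD directly.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: reap children until none are left.
 * Returns the number of children reaped, or -1 on an unexpected error.
 */
int reap_all_children(void)
{
	int reaped = 0;

	for (;;) {
		pid_t pid = waitpid(-1, NULL, 0);
		if (pid > 0) {
			reaped++;
			continue;
		}
		if (errno == EINTR)
			continue;	/* interrupted by a signal, retry */
		if (errno == ECHILD)
			return reaped;	/* no children remain */
		return -1;
	}
}
```

The sketch also makes the hazard visible: waitpid(-1) cannot tell "our" children from children spawned by other parts of the same process, so a loop like this would steal exit statuses from any other code that expects to wait for its own children.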

Member Author

> we do need to add a wait in stage-0 instead.

There are two problems:

  1. If stage-0 suddenly dies, how do we reap stage-1 and stage-2?
  2. For runc exec -t, the stage-0 process has already exited once the container has started; if we wait in stage-0, the runc process will have no way to know whether the container has exited or not.

Member Author

> Also if we're waiting for more than one process you need to do wait in a loop -- and if you want to be sure you got all of them you need to do it until -ECHILD.

Yes, thanks.

@lifubang lifubang force-pushed the feat-go-1.21-1.22 branch 3 times, most recently from c8f39a3 to 7eac8b8 Compare April 4, 2024 13:41
@lifubang
Member Author

lifubang commented Apr 4, 2024

> It seems like doing it with fork(2) is a better idea for the moment.

👍

> I still think we need to switch to execve to make sure this is safe though...

But maybe this will cause issues like the runc-dmz ones if we use execve in stage-2?

@cyphar
Member

cyphar commented Apr 5, 2024

> But maybe this will cause issues like the runc-dmz ones if we use execve in stage-2?

We still have full capabilities at the beginning of stage-2 (both with and without user namespaces) and haven't applied any LSM labels or anything like that. I wouldn't expect there to be any issues.

@lifubang
Member Author

Do you think moving the C code in stage-0 and stage-2 to Go could fix this issue? I don't know whether setjmp and longjmp can work in Go, and can they be used together with C?
Is it worth a try?

@lifubang
Member Author

lifubang commented Apr 11, 2024

> and can they be used together with C?

I think probably not: after clone(2) we are already running in a goroutine, so we can't longjmp back to C.

So, it seems that there is no other way to fix this issue?

Signed-off-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
As described in opencontainers#4233, there is a bug in glibc: pthread_self()
returns wrong info after we do `clone(CLONE_PARENT)` in libct/nsenter, which
prevents runc from working with `go 1.22.*`. So we use fork(2) instead of
clone(2) in libct/nsenter. Because nsenter does a double fork, we
need `PR_SET_CHILD_SUBREAPER` so that runc can reap grandchild
processes in libct/nsenter.

Signed-off-by: lifubang <lifubang@acmcoder.com>
This reverts commit ac31da6.

Signed-off-by: lifubang <lifubang@acmcoder.com>
This reverts commit e377e16.

Signed-off-by: lifubang <lifubang@acmcoder.com>
@cyphar
Member

cyphar commented Apr 13, 2024

This patch also works, while still allowing us to use CLONE_PARENT. Yes, I'm sure we agree it's not lovely, but IMHO using fork() is depending on glibc internals just as much as this is. If glibc stops using CLONE_CHILD_CLEARTID then fork() will also stop working. The only downside of this approach is that it only works with CONFIG_CHECKPOINT_RESTORE=y but I suspect most people running with containers have that enabled.

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index c771ac7e1165..319899bd9b71 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -15,6 +15,7 @@
 #include <stdbool.h>
 #include <string.h>
 #include <unistd.h>
+#include <pthread.h> /* _only_ used for pthread_self() in debug log */
 
 #include <sys/ioctl.h>
 #include <sys/prctl.h>
@@ -319,7 +320,41 @@ static int clone_parent(jmp_buf *env, int jmpval)
 		.jmpval = jmpval,
 	};
 
-	return clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD, &ca);
+	/*
+	 * Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),
+	 * glibc no longer updates the TLS state containing the current process
+	 * tid after clone(2). This results in stale TIDs being used when Go
+	 * 1.22 and later call pthread_getattr_np(pthread_self()), resulting
+	 * in crashes on ancient glibcs and errors on newer glibcs.
+	 *
+	 * Luckily, because the same address is used for CLONE_PARENT_SETTID,
+	 * we can poke around in glibc's internal cache by getting the address
+	 * using PR_GET_TID_ADDRESS (only available in Linux >= 3.5, with
+	 * CONFIG_CHECKPOINT_RESTORE=y) and then overwriting it with
+	 * CLONE_CHILD_SETTID. CLONE_CHILD_CLEARTID is set to allow descendant
+	 * PR_GET_TID_ADDRESS calls to work, as well as matching what glibc
+	 * does in arch_fork().
+	 *
+	 * Yes, this is pretty horrific, but the core issue here is that we
+	 * need to run Go code ("runc init") in the child after fork(), which
+	 * is not allowed by glibc (see signal-safety(7)). We cannot exec to
+	 * solve the problem because we are in a security critical situation
+	 * here, and doing an exec would allow for container escapes (obvious
+	 * issues include that the shared libraries loaded from a re-exec would
+	 * come from the container, and doing an exec here would clear the bit
+	 * that makes non-dumpable flags effective for userns containers with
+	 * CAP_SYS_PTRACE).
+	 */
+	pid_t *tid_addr = NULL;
+	if (prctl(PR_GET_TID_ADDRESS, &tid_addr) < 0)
+		/* what should we do here... */;
+	write_log(DEBUG, "nsenter clone: get_tid_address gave us %p (pthread_self=%p)", tid_addr, (void *) pthread_self());
+	if (!tid_addr || *tid_addr != gettid())
+		write_log(WARNING, "nsenter clone: glibc private tid address is wrong: *%p %d != gettid() %d", tid_addr, tid_addr ? *tid_addr : -1, gettid());
+
+	return clone(child_func, ca.stack_ptr,
+		     CLONE_PARENT | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, &ca,
+		     NULL /* parent_tid */ , NULL /* tls */ , tid_addr);
 }
 
 /* Returns the clone(2) flag for a namespace, given the name of a namespace. */

@kolyshkin wdyt?

@cyphar
Member

cyphar commented Apr 13, 2024

@lifubang I can also take a look next week at whether we can somehow remove stage-1 so that we don't need a grandchild (which would remove the need for PR_SET_CHILD_SUBREAPER).

@lifubang
Member Author

This PR needs some refactoring, so I am converting it to a draft.

runc/exec.go

Line 184 in df04ed4

enableSubreaper: false,

Labels
backport/todo/1.1 A PR in main branch which needs to be backported to release-1.1
Development

Successfully merging this pull request may close these issues.

runc doesn't work with go1.22
4 participants