nsenter: overwrite glibc's internal tid cache on clone() #4247

cyphar · 2024-04-14T17:46:53Z

An alternative to #4193.

I will also take a look in a separate PR as to whether we can remove a stage from our three-stage structure (such as moving the mapping requests to the Go parent, maybe?) to remove the need to PR_SET_CHILD_SUBREAPER in #4193. If we do it that way then switching to fork() would be far less of an issue.

Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons.

Fixes #4233
Signed-off-by: Aleksa Sarai cyphar@cyphar.com

kolyshkin

While looking quite scary, this resolves my concern about systems with CONFIG_CHECKPOINT_RESTORE not set, so SGTM.

Was thinking of using a list of known offsets first, before resorting to linear scan. Can try different distros to get those offsets (I'm pretty sure there are only 2 or 3).

512 as a upper size limit is also looking scary (withput haven't checked the actual struct pthread size).

Maybe move the kludge to a separate function, and #ifdef it out unless we're using GLIBC.

kolyshkin · 2024-04-14T19:58:54Z

Since glibc 2.25,

So maybe we can ifdef the kludge with something like __GLIBC_PREREQ(2, 25)

cyphar · 2024-04-15T07:31:07Z

Was thinking of using a list of known offsets first, before resorting to linear scan. Can try different distros to get those offsets (I'm pretty sure there are only 2 or 3).

Yeah, I was thinking of that too but wasn't sure how variable all the structures are before PTHREAD_SELF->tid. However, looking at it again, apparently tcbhead_t's size is required to stay the same because of AddressSanitizer, so there might only be a couple of offsets we need in almost all cases. Then again, tcbhead_t is arch-specific (and the AddressSanitizer comment is only in sysdeps/x86_64/nptl/tls.h) -- though since quite a few have the same layout there's probably less than 10 we need...

512 as a upper size limit is also looking scary (withput haven't checked the actual struct pthread size).

Yeah, unfortunately we can't be sure of the size of struct pthread. On my machine the tid is at pthread_self+0x2d0 (index 180) so my initial guess of 256 felt a little small. But looking at the definitions of tcbhead_t, it seems amd64 has the largest structure of any architecture and so 180 is probably about the largest offset we could ever get.

Maybe move the kludge to a separate function, and #ifdef it out unless we're using GLIBC.

Will do.

So maybe we can ifdef the kludge with something like __GLIBC_PREREQ(2, 25)

What matters is the runtime version, so we'd need to parse gnu_get_libc_version if we really want to gate it. glibc has used CLONE_CHILD_CLEARTID since at least 2003 so the hack will work for ancient glibcs. I'm not sure if parsing gnu_get_libc_version is nicer.

lifubang · 2024-04-15T10:16:41Z

libcontainer/nsenter/nsexec.c

+
+		pid_t *fake_tid_array = (pid_t *) pthread_self();
+		/* On my machine, fake_tid_array[180] is PTHREAD_SELF->tid. */
+		for (size_t i = 0; i < 512; i++) {


What a wonderful solution!
I think this looks like Credential-Stuffing attack? Is there a chance that some other var's value happens equal to actual_tid?

I'm not sure I'd call it "wonderful" 😅

It is possible for another field to have the same value, and bad things might happen if we overwrite it. I've to implemented @kolyshkin's suggestion of trying some known field offsets first, and only doing a linear scan if the known offsets don't match (which should never happen in practice, unless glibc changes their structure layouts).

I'm not sure I'd call it "wonderful" 😅

I have tried CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID, but I didn't think out using it to rewrite the tid value in pthread_t. 👍

cyphar · 2024-04-15T15:39:21Z

@kolyshkin I went with writing dummy versions of the structure layouts so we can just use sizeof and offsetof. Would you prefer it just be a list of magic constants (which would be much shorter, but would need some care taken for pointer sizes for architectures where there are 32- and 64-bit versions)?

dims · 2024-04-16T12:43:37Z

libcontainer/nsenter/nsexec.c

+static pid_t *find_tls_tid_address(void)
+{
+	/*
+	 * Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),


https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commit;h=c579f48edba88380635ab98cb612030e3ed8691e

AkihiroSuda · 2024-04-16T18:46:43Z

libcontainer/nsenter/nsexec.c

-#if __GLIBC__ == 2 && __GLIBC_MINOR__ < 14
-#  define _GNU_SOURCE
-#  include "syscall.h"
+#if GLIBC_VERSION < 2014	/* < 2.14 */


Why not just keep #if __GLIBC__ == 2 && __GLIBC_MINOR__ < 14

In the latest version I'm about to push, I switched them to using __GLIBC_PREREQ which is even simpler.

While nobody is going to be using glibc 1.x and glibc has said there will never be a 3.x, technically doing the version check this way is not correct.

cyphar · 2024-04-17T13:46:04Z

On musl, it seems they:

Use CLONE_CHILD_CLEARTID for some TLS magic to indicate when a thread has died, which means that PR_GET_TID_ADDRESS will return the wrong address.
Don't allow setting CLONE_CHILD_CLEARTID for the clone syscall.

So for musl we would need to do the linear scan and then set the address manually using set_tid_address(2). While Go does support musl builds, should we put any effort into making sure this hack works with musl? AFAICS the pthreads_getattr_np(pthread_self(), ...) issue will also apply to musl? @kolyshkin ?

kolyshkin · 2024-04-18T19:34:27Z

So for musl we would need to do the linear scan and then set the address manually using set_tid_address(2). While Go does support musl builds, should we put any effort into making sure this hack works with musl? AFAICS the pthreads_getattr_np(pthread_self(), ...) issue will also apply to musl? @kolyshkin ?

I did some bare bone testing using the following Dockerfile (had to use alpine edge as the latest tagged version still uses go 1.21):

FROM alpine:edge
RUN apk add --no-cache git make go curl gawk iptables pkgconf python3 sshfs sudo iproute2 bats libseccomp-dev util-linux vim jq
RUN git config --global --add safe.directory /go/src/github.com/opencontainers/runc

WORKDIR /go/src/github.com/opencontainers/runc

With the above Dockerfile, I built runc and ran some tests (make all && bats tests/integration/exec.bats), for some reason it works (found some other non-related bugs, will send a PR or two). Will dig deeper later; maybe we also need to add a CI job doing that.

For now, I think, it does not make sense to fix what's not broken.

Adapted from: opencontainers/runc#4247 Execution of a container using a PID namespace can fail on certain versions of glibc when Singularity is built with Go 1.22. This is due to Go 1.22 performing calls using pthread_self which, from glibc 2.25, is not updated for the current TID on clone. Fixes sylabs#2677 ----- Original runc explanation: Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored. The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning. Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field). Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons. Fixes opencontainers/runc#4233 Signed-off-by: Aleksa Sarai cyphar@cyphar.com

cyphar · 2024-04-20T06:07:43Z

@kolyshkin It probably works because Go 1.22 still ignores the error from pthread_getattr_np and so it continues running as though the pthread request worked. The proposed patch in the original Go issue thread would cause an error to panic, but they haven't merged it yet.

I would also prefer to not merge this code if it isn't necessary, but I'm worried there is the possibility of some part of Go's runtime hitting an issue on production machines. Then again, as far as I can tell from looking into it, in practice things seem to just happen to work out.

pthread_getattr_np is used to determine the bounds of the stack (see src/runtime/cgo/gcc_stack_unix.c) for two purposes:

To handle a possible case where CGo gets called on a thread where the Go runtime hasn't run (see src/runtime/cgocall.go).
To figure out what stack size is needed for threads when the runtime spawns threads in CGo-enabled Go processes (see src/runtime/cgo/gcc_linux.c).

As far as I can tell, both uses just happen to work:

We are never in a situation where callbackUpdateSystemStack will be called, but in addition callbackUpdateSystemStack assumes that _cgo_getstackbound can fail silently (as it does on Windows) and so it has some reasonable fallbacks that it uses if the returned stacks are NULL.
pthread_create will pick reasonable defaults when called with an attribute struct that has a zero-sized stack size configured (see allocate_stack).

Now, it is entirely possible that some other aspect of Go could have issues with our clone setup. But I agree that unless they merge the PR where they will panic Go if pthread_getattr_np(pthread_self(), ...) returns an error, this patch is probably not necessary...

kolyshkin · 2024-04-23T00:23:32Z

libcontainer/nsenter/nsexec.c

+	/* #if TLS_DTV_AT_TP */
+	struct pthread__dtv_at_tp {
+		union {
+			struct {
+				int multiple_threads, gscope_flag;
+			} header;
+			void *__padding[24];
+		};
+		struct {
+			void *prev, *next;
+		} list;
+		pid_t tid;	/* the field we are looking for! */
+	};
+
+	/* #if !TLS_DTV_AT_TP */
+	struct pthread__tcbhead {
+		union {
+			struct tcbhead_t header;
+			void *__padding[24];
+		};
+		struct {
+			void *prev, *next;
+		} list;
+		pid_t tid;	/* the field we are looking for! */
+	};


It looks like TLS_DTV_AT_TP / TLS_TCB_AT_TP is arch-specific:

i386, x86_64, s390 and sparc have TLS_TCB_AT_TP

all other architectures have TLS_DTV_AT_TP

this has never changed as far as I can see from the git history

We can probably use this and only try one offset. Or do you think that it may change in the future?

I mean, do something like this:

// The following is a summary from glibc <sysdeps/.../ntpl/tls.h>. #if defined(__x86_64__) || defined(__i386__) || defined(__s390__) || defined(__s390x__) || defined(__sparc__) #define TLS_TCB_AT_TP 1 #define TLS_DTV_AT_TP 0 #else #define TLS_DTV_AT_TP 1 #define TLS_TCB_AT_TP 0 #endif #ifdef TLS_TCB_AT_TP // XXX define struct tcbhead_t as you did above. // XXX No need to include powerpc version as it's not used. #endif // This is adopted from glibc <nptl/descr.h>. struct _glibc_pthread { union { #if TLS_TCB_AT_TP // !TLS_DTV_AT_TP tcbhead_t header; #else struct { int multiple_threads; int gscope_flag; } header; #endif void *__padding[24]; }; struct { void *prev, *next; } list; pid_t tid; /* the field we are looking for! */ }

kolyshkin · 2024-04-23T00:24:40Z

libcontainer/nsenter/nsexec.c

+#endif
+	};
+
+	/* #if TLS_DTV_AT_TP */


Maybe add a note this is taken from glibc's nptl/descr.h.

kolyshkin · 2024-04-23T01:29:25Z

The only question remains, do we need this kludge for all Go versions of only for Go 1.22+. I tentatively vote for the latter, based on "don't fix if not broken" principle.

Can do this adding a go file like this:

//go:build go1.22

package nsenter

// #cgo CFLAGS: -DRUNC_GLIBC_TID_KLUDGE=1
import "C"

kolyshkin · 2024-04-26T18:19:51Z

@cyphar shall we move this out of draft?

libcontainer/nsenter/nsexec.c

lifubang · 2024-04-29T11:17:39Z

libcontainer/nsenter/nsexec.c

+	else
+		write_log(DEBUG, "clone: find_glibc_tls_tid_address: PR_GET_TID_ADDRESS failed: %m");
+	write_log(WARNING,
+		  "clone: find_glibc_tls_tid_address: falling back to scanning pthread_self(). Please use a kernel with CONFIG_CHECKPOINT_RESTORE=y.");


Maybe we need to add a ci test for CONFIG_CHECKPOINT_RESTORE!=y?

Hard to implement as for that we need a kernel with CONFIG_CHECKPOINT_RESTORE not set

I understand. Maybe we can use an env to skip prctl(PR_GET_TID_ADDRESS) when testing, for example:
950ff28

lifubang · 2024-04-29T11:19:15Z

libcontainer/nsenter/nsexec.c

+	 * cover every architecture without a huge risk of SIGSEGV.
+	 */
+	int i;
+	for (i = 0; i < 1024; i += sizeof(pid_t))


Maybe we should remove this fail back? Because this may get wrong offset.

Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored. The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and could change without warning. Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() is a pointer to the head of the TLS structure). However, to avoid false positives we first try known-correct offsets based on the current structure layouts. If that fails, we scan the 1K block for any fields that might match. When doing the scan, we assume that the first field we find that contains the actual TID of the current process is the field we want. Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons. Sadly, this is all glibc-specific. musl doesn't even allow you to use CLONE_CHILD_CLEARTID (and they use a different address for the TID anyway). We could do the memory scan and manually overwrite the address after clone(2), but we can deal with that in the future if it turns out people use non-glibc builds and need this fix. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

AkihiroSuda · 2024-05-09T21:59:06Z

Is this still WIP?

kolyshkin reviewed Apr 14, 2024

View reviewed changes

lifubang reviewed Apr 15, 2024

View reviewed changes

cyphar force-pushed the nsexec-glibc-stale-tid branch from 4cd85ef to d9f1e23 Compare April 15, 2024 15:37

cyphar force-pushed the nsexec-glibc-stale-tid branch 5 times, most recently from d90f74e to fcb0926 Compare April 16, 2024 04:47

dims reviewed Apr 16, 2024

View reviewed changes

AkihiroSuda reviewed Apr 16, 2024

View reviewed changes

cyphar force-pushed the nsexec-glibc-stale-tid branch 4 times, most recently from 7027163 to 69ffa06 Compare April 17, 2024 15:27

cyphar changed the title ~~[wip] nsenter: overwrite glibc's internal tid cache on clone()~~ nsenter: overwrite glibc's internal tid cache on clone() Apr 17, 2024

dtrudg mentioned this pull request Apr 19, 2024

fix: starter: overwrite glibc's internal tid cache on clone() sylabs/singularity#2837

Draft

kolyshkin reviewed Apr 23, 2024

View reviewed changes

cyphar force-pushed the nsexec-glibc-stale-tid branch from 69ffa06 to b0654c7 Compare April 29, 2024 02:09

rata reviewed Apr 29, 2024

View reviewed changes

libcontainer/nsenter/nsexec.c Outdated Show resolved Hide resolved

lifubang reviewed Apr 29, 2024

View reviewed changes

cyphar force-pushed the nsexec-glibc-stale-tid branch from b0654c7 to e10daeb Compare April 30, 2024 05:16

cyphar marked this pull request as ready for review May 11, 2024 15:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nsenter: overwrite glibc's internal tid cache on clone() #4247

nsenter: overwrite glibc's internal tid cache on clone() #4247

cyphar commented Apr 14, 2024

kolyshkin left a comment

kolyshkin commented Apr 14, 2024

cyphar commented Apr 15, 2024 •

edited

lifubang Apr 15, 2024

cyphar Apr 15, 2024

lifubang Apr 15, 2024

cyphar commented Apr 15, 2024 •

edited

dims Apr 16, 2024

AkihiroSuda Apr 16, 2024

cyphar Apr 17, 2024

cyphar commented Apr 17, 2024 •

edited

kolyshkin commented Apr 18, 2024

cyphar commented Apr 20, 2024

kolyshkin Apr 23, 2024

kolyshkin Apr 23, 2024

kolyshkin Apr 23, 2024

kolyshkin commented Apr 23, 2024

kolyshkin commented Apr 26, 2024

lifubang Apr 29, 2024

kolyshkin May 10, 2024

lifubang May 11, 2024

lifubang Apr 29, 2024

AkihiroSuda commented May 9, 2024

nsenter: overwrite glibc's internal tid cache on clone() #4247

Are you sure you want to change the base?

nsenter: overwrite glibc's internal tid cache on clone() #4247

Conversation

cyphar commented Apr 14, 2024

kolyshkin left a comment

Choose a reason for hiding this comment

kolyshkin commented Apr 14, 2024

cyphar commented Apr 15, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyphar commented Apr 15, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyphar commented Apr 17, 2024 • edited

kolyshkin commented Apr 18, 2024

cyphar commented Apr 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kolyshkin commented Apr 23, 2024

kolyshkin commented Apr 26, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AkihiroSuda commented May 9, 2024

cyphar commented Apr 15, 2024 •

edited

cyphar commented Apr 15, 2024 •

edited

cyphar commented Apr 17, 2024 •

edited