[wip] nsenter: overwrite glibc's internal tid cache on clone()

Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and could change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS).

For older kernels we need to memory scan the TLS structure (pthread_self() is a pointer to the head of the TLS structure). However, to avoid false positives we first try known-correct offsets based on the current structure layouts. If that fails, we scan the 1K block for any fields that might match. When doing the scan, we assume that the first field we find that contains the actual TID of the current process is the field we want.

Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not foresee. Sorry about that.

As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
opencontainers · Apr 15, 2024 · d9f1e23
1 parent 5e0ec3f
Showing 1 changed file with 216 additions and 1 deletion.
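As a rough illustration of the PR_GET_TID_ADDRESS path described in the commit message, the minimal sketch below asks the kernel for the calling thread's clear_child_tid pointer and compares the value stored there with the kernel's view of the TID. The helper names query_tid_address() and tid_cache_matches() are hypothetical and do not appear in the patch. The sketch assumes Linux with glibc, where that pointer is expected to be &PTHREAD_SELF->tid.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PR_GET_TID_ADDRESS
#define PR_GET_TID_ADDRESS 40	/* value from <linux/prctl.h>, for old headers */
#endif

/*
 * Hypothetical helper (not part of the patch): ask the kernel for the
 * clear_child_tid pointer of the calling thread. Returns NULL if the
 * prctl fails (for example on a CONFIG_CHECKPOINT_RESTORE=n kernel) or
 * if no address has been registered.
 */
static pid_t *query_tid_address(void)
{
	pid_t *addr = NULL;

	if (prctl(PR_GET_TID_ADDRESS, &addr) != 0)
		return NULL;
	return addr;
}

/*
 * Hypothetical helper: returns 1 if the cached value matches the kernel's
 * TID, 0 if it is stale, and -1 if the address could not be obtained.
 */
static int tid_cache_matches(void)
{
	pid_t *addr = query_tid_address();

	/* The kernel stores an int at this address, so pid_t must be int-sized. */
	assert(sizeof(pid_t) == sizeof(int));

	if (!addr)
		return -1;
	return *addr == (pid_t) syscall(SYS_gettid);
}
```

A result of -1 corresponds to the case where the fast path is unavailable and a fallback is needed. A result of 0 is the stale-cache situation this commit tries to repair.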
diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
@@ -13,8 +13,10 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdbool.h>
+#include <stddef.h>
 #include <string.h>
 #include <unistd.h>
+#include <pthread.h> /* only used for pthread_self() -- see find_tls_tid_address() */
 
 #include <sys/ioctl.h>
 #include <sys/prctl.h>
@@ -311,6 +313,214 @@ static int child_func(void *arg)
 	longjmp(*ca->env, ca->jmpval);
 }
 
+static pid_t *find_tls_tid_address(void)
+{
+	/*
+	 * Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),
+	 * glibc no longer updates the TLS state containing the current process
+	 * tid after clone(2). This results in stale TIDs being used when Go
+	 * 1.22 and later call pthread_getattr_np(pthread_self()), resulting
+	 * in crashes on ancient glibcs and errors on newer glibcs.
+	 *
+	 * Luckily, because the same address is used for CLONE_CHILD_CLEARTID,
+	 * we can poke around in glibc's internal cache by getting the address
+	 * using PR_GET_TID_ADDRESS (only available in Linux >= 3.5, with
+	 * CONFIG_CHECKPOINT_RESTORE=y) and then overwriting it with
+	 * CLONE_CHILD_SETTID. CLONE_CHILD_CLEARTID is set to allow descendant
+	 * PR_GET_TID_ADDRESS calls to work, as well as matching what glibc
+	 * does in arch_fork().
+	 *
+	 * Yes, this is pretty horrific, but the core issue here is that we
+	 * need to run Go code ("runc init") in the child after fork(), which
+	 * is not allowed by glibc (see signal-safety(7)). We cannot exec to
+	 * solve the problem because we are in a security critical situation
+	 * here, and doing an exec would allow for container escapes (obvious
+	 * issues include that the shared libraries loaded from a re-exec would
+	 * come from the container, and doing an exec here would clear the bit
+	 * that makes non-dumpable flags effective for userns containers with
+	 * CAP_SYS_PTRACE).
+	 */
+
+	pid_t *tid_addr = NULL;
+	pid_t actual_tid = gettid();
+
+	/*
+	if (!prctl(PR_GET_TID_ADDRESS, &tid_addr))
+		goto got_tid_addr;
+	*/
+
+	/*
+	 * If we cannot use PR_GET_TID_ADDRESS to get &PTHREAD_SELF->tid, we
+	 * are probably running on a CONFIG_CHECKPOINT_RESTORE=n kernel.
+	 * Unfortunately the layout of "struct pthread" is not public, so we
+	 * need to get the address by force.
+	 *
+	 * So, we treat the structure as though it were pid_t[] to find an
+	 * offset whose value matches the tid of the current process. In order
+	 * to avoid accidentally choosing an offset in some internal data
+	 * structure in tcbhead_t, we first try some known-correct offsets on
+	 * the current architecture. If none of those work, we do a linear
+	 * scan. Yes, this is *much* worse than PR_GET_TID_ADDRESS and is
+	 * pretty terrifying, but we should never get here on the vast majority
+	 * of machines.
+	 *
+	 * (To be honest, maybe it's better to just hope Go doesn't notice any
+	 * issues with glibc rather than trying to hack internal glibc
+	 * structures to make them "work" with Go. But it seems we need to do
+	 * this...)
+	 */
+
+	write_log(WARNING, "clone: PR_GET_TID_ADDRESS failed (%m): falling back to scanning pthread_self -- please use a kernel with CONFIG_CHECKPOINT_RESTORE=y");
+
+	/*
+	 * These offsets are based on glibc 2.39, but the layout of struct
+	 * pthread (at least up to the tid field) has been stable for several
+	 * decades. The cached pid (from pre-2.25 glibc) was stored after the
+	 * tid field, so even on ancient glibc versions it's "safe" for us to
+	 * do this.
+	 *
+	 * The structure layouts can be found in <sysdeps/.../nptl/tls.h>.
+	 */
+
+#if defined(__x86_64__)
+	struct tcbhead_t {
+		void *tcb, *dtv, *self;
+		int multiple_threads, gscope_flag;
+		uintptr_t sysinfo, stack_guard, pointer_guard;
+		unsigned long int unused_vgetcpu_cache[2];
+		unsigned int feature_1;
+		int __glibc_unused1;
+		void *__private_tm[4];
+		void *__private_ss;
+		unsigned long long int ssp_base;
+		// int128_t[8][4]
+		int __glibc_unused[4][8][4] __attribute__ ((aligned (32)));
+		void *__padding[8];
+	};
+#elif defined(__i386__)
+	struct tcbhead_t {
+		void *tcb, *dtv, *self;
+		int multiple_threads;
+		uintptr_t sysinfo, stack_guard, pointer_guard;
+		int gscope_flag;
+		unsigned int feature_1;
+		void *__private_tm[3];
+		void *__private_ss;
+		unsigned long ssp_base;
+	};
+#elif defined(__powerpc__) || defined(__powerpc64__)
+	struct tcbhead_t {
+		uint64_t hwcap_extn, hwcap;
+#	ifndef __powerpc64__
+		uint32_t padding, at_platform;
+#	endif
+		uint32_t __unused;
+#	ifdef __powerpc64__
+		uint32_t at_platform;
+#	endif
+		uintptr_t dso_slot2, dso_slot1, tar_save;
+		void *__private_ss;
+		uintptr_t ebb_handler, ebb_ctx_pointer, ebb_reserved1,
+			  ebb_reserved2, pointer_guard, stack_guard;
+		void *dtv;
+	};
+#elif defined(__s390__) || defined(__s390x__)
+	struct tcbhead_t {
+		void *tcb, *dtv, *self;
+		int multiple_threads;
+		uintptr_t sysinfo;
+		uintptr_t stack_guard;
+		int gscope_flag;
+		int __glibc_reserved1;
+		void *__private_ss;
+	};
+#elif defined(__sparc__)
+	struct tcbhead_t {
+		void *tcb, *dtv, *self;
+		int multiple_threads;
+#	if __WORDSIZE == 64
+		int gscope_flag;
+#	endif
+		uintptr_t sysinfo;
+		uintptr_t stack_guard;
+		uintptr_t pointer_guard;
+#	if __WORDSIZE != 64
+		int gscope_flag;
+#	endif
+	};
+
+#else
+	/* All other architectures have an identical structure. */
+	struct tcbhead_t {
+		void *dtv, *private;
+	};
+#endif
+
+	/* #if TLS_DTV_AT_TP */
+	struct pthread__dtv_at_tp {
+		union {
+			struct { int multiple_threads, gscope_flag; } header;
+			void *__padding[24];
+		};
+		struct { void *prev, *next; } list;
+		pid_t tid; /* the field we are looking for! */
+	};
+
+	/* #if !TLS_DTV_AT_TP */
+	struct pthread__tcbhead {
+		union {
+			struct tcbhead_t header;
+			void *__padding[24];
+		};
+		struct { void *prev, *next; } list;
+		pid_t tid; /* the field we are looking for! */
+	};
+
+
+#define TRY_TID_OFFSET(offset)						\
+	do {								\
+		size_t __idx = (offset);				\
+		pid_t *__addr = (pid_t *) (pthread_self() + __idx);	\
+		if (*__addr == actual_tid) {				\
+			tid_addr = __addr;				\
+			write_log(DEBUG, "clone: find_tls_tid_address: using %p as tid address (pthread_self+0x%zx, index %zu)", \
+				  tid_addr, __idx, __idx / sizeof(pid_t)); \
+			goto got_tid_addr;				\
+		}							\
+	} while (0)
+
+	/* First, try the known-good address offsets. */
+	TRY_TID_OFFSET(offsetof(struct pthread__tcbhead, tid));
+	TRY_TID_OFFSET(offsetof(struct pthread__dtv_at_tp, tid));
+
+	write_log(DEBUG, "clone: find_tls_tid_address: known offsets 0x%zx and 0x%zx failed -- falling back to brute-force linear scan",
+		  offsetof(struct pthread__dtv_at_tp, tid),
+		  offsetof(struct pthread__tcbhead, tid));
+
+	/*
+	 * If the known offsets are wrong, we have to fall back to a linear
+	 * scan. The pid_t will always be aligned, so we check in blocks of
+	 * sizeof(pid_t). This could result in the wrong address, but there
+	 * isn't a better option unfortunately.
+	 *
+	 * On my x86_64 machine, sizeof(struct pthread) is 724. x86_64 has the
+	 * largest struct pthread, so scanning up to an offset of 1024 should
+	 * cover every architecture without a huge risk of SIGSEGV.
+	 */
+	int i;
+	for (i = 0; i < 1024; i += sizeof(pid_t))
+		TRY_TID_OFFSET(i);
+
+got_tid_addr:
+	if (!tid_addr)
+		write_log(WARNING, "clone: could not get glibc-private tid address");
+	else if (*tid_addr != actual_tid)
+		write_log(WARNING, "clone: glibc private tid address is wrong: *%p %d != gettid() %d", tid_addr, *tid_addr, actual_tid);
+	else
+		write_log(DEBUG, "clone: found seemingly viable tid address %p (pthread_self=%p)", tid_addr, (void *) pthread_self());
+	return tid_addr;
+}
+
 static int clone_parent(jmp_buf *env, int jmpval) __attribute__((noinline));
 static int clone_parent(jmp_buf *env, int jmpval)
 {
@@ -319,7 +529,12 @@ static int clone_parent(jmp_buf *env, int jmpval)
 		.jmpval = jmpval,
 	};
 
-	return clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD, &ca);
+	/* Make sure the child has the correct PTHREAD_SELF->tid. */
+	pid_t *tid_addr = find_tls_tid_address();
+
+	return clone(child_func, ca.stack_ptr,
+		     CLONE_PARENT | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, &ca,
+		     NULL /* parent_tid */ , NULL /* tls */ , tid_addr /* child_tid */);
 }
 
 /* Returns the clone(2) flag for a namespace, given the name of a namespace. */
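The fallback strategy in find_tls_tid_address() (try known offsets first, then scan in pid_t-sized steps and accept the first match) can be sketched on an ordinary buffer, without touching glibc internals. Everything below is illustrative: struct mock_tls, find_tid_offset() and TID_NOT_FOUND are made-up names, and the mock layout only loosely imitates the padded-header-then-list-then-tid shape used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define TID_NOT_FOUND ((size_t) -1)

/* Hypothetical stand-in for a private TLS descriptor (NOT glibc's layout). */
struct mock_tls {
	void *header[24];
	struct { void *prev, *next; } list;
	pid_t tid;
};

/* Read a pid_t at a byte offset without relying on pointer alignment. */
static int matches_at(const unsigned char *base, size_t off, pid_t want)
{
	pid_t val;

	memcpy(&val, base + off, sizeof(val));
	return val == want;
}

/*
 * Return the byte offset of the first field equal to `want`. The offsets in
 * `known` are tried first so that an unrelated field earlier in the block,
 * which happens to hold the same bits, does not win. If none of them match,
 * fall back to a linear scan in steps of sizeof(pid_t).
 */
static size_t find_tid_offset(const void *block, size_t len, pid_t want,
			      const size_t *known, size_t nknown)
{
	const unsigned char *base = block;
	size_t i, off;

	assert(block != NULL);

	for (i = 0; i < nknown; i++)
		if (known[i] + sizeof(pid_t) <= len &&
		    matches_at(base, known[i], want))
			return known[i];

	for (off = 0; off + sizeof(pid_t) <= len; off += sizeof(pid_t))
		if (matches_at(base, off, want))
			return off;

	return TID_NOT_FOUND;
}
```

With a known offset supplied, a colliding value earlier in the block is skipped. With no known offsets, the scan returns the first match, which may be the wrong field. That is the false-positive risk the commit message warns about.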