[wip] nsenter: overwrite glibc's internal tid cache on clone()
Since glibc 2.25, the thread-local cache of the current TID is no longer
updated in the child when calling clone(2). As a result, any pthread call
that Go makes on pthread_self() in the child operates on a structure that
still holds the TID of the process that called clone(2).

The "simple" solution is to forcefully overwrite this cached value.
Unfortunately (and unsurprisingly), the layout of "struct pthread" is
strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with
the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as
long as runc is using glibc, when "runc init" is spawned the child
process will have a pointer directly to the cached value we want to
change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later,
we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need
to memory scan the TLS structure (pthread_self() returns a pointer to
the start of the structure so we can "just" scan it for a field
containing the current TID and assume that it is the correct field).
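
The two lookup strategies described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this commit: the helper names current_tid(), tid_addr_from_kernel() and scan_for_tid() are hypothetical, and the raw gettid syscall is used only because the gettid() wrapper is missing from glibc releases before 2.30.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PR_GET_TID_ADDRESS
#define PR_GET_TID_ADDRESS 40	/* value from <linux/prctl.h> */
#endif

/* Hypothetical helper: the calling thread's TID via the raw syscall. */
static pid_t current_tid(void)
{
	return (pid_t) syscall(SYS_gettid);
}

/*
 * Hypothetical helper: ask the kernel for the address registered through
 * set_tid_address(2) or CLONE_CHILD_CLEARTID. Returns NULL if the prctl is
 * unavailable (for example CONFIG_CHECKPOINT_RESTORE=n) or if no address
 * has been registered for this thread.
 */
static pid_t *tid_addr_from_kernel(void)
{
	pid_t *addr = NULL;

	if (prctl(PR_GET_TID_ADDRESS, (unsigned long) &addr, 0, 0, 0) < 0)
		return NULL;
	return addr;
}

/*
 * Hypothetical helper: treat a block of memory as pid_t[] and return the
 * first slot whose value equals tid, or NULL if none matches. A match is
 * only a guess: any unrelated field holding the same value would also hit.
 */
static pid_t *scan_for_tid(pid_t *base, size_t nelems, pid_t tid)
{
	for (size_t i = 0; i < nelems; i++) {
		if (base[i] == tid)
			return &base[i];
	}
	return NULL;
}
```

A caller would try tid_addr_from_kernel() first and fall back to scan_for_tid((pid_t *) pthread_self(), n, current_tid()) only when it returns NULL. The scan length n is a guess at the size of the private structure, so the result should be treated as best-effort.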

Obviously this is all very horrific, and if you are reading this in the
future, it almost certainly has caused some horrific bug that I did not
foresee. Sorry about that. As far as I can tell, there is no other
workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID
behaviour of glibc in some way. We cannot "just" do a re-exec after
clone(2) for security reasons.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
cyphar committed Apr 14, 2024
1 parent 5e0ec3f commit 4cd85ef
Showing 1 changed file with 69 additions and 1 deletion.
70 changes: 69 additions & 1 deletion libcontainer/nsenter/nsexec.c
@@ -15,6 +15,7 @@
 #include <stdbool.h>
 #include <string.h>
 #include <unistd.h>
+#include <pthread.h> /* only used for pthread_self() -- see clone_parent() */
 
 #include <sys/ioctl.h>
 #include <sys/prctl.h>
@@ -319,7 +320,74 @@ static int clone_parent(jmp_buf *env, int jmpval)
 		.jmpval = jmpval,
 	};
 
-	return clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD, &ca);
+	/*
+	 * Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),
+	 * glibc no longer updates the TLS state containing the current process
+	 * tid after clone(2). This results in stale TIDs being used when Go
+	 * 1.22 and later call pthread_getattr_np(pthread_self()), resulting
+	 * in crashes on ancient glibcs and errors on newer glibcs.
+	 *
+	 * Luckily, because the same address is used for CLONE_CHILD_CLEARTID,
+	 * we can poke around in glibc's internal cache by getting the address
+	 * using PR_GET_TID_ADDRESS (only available in Linux >= 3.5, with
+	 * CONFIG_CHECKPOINT_RESTORE=y) and then overwriting it with
+	 * CLONE_CHILD_SETTID. CLONE_CHILD_CLEARTID is set to allow descendant
+	 * PR_GET_TID_ADDRESS calls to work, as well as matching what glibc
+	 * does in arch_fork().
+	 *
+	 * Yes, this is pretty horrific, but the core issue here is that we
+	 * need to run Go code ("runc init") in the child after fork(), which
+	 * is not allowed by glibc (see signal-safety(7)). We cannot exec to
+	 * solve the problem because we are in a security critical situation
+	 * here, and doing an exec would allow for container escapes (obvious
+	 * issues include that the shared libraries loaded from a re-exec would
+	 * come from the container, and doing an exec here would clear the bit
+	 * that makes non-dumpable flags effective for userns containers with
+	 * CAP_SYS_PTRACE).
+	 */
+
+	pid_t *tid_addr = NULL;
+	pid_t actual_tid = gettid();
+	if (prctl(PR_GET_TID_ADDRESS, &tid_addr) < 0) {
+		/*
+		 * We couldn't get &PTHREAD_SELF->tid, probably meaning we are running
+		 * with CONFIG_CHECKPOINT_RESTORE=n. Unfortunately the layout of
+		 * "struct pthread" is not public, but we need to get the address by
+		 * force.
+		 *
+		 * So, scan the structure as though it were pid_t[] to find the first
+		 * element which matches the actual tid of the current process. Yes,
+		 * this is *much* worse than PR_GET_TID_ADDRESS, but we should never
+		 * get here on the vast majority of machines.
+		 *
+		 * (To be honest, maybe it's better to just hope Go doesn't notice any
+		 * issues with glibc rather than trying to hack internal glibc
+		 * structures to make them "work" with Go. But it seems we need to do
+		 * this...)
+		 */
+
+		write_log(WARNING, "clone: PR_GET_TID_ADDRESS failed (%m): falling back to scanning pthread_self -- please use a kernel with CONFIG_CHECKPOINT_RESTORE=y");
+
+		pid_t *fake_tid_array = (pid_t *) pthread_self();
+		/* On my machine, fake_tid_array[180] is PTHREAD_SELF->tid. */
+		for (size_t i = 0; i < 512; i++) {
+			if (fake_tid_array[i] == actual_tid) {
+				tid_addr = &fake_tid_array[i];
+				write_log(DEBUG, "clone: using %p as tid address (pthread_self+0x%zx, index %zu)", (void *) tid_addr, i * sizeof(*fake_tid_array), i);
+				break;
+			}
+		}
+	}
+
+	write_log(DEBUG, "clone: get_tid_address gave us %p (pthread_self=%p)", (void *) tid_addr, (void *) pthread_self());
+	if (!tid_addr)
+		write_log(WARNING, "clone: could not get glibc-private tid address");
+	else if (*tid_addr != actual_tid)
+		write_log(WARNING, "clone: glibc private tid address is wrong: *%p %d != gettid() %d", (void *) tid_addr, *tid_addr, actual_tid);
+
+	return clone(child_func, ca.stack_ptr,
+		     CLONE_PARENT | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, &ca,
+		     NULL /* parent_tid */ , NULL /* tls */ , tid_addr /* child_tid */);
 }
 
/* Returns the clone(2) flag for a namespace, given the name of a namespace. */
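
The effect of passing CLONE_CHILD_SETTID with a child_tid pointer, as the new clone() call in clone_parent() does, can be observed in isolation with a small standalone sketch. It rests on several assumptions: fake_tid_cache is a hypothetical stand-in for glibc's private tid field, child_fn() and clone_and_check_tid_cache() are names invented for this example, and CLONE_PARENT is deliberately left out so that the caller can waitpid() for the child.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the TID cached inside glibc's struct pthread. */
static pid_t fake_tid_cache;

/* Runs in the child: exit 0 if the cached value matches our real TID. */
static int child_fn(void *arg)
{
	volatile pid_t *cache = arg;

	_exit(*cache == (pid_t) syscall(SYS_gettid) ? 0 : 1);
}

/*
 * Clone a child with or without CLONE_CHILD_SETTID and report what the
 * child saw: 0 if the cache held the child's own TID, 1 if it still held
 * the caller's TID, -1 on error.
 */
static int clone_and_check_tid_cache(int use_settid)
{
	const size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);
	int flags = SIGCHLD | (use_settid ? CLONE_CHILD_SETTID : 0);
	int status = 0;
	pid_t pid;

	if (!stack)
		return -1;

	/* Seed the cache with the caller's TID, as glibc would have done. */
	fake_tid_cache = (pid_t) syscall(SYS_gettid);

	/* The stack grows down on common architectures, so pass its top. */
	pid = clone(child_fn, stack + stack_size, flags, &fake_tid_cache,
		    NULL /* parent_tid */ , NULL /* tls */ ,
		    &fake_tid_cache /* child_tid */);
	if (pid < 0 || waitpid(pid, &status, 0) < 0) {
		free(stack);
		return -1;
	}
	free(stack);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling clone_and_check_tid_cache(0) leaves the child with the caller's TID in the cache, which is the stale state this commit works around. Calling clone_and_check_tid_cache(1) lets the kernel store the child's TID at that address in the child's memory before the child starts running.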
