Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: overwrite glibc's internal tid cache on clone() #4247

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Apr 14, 2024

An alternative to #4193.

I will also take a look in a separate PR as to whether we can remove a stage from our three-stage structure (such as moving the mapping requests to the Go parent, maybe?) to remove the need to PR_SET_CHILD_SUBREAPER in #4193. If we do it that way then switching to fork() would be far less of an issue.


Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons.

Fixes #4233
Signed-off-by: Aleksa Sarai cyphar@cyphar.com

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking quite scary, this resolves my concern about systems with CONFIG_CHECKPOINT_RESTORE not set, so SGTM.

Was thinking of using a list of known offsets first, before resorting to linear scan. Can try different distros to get those offsets (I'm pretty sure there are only 2 or 3).

512 as a upper size limit is also looking scary (withput haven't checked the actual struct pthread size).

Maybe move the kludge to a separate function, and #ifdef it out unless we're using GLIBC.

@kolyshkin
Copy link
Contributor

Since glibc 2.25,

So maybe we can ifdef the kludge with something like __GLIBC_PREREQ(2, 25)

@cyphar
Copy link
Member Author

cyphar commented Apr 15, 2024

Was thinking of using a list of known offsets first, before resorting to linear scan. Can try different distros to get those offsets (I'm pretty sure there are only 2 or 3).

Yeah, I was thinking of that too but wasn't sure how variable all the structures are before PTHREAD_SELF->tid. However, looking at it again, apparently tcbhead_t's size is required to stay the same because of AddressSanitizer, so there might only be a couple of offsets we need in almost all cases. Then again, tcbhead_t is arch-specific (and the AddressSanitizer comment is only in sysdeps/x86_64/nptl/tls.h) -- though since quite a few have the same layout there's probably less than 10 we need...

512 as a upper size limit is also looking scary (withput haven't checked the actual struct pthread size).

Yeah, unfortunately we can't be sure of the size of struct pthread. On my machine the tid is at pthread_self+0x2d0 (index 180) so my initial guess of 256 felt a little small. But looking at the definitions of tcbhead_t, it seems amd64 has the largest structure of any architecture and so 180 is probably about the largest offset we could ever get.

Maybe move the kludge to a separate function, and #ifdef it out unless we're using GLIBC.

Will do.

So maybe we can ifdef the kludge with something like __GLIBC_PREREQ(2, 25)

What matters is the runtime version, so we'd need to parse gnu_get_libc_version if we really want to gate it. glibc has used CLONE_CHILD_CLEARTID since at least 2003 so the hack will work for ancient glibcs. I'm not sure if parsing gnu_get_libc_version is nicer.


pid_t *fake_tid_array = (pid_t *) pthread_self();
/* On my machine, fake_tid_array[180] is PTHREAD_SELF->tid. */
for (size_t i = 0; i < 512; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a wonderful solution!
I think this looks like Credential-Stuffing attack? Is there a chance that some other var's value happens equal to actual_tid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'd call it "wonderful" 😅

It is possible for another field to have the same value, and bad things might happen if we overwrite it. I've to implemented @kolyshkin's suggestion of trying some known field offsets first, and only doing a linear scan if the known offsets don't match (which should never happen in practice, unless glibc changes their structure layouts).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'd call it "wonderful" 😅

I have tried CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID, but I didn't think out using it to rewrite the tid value in pthread_t. 👍

@cyphar
Copy link
Member Author

cyphar commented Apr 15, 2024

@kolyshkin I went with writing dummy versions of the structure layouts so we can just use sizeof and offsetof. Would you prefer it just be a list of magic constants (which would be much shorter, but would need some care taken for pointer sizes for architectures where there are 32- and 64-bit versions)?

@cyphar cyphar force-pushed the nsexec-glibc-stale-tid branch 5 times, most recently from d90f74e to fcb0926 Compare April 16, 2024 04:47
static pid_t *find_tls_tid_address(void)
{
/*
* Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#if __GLIBC__ == 2 && __GLIBC_MINOR__ < 14
# define _GNU_SOURCE
# include "syscall.h"
#if GLIBC_VERSION < 2014 /* < 2.14 */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just keep #if __GLIBC__ == 2 && __GLIBC_MINOR__ < 14

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest version I'm about to push, I switched them to using __GLIBC_PREREQ which is even simpler.

While nobody is going to be using glibc 1.x and glibc has said there will never be a 3.x, technically doing the version check this way is not correct.

@cyphar
Copy link
Member Author

cyphar commented Apr 17, 2024

On musl, it seems they:

  1. Use CLONE_CHILD_CLEARTID for some TLS magic to indicate when a thread has died, which means that PR_GET_TID_ADDRESS will return the wrong address.
  2. Don't allow setting CLONE_CHILD_CLEARTID for the clone syscall.

So for musl we would need to do the linear scan and then set the address manually using set_tid_address(2). While Go does support musl builds, should we put any effort into making sure this hack works with musl? AFAICS the pthreads_getattr_np(pthread_self(), ...) issue will also apply to musl? @kolyshkin ?

@cyphar cyphar force-pushed the nsexec-glibc-stale-tid branch 4 times, most recently from 7027163 to 69ffa06 Compare April 17, 2024 15:27
@cyphar cyphar changed the title [wip] nsenter: overwrite glibc's internal tid cache on clone() nsenter: overwrite glibc's internal tid cache on clone() Apr 17, 2024
@kolyshkin
Copy link
Contributor

So for musl we would need to do the linear scan and then set the address manually using set_tid_address(2). While Go does support musl builds, should we put any effort into making sure this hack works with musl? AFAICS the pthreads_getattr_np(pthread_self(), ...) issue will also apply to musl? @kolyshkin ?

I did some bare bone testing using the following Dockerfile (had to use alpine edge as the latest tagged version still uses go 1.21):

FROM alpine:edge
RUN apk add --no-cache git make go curl gawk iptables pkgconf python3 sshfs sudo iproute2 bats libseccomp-dev util-linux vim jq
RUN git config --global --add safe.directory /go/src/github.com/opencontainers/runc

WORKDIR /go/src/github.com/opencontainers/runc

With the above Dockerfile, I built runc and ran some tests (make all && bats tests/integration/exec.bats), for some reason it works (found some other non-related bugs, will send a PR or two). Will dig deeper later; maybe we also need to add a CI job doing that.

For now, I think, it does not make sense to fix what's not broken.

dtrudg added a commit to dtrudg/singularity that referenced this pull request Apr 19, 2024
Adapted from: opencontainers/runc#4247

Execution of a container using a PID namespace can fail on certain
versions of glibc when Singularity is built with Go 1.22.

This is due to Go 1.22 performing calls using pthread_self which,
from glibc 2.25, is not updated for the current TID on clone.

Fixes sylabs#2677

-----

Original runc explanation:

Since glibc 2.25, the thread-local cache of the current TID is no
longer updated in the child when calling clone(2). This results in
very unfortunate behaviour when Go does pthread calls using
pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value.
Unfortunately (and unsurprisingly), the layout of "struct pthread"
is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks
(with the child_tid set to the cached &PTHREAD_SELF->tid), meaning
that as long as runc is using glibc, when "runc init" is spawned
the child process will have a pointer directly to the cached value
we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on
Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS).
For older kernels we need to memory scan the TLS structure
(pthread_self() returns a pointer to the start of the structure
so we can "just" scan it for a field containing the current TID
and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this
in the future, it almost certainly has caused some horrific bug
that I did not forsee. Sorry about that. As far as I can tell,
there is no other workable solution that doesn't also depend on the
CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot
"just" do a re-exec after clone(2) for security reasons.

Fixes opencontainers/runc#4233
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
dtrudg added a commit to dtrudg/singularity that referenced this pull request Apr 19, 2024
Adapted from: opencontainers/runc#4247

Execution of a container using a PID namespace can fail on certain
versions of glibc when Singularity is built with Go 1.22.

This is due to Go 1.22 performing calls using pthread_self which,
from glibc 2.25, is not updated for the current TID on clone.

Fixes sylabs#2677

-----

Original runc explanation:

Since glibc 2.25, the thread-local cache of the current TID is no
longer updated in the child when calling clone(2). This results in
very unfortunate behaviour when Go does pthread calls using
pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value.
Unfortunately (and unsurprisingly), the layout of "struct pthread"
is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks
(with the child_tid set to the cached &PTHREAD_SELF->tid), meaning
that as long as runc is using glibc, when "runc init" is spawned
the child process will have a pointer directly to the cached value
we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on
Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS).
For older kernels we need to memory scan the TLS structure
(pthread_self() returns a pointer to the start of the structure
so we can "just" scan it for a field containing the current TID
and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this
in the future, it almost certainly has caused some horrific bug
that I did not forsee. Sorry about that. As far as I can tell,
there is no other workable solution that doesn't also depend on the
CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot
"just" do a re-exec after clone(2) for security reasons.

Fixes opencontainers/runc#4233
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
dtrudg added a commit to dtrudg/singularity that referenced this pull request Apr 19, 2024
Adapted from: opencontainers/runc#4247

Execution of a container using a PID namespace can fail on certain
versions of glibc when Singularity is built with Go 1.22.

This is due to Go 1.22 performing calls using pthread_self which,
from glibc 2.25, is not updated for the current TID on clone.

Fixes sylabs#2677

-----

Original runc explanation:

Since glibc 2.25, the thread-local cache of the current TID is no
longer updated in the child when calling clone(2). This results in
very unfortunate behaviour when Go does pthread calls using
pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value.
Unfortunately (and unsurprisingly), the layout of "struct pthread"
is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks
(with the child_tid set to the cached &PTHREAD_SELF->tid), meaning
that as long as runc is using glibc, when "runc init" is spawned
the child process will have a pointer directly to the cached value
we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on
Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS).
For older kernels we need to memory scan the TLS structure
(pthread_self() returns a pointer to the start of the structure
so we can "just" scan it for a field containing the current TID
and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this
in the future, it almost certainly has caused some horrific bug
that I did not forsee. Sorry about that. As far as I can tell,
there is no other workable solution that doesn't also depend on the
CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot
"just" do a re-exec after clone(2) for security reasons.

Fixes opencontainers/runc#4233
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
dtrudg added a commit to dtrudg/singularity that referenced this pull request Apr 19, 2024
Adapted from: opencontainers/runc#4247

Execution of a container using a PID namespace can fail on certain
versions of glibc when Singularity is built with Go 1.22.

This is due to Go 1.22 performing calls using pthread_self which,
from glibc 2.25, is not updated for the current TID on clone.

Fixes sylabs#2677

-----

Original runc explanation:

Since glibc 2.25, the thread-local cache of the current TID is no
longer updated in the child when calling clone(2). This results in
very unfortunate behaviour when Go does pthread calls using
pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value.
Unfortunately (and unsurprisingly), the layout of "struct pthread"
is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks
(with the child_tid set to the cached &PTHREAD_SELF->tid), meaning
that as long as runc is using glibc, when "runc init" is spawned
the child process will have a pointer directly to the cached value
we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on
Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS).
For older kernels we need to memory scan the TLS structure
(pthread_self() returns a pointer to the start of the structure
so we can "just" scan it for a field containing the current TID
and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this
in the future, it almost certainly has caused some horrific bug
that I did not forsee. Sorry about that. As far as I can tell,
there is no other workable solution that doesn't also depend on the
CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot
"just" do a re-exec after clone(2) for security reasons.

Fixes opencontainers/runc#4233
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
@cyphar
Copy link
Member Author

cyphar commented Apr 20, 2024

@kolyshkin It probably works because Go 1.22 still ignores the error from pthread_getattr_np and so it continues running as though the pthread request worked. The proposed patch in the original Go issue thread would cause an error to panic, but they haven't merged it yet.

I would also prefer to not merge this code if it isn't necessary, but I'm worried there is the possibility of some part of Go's runtime hitting an issue on production machines. Then again, as far as I can tell from looking into it, in practice things seem to just happen to work out.

pthread_getattr_np is used to determine the bounds of the stack (see src/runtime/cgo/gcc_stack_unix.c) for two purposes:

  1. To handle a possible case where CGo gets called on a thread where the Go runtime hasn't run (see src/runtime/cgocall.go).
  2. To figure out what stack size is needed for threads when the runtime spawns threads in CGo-enabled Go processes (see src/runtime/cgo/gcc_linux.c).

As far as I can tell, both uses just happen to work:

  1. We are never in a situation where callbackUpdateSystemStack will be called, but in addition callbackUpdateSystemStack assumes that _cgo_getstackbound can fail silently (as it does on Windows) and so it has some reasonable fallbacks that it uses if the returned stacks are NULL.
  2. pthread_create will pick reasonable defaults when called with an attribute struct that has a zero-sized stack size configured (see allocate_stack).

Now, it is entirely possible that some other aspect of Go could have issues with our clone setup. But I agree that unless they merge the PR where they will panic Go if pthread_getattr_np(pthread_self(), ...) returns an error, this patch is probably not necessary...

Comment on lines 458 to 475
/* #if TLS_DTV_AT_TP */
struct pthread__dtv_at_tp {
union {
struct {
int multiple_threads, gscope_flag;
} header;
void *__padding[24];
};
struct {
void *prev, *next;
} list;
pid_t tid; /* the field we are looking for! */
};

/* #if !TLS_DTV_AT_TP */
struct pthread__tcbhead {
union {
struct tcbhead_t header;
void *__padding[24];
};
struct {
void *prev, *next;
} list;
pid_t tid; /* the field we are looking for! */
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like TLS_DTV_AT_TP / TLS_TCB_AT_TP is arch-specific:

  • i386, x86_64, s390 and sparc have TLS_TCB_AT_TP
  • all other architectures have TLS_DTV_AT_TP
  • this has never changed as far as I can see from the git history

We can probably use this and only try one offset. Or do you think that it may change in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, do something like this:

// The following is a summary from glibc <sysdeps/.../ntpl/tls.h>.
#if defined(__x86_64__) || defined(__i386__) || defined(__s390__) || defined(__s390x__) || defined(__sparc__)
 #define TLS_TCB_AT_TP 1
 #define TLS_DTV_AT_TP 0
#else
 #define TLS_DTV_AT_TP 1
 #define TLS_TCB_AT_TP 0
#endif

#ifdef TLS_TCB_AT_TP
// XXX define struct tcbhead_t  as you did above.
// XXX No need to include powerpc version as it's not used.
#endif

// This is adopted from glibc <nptl/descr.h>.
struct _glibc_pthread {
	union {
	#if TLS_TCB_AT_TP // !TLS_DTV_AT_TP
		tcbhead_t header;
	#else
	struct {
		int multiple_threads;
		int gscope_flag;
	} header;
	#endif
	void *__padding[24];
	};

	struct {
		void *prev, *next;
	} list;
	pid_t tid; /* the field we are looking for! */
}

#endif
};

/* #if TLS_DTV_AT_TP */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a note this is taken from glibc's nptl/descr.h.

@kolyshkin
Copy link
Contributor

The only question remains, do we need this kludge for all Go versions of only for Go 1.22+. I tentatively vote for the latter, based on "don't fix if not broken" principle.

Can do this adding a go file like this:

//go:build go1.22

package nsenter

// #cgo CFLAGS: -DRUNC_GLIBC_TID_KLUDGE=1
import "C"

@kolyshkin
Copy link
Contributor

@cyphar shall we move this out of draft?

libcontainer/nsenter/nsexec.c Outdated Show resolved Hide resolved
else
write_log(DEBUG, "clone: find_glibc_tls_tid_address: PR_GET_TID_ADDRESS failed: %m");
write_log(WARNING,
"clone: find_glibc_tls_tid_address: falling back to scanning pthread_self(). Please use a kernel with CONFIG_CHECKPOINT_RESTORE=y.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to add a ci test for CONFIG_CHECKPOINT_RESTORE!=y?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to implement as for that we need a kernel with CONFIG_CHECKPOINT_RESTORE not set

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. Maybe we can use an env to skip prctl(PR_GET_TID_ADDRESS) when testing, for example:
950ff28

* cover every architecture without a huge risk of SIGSEGV.
*/
int i;
for (i = 0; i < 1024; i += sizeof(pid_t))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove this fail back? Because this may get wrong offset.

Since glibc 2.25, the thread-local cache of the current TID is no longer
updated in the child when calling clone(2). This results in very
unfortunate behaviour when Go does pthread calls using pthread_self(),
which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value.
Unfortunately (and unsurprisingly), the layout of "struct pthread" is
strictly private and could change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with
the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as
long as runc is using glibc, when "runc init" is spawned the child
process will have a pointer directly to the cached value we want to
change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later,
we can simply use prctl(PR_GET_TID_ADDRESS).

For older kernels we need to memory scan the TLS structure
(pthread_self() is a pointer to the head of the TLS structure). However,
to avoid false positives we first try known-correct offsets based on the
current structure layouts. If that fails, we scan the 1K block for any
fields that might match. When doing the scan, we assume that the first
field we find that contains the actual TID of the current process is the
field we want.

Obviously this is all very horrific, and if you are reading this in the
future, it almost certainly has caused some horrific bug that I did not
forsee. Sorry about that. As far as I can tell, there is no other
workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID
behaviour of glibc in some way. We cannot "just" do a re-exec after
clone(2) for security reasons.

Sadly, this is all glibc-specific. musl doesn't even allow you to use
CLONE_CHILD_CLEARTID (and they use a different address for the TID
anyway). We could do the memory scan and manually overwrite the address
after clone(2), but we can deal with that in the future if it turns out
people use non-glibc builds and need this fix.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@AkihiroSuda
Copy link
Member

Is this still WIP?

@cyphar cyphar marked this pull request as ready for review May 11, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

runc doesn't work with go1.22
6 participants